First, confirm that the operator has started successfully and that the collectinfo controller is running. The output below lists the operator's pods, including a running collectinfo controller. If the collectinfo controller is missing, the installed operator version does not support the collectinfo feature, and you need to upgrade the operator.
kubectl get pod -n alluxio-operator
NAME                                             READY   STATUS    RESTARTS   AGE
alluxio-cluster-controller-8656d54bc-x6ms6       1/1     Running   0          19s
alluxio-collectinfo-controller-cc49c56b6-wlw8k   1/1     Running   0          19s
alluxio-csi-controller-84df9646fd-4d5b8          2/2     Running   0          19s
alluxio-csi-nodeplugin-fcp7b                     2/2     Running   0          19s
alluxio-csi-nodeplugin-t59ch                     2/2     Running   0          19s
alluxio-csi-nodeplugin-vbq2q                     2/2     Running   0          19s
alluxio-ufs-controller-57fbdf8d5c-2f79l          1/1     Running   0          19s
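The check above can also be scripted. The following is a minimal sketch, not part of the operator tooling: the function name collectinfo_controller_running is hypothetical, and it assumes the pod name keeps the alluxio-collectinfo-controller- prefix shown in the listing.

```shell
# Reads `kubectl get pod -n alluxio-operator` output on stdin.
# Succeeds only if a pod named alluxio-collectinfo-controller-* has STATUS Running.
# Assumption: the pod name prefix matches the one in the listing above.
collectinfo_controller_running() {
  awk '$1 ~ /^alluxio-collectinfo-controller-/ && $3 == "Running" { found = 1 }
       END { exit !found }'
}

# Usage against a live cluster:
#   kubectl get pod -n alluxio-operator | collectinfo_controller_running \
#     || echo "collectinfo controller is not running; consider upgrading the operator"
```

The function only parses text, so it can be tried on saved output before running it against a cluster.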
Next, confirm that the Alluxio cluster has started successfully. This example assumes the cluster runs in the default namespace. The output below shows all components of the Alluxio cluster running.
kubectl get pod
NAME                                          READY   STATUS    RESTARTS   AGE
alluxio-coordinator-0                         1/1     Running   0          2m17s
alluxio-etcd-0                                1/1     Running   0          2m17s
alluxio-etcd-1                                1/1     Running   0          2m17s
alluxio-etcd-2                                1/1     Running   0          2m17s
alluxio-monitor-grafana-9fd587b4f-mnczs       1/1     Running   0          2m17s
alluxio-monitor-prometheus-6b55c568b8-sfp96   1/1     Running   0          2m17s
alluxio-worker-779d87567f-95wls               1/1     Running   0          2m17s
alluxio-worker-779d87567f-sgh4b               1/1     Running   0          2m17s
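To spot a component that is not yet up without reading the whole table, one option is to filter the listing. This is a rough sketch: the helper name pods_not_ready is hypothetical, and it treats any pod that is not Running with all containers ready as "not ready".

```shell
# Reads `kubectl get pod` output on stdin and prints the name of every pod
# that is not Running or whose READY count is incomplete (for example 0/1).
# Prints nothing when every pod is up.
pods_not_ready() {
  awk 'NR > 1 {
         split($2, ready, "/")
         if ($3 != "Running" || ready[1] != ready[2]) print $1
       }'
}

# Usage against a live cluster (default namespace, as assumed above):
#   kubectl get pod | pods_not_ready
```

Empty output corresponds to the healthy state shown in the listing above.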
Collecting Information
Create a simple YAML file to collect information using default values (for a complete configuration, refer to Detailed Configuration).
Assuming the Alluxio cluster is in the default namespace, create collectinfo.yaml with the following contents.
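A minimal sketch of such a file, written with a shell here-document into the current directory. The fields are taken from the full example under Detailed Configuration; the nesting of namespace under spec.alluxio and the assumption that every omitted field falls back to its default are inferred from that example rather than confirmed here.

```shell
# Write a minimal collectinfo.yaml into the current directory.
# Assumption: omitted fields (type, backoffLimit, logs, metrics, ...) use defaults.
cat > collectinfo.yaml <<'EOF'
apiVersion: k8s-operator.alluxio.com/v1
kind: CollectInfo
metadata:
  name: example-collectinfo
spec:
  alluxio:
    # The namespace where the Alluxio cluster is located
    namespace: "default"
EOF
```

If the cluster runs in a different namespace, only the namespace value should need to change.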
Apply the file to create the collectinfo and start collecting information.
kubectl apply -f collectinfo.yaml
You can check the progress of the collection by viewing the status of the collectinfo. In the output below, all five types of information were collected successfully.
kubectl get collectinfo
NAME                  COMPLETED   FAILED   STATE       AGE
example-collectinfo   5/5         0/5      Completed   6m16s
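Instead of re-running the command by hand, the status can be polled. The sketch below assumes that Completed and Failed, the only two states shown in this guide, are the terminal states; the function names, the 5-second interval, and the default of 60 attempts are arbitrary choices for illustration.

```shell
# Reads `kubectl get collectinfo` output on stdin and prints the STATE
# column of the collectinfo with the given name.
collectinfo_state() {
  awk -v name="$1" '$1 == name { print $4 }'
}

# Polls until the collectinfo reports Completed (returns 0) or Failed (returns 1).
# Returns 2 if neither state is seen within the given number of attempts.
# Assumption: Completed and Failed are the only terminal states.
wait_for_collectinfo() {
  name="$1"
  attempts="${2:-60}"
  while [ "$attempts" -gt 0 ]; do
    state=$(kubectl get collectinfo | collectinfo_state "$name")
    case "$state" in
      Completed) return 0 ;;
      Failed) return 1 ;;
    esac
    attempts=$((attempts - 1))
    sleep 5
  done
  return 2
}

# Usage:
#   wait_for_collectinfo example-collectinfo && echo "collection finished"
```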
The collectinfo creates one job per information type in the alluxio-operator namespace. By default, all information is collected, so there are five jobs: config, hardware, license, logs, and metrics.
kubectl get job -n alluxio-operator
NAME                               COMPLETIONS   DURATION   AGE
example-collectinfo-config-job     1/1           4s         4m10s
example-collectinfo-hardware-job   1/1           5s         4m10s
example-collectinfo-license-job    1/1           10s        4m10s
example-collectinfo-logs-job       1/1           5s         4m10s
example-collectinfo-metrics-job    1/1           4s         4m10s
Collection Failures
The output below shows a failed collection in which four of the five types of information could not be collected.
kubectl get collectinfo
NAME                  COMPLETED   FAILED   STATE       AGE
example-collectinfo   1/5         4/5      Failed      52s
Check the jobs created by the collectinfo. Only the hardware job succeeded; the other four jobs failed.
kubectl get job -n alluxio-operator
NAME                               COMPLETIONS   DURATION   AGE
example-collectinfo-config-job     0/1           4m18s      4m18s
example-collectinfo-hardware-job   1/1           5s         4m18s
example-collectinfo-license-job    0/1           4m18s      4m18s
example-collectinfo-logs-job       0/1           4m18s      4m18s
example-collectinfo-metrics-job    0/1           4m18s      4m18s
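To narrow a listing like this down to the jobs that need attention, the output can be filtered by the COMPLETIONS column. This is a sketch only: the helper name incomplete_jobs is hypothetical, and it assumes job names start with the collectinfo name followed by a hyphen, as in the listing.

```shell
# Reads `kubectl get job -n alluxio-operator` output on stdin and prints the
# jobs of the given collectinfo whose COMPLETIONS are incomplete (for example 0/1).
# Assumption: job names are prefixed with "<collectinfo name>-".
incomplete_jobs() {
  awk -v prefix="$1-" 'NR > 1 && index($1, prefix) == 1 {
         split($2, done, "/")
         if (done[1] != done[2]) print $1
       }'
}

# Usage: show the last lines of each incomplete job's log.
#   kubectl get job -n alluxio-operator | incomplete_jobs example-collectinfo |
#     while read -r job; do
#       kubectl logs -n alluxio-operator "job/$job" --tail=20
#     done
```

A job that is still running also shows 0/1, so the filter is most useful once the collectinfo has reached the Failed state.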
You can download the collection results whether the collectinfo succeeded or failed.
If any job failed, the results include an error.log file to help with debugging.
Downloading Results
There are two ways to download the collected results: kubectl cp and kubectl port-forward.
The results contain the following types of information:
config: The configuration files in Alluxio's conf/ directory, such as alluxio-site.properties and alluxio-env.sh.
hardware: CPU and memory details for each Kubernetes node, and hardware specifications for the coordinator, worker, fuse, and operator components.
license: The license information of the Alluxio cluster, including the type, productionId, and licenseVersion, as well as the vCPU, memory, and storage currently in use.
logs: Logs from the coordinator, worker, fuse, etcd, and operator components. Supports tailing logs to collect a specified number of lines from the end.
metrics: All metrics of the cluster. The duration and step settings define the time range and the sampling interval.
kubectl cp
Use kubectl cp to copy the collected information to your local machine.
# Store the pod name of the collectinfo controller in a shell variable
COLLECTINFO_CONTROLLER_NAME=$(kubectl get pod -n alluxio-operator -l app.kubernetes.io/component=collectinfo-controller -o jsonpath="{.items[0].metadata.name}")
# Replace <COLLECTINFO_NAME> with the name of your collectinfo, for example example-collectinfo
kubectl cp alluxio-operator/${COLLECTINFO_CONTROLLER_NAME}:/tmp/output/<COLLECTINFO_NAME> output -n alluxio-operator
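After the copy finishes, a quick way to see whether any collection step reported problems is to look for error.log in the downloaded directory. The sketch below makes no assumption about where error.log sits inside the results and simply searches the whole tree; the function name find_error_logs is hypothetical.

```shell
# Prints the path of every error.log under the given directory
# (default: output, the destination used in the kubectl cp command above).
# Succeeds only if at least one error.log was found.
find_error_logs() {
  dir="${1:-output}"
  found=$(find "$dir" -type f -name error.log)
  [ -n "$found" ] && printf '%s\n' "$found"
}

# Usage:
#   find_error_logs output || echo "no error.log found"
```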
kubectl port-forward
Use kubectl port-forward to map port 80 of the collectinfo controller to port 28080 on your local machine.
# Store the pod name of the collectinfo controller in a shell variable
COLLECTINFO_CONTROLLER_NAME=$(kubectl get pod -n alluxio-operator -l app.kubernetes.io/component=collectinfo-controller -o jsonpath="{.items[0].metadata.name}")
kubectl port-forward -n alluxio-operator ${COLLECTINFO_CONTROLLER_NAME} 28080:80
Detailed Configuration
apiVersion: k8s-operator.alluxio.com/v1
kind: CollectInfo
metadata:
  name: example-collectinfo
spec:
  alluxio:
    # The namespace where the Alluxio cluster is located
    namespace: "default"
  # Types of information to collect: config, hardware, license, logs, metrics
  # If not specified or set to "all", all information is collected
  # To specify multiple types:
  # type:
  #   - config
  #   - hardware
  type:
    - all
  # The number of retries. If a collection job fails, it is retried up to this many times
  backoffLimit: 2
  logs:
    # The number of log lines to collect, e.g., 100 collects the last 100 lines of each log
    tail: 100
  # Metrics settings: "duration" is the time range to collect, and "step" is the sampling interval
  # The example below collects all metrics from the past two hours at one-minute intervals
  metrics:
    # The time range of metrics to collect, e.g., 2h collects metrics from the past two hours
    duration: 2h
    # The sampling interval, e.g., 1m collects one sample per minute
    step: 1m
  # The image used for executing the collection task, defaulting to the Alluxio operator's image
  # Can be left unspecified to use the default value
  image: "<ALLUXIO_OPERATOR_IMAGE>"
  imagePullPolicy: "Always"
  # Resource requests and limits for the collection jobs
  resources:
    requests:
      memory: "512Mi"
      cpu: "250m"
    limits:
      memory: "1Gi"
      cpu: "500m"
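When choosing duration and step, it can help to estimate how many samples each metric series will contain: roughly the duration divided by the step. The helper below is an illustrative sketch, not part of the operator; it assumes each value is a single integer followed by one unit (s, m, h, or d), and the exact sample count may differ by one depending on how the range endpoints are handled.

```shell
# Converts a duration such as 2h, 1m, or 30s to seconds.
# Assumption: a single integer followed by one unit (s, m, h, or d).
duration_to_seconds() {
  value="${1%?}"          # everything except the last character, e.g. "2"
  unit="${1#"$value"}"    # the last character, e.g. "h"
  case "$unit" in
    s) echo "$value" ;;
    m) echo $((value * 60)) ;;
    h) echo $((value * 3600)) ;;
    d) echo $((value * 86400)) ;;
    *) echo "unsupported duration: $1" >&2; return 1 ;;
  esac
}

# Approximate number of samples per metric series for a duration and a step.
samples_per_series() {
  echo $(( $(duration_to_seconds "$1") / $(duration_to_seconds "$2") ))
}

# Usage with the values from the example configuration:
#   samples_per_series 2h 1m
```

A smaller step or a longer duration increases the amount of data collected proportionally.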