Troubleshooting Alluxio
1. Initial Health Checks
Checking Component Status
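Each check in this section prints a pod table that has to be read by eye. As a minimal sketch, the filter below reads that table from standard input and prints only the pods that look unhealthy. The function name `unhealthy_pods` is hypothetical and not part of Alluxio or kubectl. It assumes the default `kubectl get pod` table layout with a header row, where READY is the second column and STATUS the third.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list pods whose READY count is incomplete
# or whose STATUS is anything other than Running.
# Input: default `kubectl get pod` table output (header row included) on stdin.
# Output: one unhealthy pod name per line; no output means all pods passed.
unhealthy_pods() {
  awk 'NR > 1 {
         split($2, ready, "/")   # READY column, e.g. "1/1"
         if (ready[1] != ready[2] || $3 != "Running") print $1
       }'
}

# Example (needs cluster access; namespace and label come from the worker check):
#   kubectl -n alx-ns get pod -l app.kubernetes.io/component=worker | unhealthy_pods
```

Because the first line is skipped as a header, do not combine this with `--no-headers`. Pods in a terminal state such as `Completed` are also reported, which may or may not be what you want.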
# Check the readiness of Alluxio coordinator pods
$ kubectl -n alx-ns get pod -l app.kubernetes.io/component=coordinator
# Check the readiness of Alluxio worker pods
$ kubectl -n alx-ns get pod -l app.kubernetes.io/component=worker
NAME READY STATUS RESTARTS AGE
alluxio-cluster-worker-59476bf8c5-lg4sc 1/1 Running 0 46h
alluxio-cluster-worker-59476bf8c5-vg6lc 1/1 Running 0 46h
# Check the readiness of Alluxio FUSE pods (both DaemonSet and CSI)
$ kubectl -n alx-ns get pod -l 'app.kubernetes.io/component in (fuse, csi-fuse)'
NAME READY STATUS RESTARTS AGE
alluxio-cluster-fuse-acee53e8f0a9-3gjbrdekk0 1/1 Running 0 57m
# Check the readiness of the integrated etcd cluster
$ kubectl -n alx-ns get pod -l 'app.kubernetes.io/component=etcd,app.kubernetes.io/instance=alluxio-cluster'
NAME READY STATUS RESTARTS AGE
alluxio-cluster-etcd-0 1/1 Running 0 46h
alluxio-cluster-etcd-1 1/1 Running 0 46h
alluxio-cluster-etcd-2 1/1 Running 0 46h
Verifying UFS Connectivity
Monitoring Key Metrics via Dashboard
2. Gathering Detailed Diagnostic Information
Inspecting Logs
Alluxio Process Logs
Kubernetes CSI Driver Logs
Generating a Diagnostic Snapshot
Prerequisites
Collecting the Snapshot
Accessing and Downloading the Snapshot
3. Common Issues and Recovery Procedures
Coordinator Failure
Worker Failure
FUSE Failure
etcd Failure