Troubleshooting Alluxio

This guide provides a structured approach to troubleshooting your Alluxio cluster on Kubernetes. It covers everything from initial health checks to detailed diagnostics and recovery procedures for common issues.

1. Initial Health Checks

When you encounter an issue, start with these high-level checks to quickly assess the overall health of your Alluxio cluster and its dependencies.

Checking Component Status

Verify that all Alluxio and etcd pods are running and in a READY state. A Running status is not sufficient; the READY column should show that all containers in the pod are healthy.

# Check the readiness of Alluxio coordinator pods
$ kubectl -n alx-ns get pod -l app.kubernetes.io/component=coordinator

# Check the readiness of Alluxio worker pods
$ kubectl -n alx-ns get pod -l app.kubernetes.io/component=worker
NAME                                      READY   STATUS    RESTARTS   AGE
alluxio-cluster-worker-59476bf8c5-lg4sc   1/1     Running   0          46h
alluxio-cluster-worker-59476bf8c5-vg6lc   1/1     Running   0          46h

# Check the readiness of Alluxio FUSE pods (both DaemonSet and CSI)
$ kubectl -n alx-ns get pod -l 'app.kubernetes.io/component in (fuse, csi-fuse)'
NAME                                           READY   STATUS    RESTARTS   AGE
alluxio-cluster-fuse-acee53e8f0a9-3gjbrdekk0   1/1     Running   0          57m

# Check the readiness of the integrated etcd cluster
$ kubectl -n alx-ns get pod -l 'app.kubernetes.io/component=etcd,app.kubernetes.io/instance=alluxio-cluster'
NAME                     READY   STATUS    RESTARTS   AGE
alluxio-cluster-etcd-0   1/1     Running   0          46h
alluxio-cluster-etcd-1   1/1     Running   0          46h
alluxio-cluster-etcd-2   1/1     Running   0          46h

You can also use this one-liner to get a readiness percentage for a specific component:

# Example for workers:
$ kubectl -n alx-ns get pod -l app.kubernetes.io/component=worker -o jsonpath='{range .items[*]}{.status.containerStatuses[0].ready}{"\n"}{end}' | awk 'BEGIN{t=0}{s+=1;if($1=="true")t+=1}END{print t,"ready /",s,"expected =",t/s*100,"%"}'
2 ready / 2 expected = 100 %
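The same pipeline generalizes to any component. A hedged helper sketch (the namespace, labels, and component names are the examples used above; adjust to your deployment):

```shell
# Hypothetical helper: print a readiness summary for one Alluxio component.
readiness() {
  ns="$1"; component="$2"
  kubectl -n "$ns" get pod -l "app.kubernetes.io/component=$component" \
    -o jsonpath='{range .items[*]}{.status.containerStatuses[0].ready}{"\n"}{end}' \
    | awk -v c="$component" 'BEGIN{t=0}
        {s+=1; if($1=="true") t+=1}
        END{printf "%s: %d ready / %d expected = %.0f %%\n", c, t, s, s ? t/s*100 : 0}'
}

# Check each component in turn
for c in coordinator worker fuse etcd; do
  readiness alx-ns "$c"
done
```
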

Verifying UFS Connectivity

Ensure that Alluxio can communicate with the underlying storage system (UFS).

Run the ufsTest to check basic UFS operations:

$ ./bin/alluxio exec ufsTest --path s3://your_bucket/test_path
Running test: createAtomicTest...
Passed the test! time: 5205ms
...
Tests completed with 0 failed.

Run the ufsIOTest to check UFS read/write throughput:

# This example writes and reads a 512MB file with two threads
$ ./bin/alluxio exec ufsIOTest --path s3://test_bucket/test_path --io-size 512m --threads 2
{
  "readSpeedStat" : { ... },
  "writeSpeedStat" : { ... },
  "errors" : [ ],
  ...
}

A successful test with no errors indicates that the UFS is reachable and configured correctly.

Monitoring Key Metrics via Dashboard

The Grafana dashboard provides the quickest way to spot anomalies. Focus on these key areas:

  • Liveness: Look at the requests-per-second (RPS) for workers (irate(alluxio_data_access_bytes_count[5m])) and FUSE (alluxio_fuse_result). A sudden, unexpected spike or drop can indicate a problem.

  • UFS Data Flow: Monitor the alluxio_ufs_data_access and alluxio_ufs_error metrics. An increase in errors is a clear sign of UFS connectivity or permission issues.

  • Cache Hit Rate: A sudden drop in the overall cache hit rate can indicate that workers are unhealthy or that the data access pattern has changed unexpectedly.

2. Gathering Detailed Diagnostic Information

If initial health checks don't reveal the issue, you'll need to dig deeper by inspecting logs and collecting a full diagnostic snapshot.

Inspecting Logs

Alluxio Process Logs

Check the logs for specific error messages.

# Get all logs from a specific pod (e.g., a worker)
$ kubectl -n alx-ns logs alluxio-cluster-worker-59476bf8c5-lg4sc

# Filter for WARN or ERROR messages and show the line after the match
$ kubectl -n alx-ns logs alluxio-cluster-fuse-acee53e8f0a9-3gjbrdekk0 | grep -A 1 'WARN\|ERROR'
2024-07-04 17:29:53,499 ERROR HdfsUfsStatusIterator - Failed to list the path hdfs://localhost:9000/
java.net.ConnectException: Call From myhost/192.168.1.10 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

# Check logs from a previously failed container
$ kubectl -n alx-ns logs -p alluxio-cluster-worker-59476bf8c5-lg4sc
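To scan several pods at once, a hedged loop over one component (namespace and label selector are the examples above; `--since=1h` limits the window to recent activity):

```shell
# Count recent WARN/ERROR lines per worker pod (sketch; adjust labels/namespace).
NS=alx-ns
for pod in $(kubectl -n "$NS" get pod -l app.kubernetes.io/component=worker \
    -o jsonpath='{.items[*].metadata.name}'); do
  # grep -c exits non-zero when the count is 0, so guard with || true
  n=$(kubectl -n "$NS" logs --since=1h "$pod" | grep -c 'WARN\|ERROR' || true)
  echo "$pod: $n warnings/errors in the last hour"
done
```
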

Kubernetes CSI Driver Logs

If you suspect issues with FUSE pod mounting, check the logs from the Alluxio CSI node plugin running on the same Kubernetes node as your application pod.

# 1. Get the node name where your application or FUSE pod is running
$ PODNS=alx-ns POD=alluxio-cluster-fuse-acee53e8f0a9-3gjbrdekk0
$ NODE_NAME=$(kubectl get pod -o jsonpath='{.spec.nodeName}' -n ${PODNS} ${POD})

# 2. Find the Alluxio CSI node plugin pod on that node
$ CSI_POD_NAME=$(kubectl -n alluxio-operator get pod -l app.kubernetes.io/component=csi-nodeplugin --field-selector spec.nodeName=${NODE_NAME} -o jsonpath='{..metadata.name}')

# 3. Get the logs from the csi-nodeserver container
$ kubectl -n alluxio-operator logs -c csi-nodeserver ${CSI_POD_NAME}

Generating a Diagnostic Snapshot

For complex issues, the collectinfo tool gathers a comprehensive snapshot of your cluster's state, which is invaluable for offline analysis or for sharing with support.

The snapshot includes:

  • Configuration files

  • Hardware specifications of Kubernetes nodes

  • Data from etcd (mounts, quotas, etc.)

  • Logs from all Alluxio components

  • Metrics over a specified time range

  • Job service history

Prerequisites

Ensure the alluxio-collectinfo-controller is running in the operator's namespace. If it's not present, you may need to upgrade the Alluxio Operator.

$ kubectl -n alluxio-operator get pod -l app.kubernetes.io/component=alluxio-collectinfo-controller
NAME                                             READY   STATUS    RESTARTS   AGE
alluxio-collectinfo-controller-cc49c56b6-wlw8k   1/1     Running   0          19s

Collecting the Snapshot

By default, a CollectInfo resource is created with your cluster, performing a daily snapshot. You can also trigger a one-time collection or customize the schedule.

To trigger a one-time collection, create a YAML file (collect-now.yaml):

apiVersion: k8s-operator.alluxio.com/v1
kind: CollectInfo
metadata:
  name: one-time-snapshot
  namespace: alx-ns
spec:
  scheduled:
    enabled: false # A single run is triggered when enabled is false

Then apply it:

$ kubectl apply -f collect-now.yaml

Accessing and Downloading the Snapshot

The collected snapshots (as .tar.gz files) are stored in a volume mounted to the coordinator pod.

# 1. Get the coordinator pod name
$ COORDINATOR_NAME=$(kubectl -n alx-ns get pod -l app.kubernetes.io/component=coordinator -o jsonpath="{.items[0].metadata.name}")

# 2. List the snapshots inside the coordinator
$ kubectl -n alx-ns exec -it ${COORDINATOR_NAME} -- ls /mnt/alluxio/metastore/collectinfo/

# 3. Copy the snapshots to your local machine
$ kubectl -n alx-ns cp ${COORDINATOR_NAME}:/mnt/alluxio/metastore/collectinfo/ ./snapshots-output
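Each snapshot is a .tar.gz archive; after copying, you can unpack the newest one locally (directory name from the step above):

```shell
# Unpack the most recently created snapshot for inspection
cd ./snapshots-output
latest=$(ls -t ./*.tar.gz | head -n 1)
tar -xzf "$latest"
echo "extracted: $latest"
```
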

3. Common Issues and Recovery Procedures

Here are step-by-step guides for recovering from common component failures.

Coordinator Failure

The coordinator handles metadata operations. It persists its state to the metastore and can recover upon restart. Kubernetes will automatically restart a failed coordinator pod. If job history is corrupted, unfinished jobs may be lost and need to be resubmitted.

Worker Failure

Alluxio is designed to be resilient to worker failures. If a worker pod fails, Kubernetes will restart it automatically. Data stored in cache on that worker will be lost, but this will not cause I/O operations to fail (though it may temporarily decrease performance as data is re-fetched).

FUSE Failure

If a FUSE pod crashes or becomes unresponsive, it will be automatically restarted by its controller (either a DaemonSet or the CSI driver). If a FUSE pod is hung, you can force a restart:

# Manually delete the pod to trigger a restart
$ kubectl delete pod <fuse-pod-name>
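Before deleting, you can confirm the mount is actually hung rather than merely slow. A hedged probe: the pod name is the example from earlier, and the mount path /mnt/alluxio/fuse is an assumption, so substitute your configured mount point.

```shell
# Probe the FUSE mount with a timeout; delete the pod only if it is unresponsive.
# Pod name and mount path below are illustrative assumptions.
NS=alx-ns
FUSE_POD=alluxio-cluster-fuse-acee53e8f0a9-3gjbrdekk0
if ! kubectl -n "$NS" exec "$FUSE_POD" -- timeout 10 ls /mnt/alluxio/fuse >/dev/null 2>&1; then
  echo "mount unresponsive, restarting $FUSE_POD"
  kubectl -n "$NS" delete pod "$FUSE_POD"
fi
```
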

ETCD Failure

Alluxio has a grace period (typically 24 hours) to tolerate an etcd failure without disrupting I/O. If the integrated etcd cluster fails and cannot be recovered by a simple pod restart, you may need to rebuild it.

Warning: This is a destructive operation and should only be performed as a last resort.

  1. Shut down the Alluxio cluster: kubectl delete -f alluxio-cluster.yaml

  2. Delete the original etcd PVCs: kubectl -n alx-ns delete pvc -l app.kubernetes.io/component=etcd

  3. Clear etcd data on nodes: Manually log into each Kubernetes node that hosted an etcd pod and delete the contents of the host path directory used by the etcd PV.

  4. Recreate the cluster: kubectl create -f alluxio-cluster.yaml. The operator will provision a new, empty etcd cluster.

  5. Re-mount UFS paths: If you were not using the UnderFileSystem CRD to manage mounts, you will need to manually re-add them using alluxio fs mount.
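The steps above can be sketched as a single script with a dry-run guard (file names and namespace are the examples from this guide; the manual host-path cleanup in step 3 still has to happen between the PVC deletion and the recreate):

```shell
# DESTRUCTIVE etcd rebuild, wrapped in a dry-run guard for safety.
# With DRY_RUN=1 (the default here) the commands are only printed, not executed.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi; }

run kubectl delete -f alluxio-cluster.yaml
run kubectl -n alx-ns delete pvc -l app.kubernetes.io/component=etcd
echo "NOTE: clear the etcd host-path data on each node before continuing"
run kubectl create -f alluxio-cluster.yaml
```
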
