System Health Check & Quick Recovery

Health Check

The general health status of the system is the first thing to check when something unusual happens or after applying a change. This guide describes how to determine whether the system is healthy.

Metrics and Dashboard

Metric values and error thresholds vary greatly with different hardware specifications and data workflow patterns. A general diagnostic approach is to compare the key metrics week-over-week or day-over-day and look for significant differences.

Liveness

For workers, the alluxio_data_access_bytes_count metric counts the read/write requests received by the worker. In Prometheus, query irate(alluxio_data_access_bytes_count[5m]) to calculate the requests per second (RPS) of the worker. The RPS should be steady; if it increases rapidly, the worker may be at risk of capacity problems.
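
For example, a per-worker RPS view can be obtained by grouping on the instance label; this sketch assumes the same job="worker" label used by the cache hit rate queries below:

    sum by (instance) (irate(alluxio_data_access_bytes_count{job="worker"}[5m]))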

For FUSE processes, the alluxio_fuse_result metric can be used to calculate the RPS of each FUSE method. The alluxio_fuse_concurrency metric can also reflect the load of a FUSE process.
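
As a sketch, the per-method RPS can be derived by grouping alluxio_fuse_result on the label that carries the FUSE method name; the method label name here is an assumption and may differ in your version:

    sum by (method) (irate(alluxio_fuse_result[5m]))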

UFS Data Flow

The alluxio_ufs_data_access metric records the read/write data flow from workers to the UFS. It also records the data flow of FUSE processes if the UFS fallback feature is enabled.
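
For example, the overall UFS read/write throughput on workers (in bytes per second) can be queried with the alluxio_ufs_data_access_bytes_total counter used in the cache hit rate section below:

    sum(irate(alluxio_ufs_data_access_bytes_total{job="worker"}[5m]))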

The alluxio_ufs_error metric records the error codes of API access for each type of UFS. If this metric increases, something is wrong with access to the UFS. Use the error_code label to filter out expected errors; for example, a "no such file" error in S3 appears as a 404 error code. Note that the error code for the same type of error can vary between different types of UFS.
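
A minimal sketch for spotting unexpected UFS errors, using the error_code label described above to exclude the expected 404s from the S3 example:

    sum by (error_code) (increase(alluxio_ufs_error{error_code!="404"}[5m]))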

Cache Hit Rate

For workers, the alluxio_cached_data_read_bytes_total and alluxio_ufs_data_access_bytes_total metrics can be combined to calculate the cache hit ratio. To calculate the cache hit rate per second, use the irate function in Prometheus, then use sum to remove the unused labels.

  • The cache hit rate of a single worker will be:

    sum by (instance) (irate(alluxio_cached_data_read_bytes_total{job="worker"}[5m])) / (sum by (instance) (irate(alluxio_cached_data_read_bytes_total{job="worker"}[5m])) + sum by (instance) (irate(alluxio_ufs_data_access_bytes_total{job="worker"}[5m])))
  • The overall cache hit rate (already integrated in the dashboard):

    sum(irate(alluxio_cached_data_read_bytes_total{job="worker"}[5m])) / (sum(irate(alluxio_cached_data_read_bytes_total{job="worker"}[5m])) + sum(irate(alluxio_ufs_data_access_bytes_total{job="worker"}[5m])))

General Status

Alluxio Process Readiness

# check the readiness of the workers in kubernetes
# ensure that all pods show READY 1/1. a "Running" STATUS does not always mean the worker is healthy
$ kubectl get pod -l app.kubernetes.io/component=worker
NAME                              READY   STATUS    RESTARTS   AGE
alluxio-worker-59476bf8c5-lg4sc   1/1     Running   0          46h
alluxio-worker-59476bf8c5-vg6lc   1/1     Running   0          46h

# check the readiness of the fuse processes
# the selector will select both csi fuse and daemonSet fuse pods
$ kubectl get pod -l 'app.kubernetes.io/component in (fuse, csi-fuse)'
NAME                                                   READY   STATUS    RESTARTS   AGE
alluxio-fuse-ip-10-0-5-202.ec2.internal-4e4c24a40cba   1/1     Running   0          57m

# or use the following one-liner to get the readiness percentage
# if there are multiple alluxio clusters, use `app.kubernetes.io/instance=alluxio` to specify the cluster
$ kubectl get pod -l app.kubernetes.io/component=worker -o jsonpath='{range .items[*]}{.status.containerStatuses[0].ready}{"\n"}{end}' | awk 'BEGIN{t=0}{s+=1;if($1=="true")t+=1}END{print t,"ready /",s,"expected =",t/s*100,"%"}'
2 ready / 2 expected = 100 %
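
The same one-liner can be reused for the FUSE processes by swapping in the FUSE label selector shown above:

# readiness percentage of the fuse processes
$ kubectl get pod -l 'app.kubernetes.io/component in (fuse, csi-fuse)' -o jsonpath='{range .items[*]}{.status.containerStatuses[0].ready}{"\n"}{end}' | awk 'BEGIN{t=0}{s+=1;if($1=="true")t+=1}END{print t,"ready /",s,"expected =",t/s*100,"%"}'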

ETCD Readiness

# check the readiness of the etcd cluster
# use `app.kubernetes.io/instance=alluxio` to select the etcd cluster integrated with alluxio
$ kubectl get pod -l 'app.kubernetes.io/component=etcd,app.kubernetes.io/instance=alluxio'
NAME             READY   STATUS    RESTARTS   AGE
alluxio-etcd-0   1/1     Running   0          46h
alluxio-etcd-1   1/1     Running   0          46h
alluxio-etcd-2   1/1     Running   0          46h

# use the following one-liner to get the readiness percentage
$ kubectl get pod -l 'app.kubernetes.io/component=etcd,app.kubernetes.io/instance=alluxio' -o jsonpath='{range .items[*]}{.status.containerStatuses[0].ready}{"\n"}{end}' | awk 'BEGIN{t=0}{s+=1;if($1=="true")t+=1}END{print t,"ready /",s,"expected =",t/s*100,"%"}'
3 ready / 3 expected = 100 %

# check if etcd can handle normal I/O operations
# the output lists the registered alluxio workers. DO NOT modify or write values this way
$ kubectl exec -it alluxio-etcd-0 -c etcd -- bash -c 'ETCDCTL_API=3 etcdctl get --keys-only --prefix /ServiceDiscovery'
/ServiceDiscovery/default-alluxio/worker-1b4bad64-f195-46a8-be5f-27825a8100a4

/ServiceDiscovery/default-alluxio/worker-49ca0d6f-7a0a-482f-bb17-5550bd602a02
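
etcd's own health check can also be run from inside an etcd pod. A minimal sketch, reusing the pod and container names above (add any authentication flags your deployment requires):

# check the health of every etcd member in the cluster
$ kubectl exec -it alluxio-etcd-0 -c etcd -- bash -c 'ETCDCTL_API=3 etcdctl endpoint health --cluster'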

Logs

Alluxio Processes

# use `kubectl logs <pod-name>` to get all logs of a pod
# on bare metal, the logs are located in `logs/<component>.log`, e.g.: `logs/worker.log`

# use `grep` to filter the specific log level
# when an exception is caught, there will be additional information after the WARN/ERROR log,
# use `-A` to print more lines
$ kubectl logs alluxio-fuse-ip-10-0-5-202.ec2.internal-4e4c24a40cba | grep -A 1 'WARN\|ERROR'
2024-07-04 17:29:53,499 ERROR HdfsUfsStatusIterator - Failed to list the path hdfs://localhost:9000/
java.net.ConnectException: Call From myhost/192.168.1.10 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

# use `kubectl logs -p` to get logs from a previous failed container
$ kubectl logs -p alluxio-worker-59476bf8c5-lg4sc | grep -A 1 'WARN\|ERROR'
2024-07-05 14:22:28,290 [QuotaCoordinatorTimerThread] ERROR quota.QuotaCoordinator (QuotaCoordinator.java:lambda$new$0) - Failed to run quota janitor job
java.io.IOException: Failed to poll quota usage from worker WorkerInfo{id=7128818659502632567, identity=worker-7128818659502632567, address=WorkerNetAddress{host=sunbowen, containerHost=, rpcPort=64750, dataPort=64751, webPort=64753, domainSocketPath=, secureRpcPort=0, httpServerPort=0}, lastContactSec=283, state=LIVE, capacityBytes=1073741824, usedBytes=0, startTimeMs=1720160253072, capacityBytesOnTiers={MEM=1073741824}, usedBytesOnTiers={MEM=0}, version=3.x-7.0.0-SNAPSHOT, revision=fca83a4688187055d7abfd3a7d710b83a6b62ac6}
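
To get a quick overview across all workers, the same grep can be looped over the pods matched by the worker label (grep -c prints the number of matching log lines per pod):

# count the WARN/ERROR lines of every worker pod
$ for pod in $(kubectl get pod -l app.kubernetes.io/component=worker -o name); do echo "$pod: $(kubectl logs $pod | grep -c 'WARN\|ERROR')"; done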

Kubernetes CSI Driver

# check the logs from csi node
# the csi node plugin manages the fuse pods and mount points on its local node, so check the csi node pod running on the same node as the fuse pod / user's pod
# 1. get the node name of the fuse pod/user's pod
$ kubectl get pod alluxio-fuse-ip-10-0-5-202.ec2.internal-4e4c24a40cba -owide
NAME                                                   READY   STATUS    RESTARTS   AGE   IP          NODE                         NOMINATED NODE   READINESS GATES
alluxio-fuse-ip-10-0-5-202.ec2.internal-4e4c24a40cba   1/1     Running   0          66m   10.0.5.97   ip-10-0-5-202.ec2.internal   <none>           <none>

# 2. get the csi node with the node name
$ kubectl get pod -n alluxio-operator -l app.kubernetes.io/component=csi-nodeplugin --field-selector spec.nodeName=ip-10-0-5-202.ec2.internal
NAME                           READY   STATUS    RESTARTS   AGE
alluxio-csi-nodeplugin-5vvdg   2/2     Running   0          47h

# 3. get the logs from the csi node
$ kubectl logs alluxio-csi-nodeplugin-5vvdg -n alluxio-operator -c csi-nodeserver

# or use the following one-liner to get the logs of the csi node
# the `PODNS` and `POD` environment variables need to be set properly
$ PODNS=default POD=alluxio-fuse-ip-10-0-5-202.ec2.internal-4e4c24a40cba
$ kubectl logs -n alluxio-operator -c csi-nodeserver $(kubectl get pod -n alluxio-operator -l app.kubernetes.io/component=csi-nodeplugin --field-selector spec.nodeName=$(kubectl get pod -o jsonpath='{.spec.nodeName}' -n ${PODNS} ${POD}) -o jsonpath='{..metadata.name}')
...
{"level":"info","time":"2024-08-09T04:06:19.986Z","caller":"csi/driver.go:94","msg":"GRPC respond","response":{"capabilities":[{"Type":{"Rpc":{"type":1}}}]}}
{"level":"info","time":"2024-08-09T04:06:19.986Z","caller":"csi/driver.go:89","msg":"GRPC call","method":"/csi.v1.Node/NodeGetCapabilities","request":{}}

Quick Recovery

ETCD Failure

Alluxio implements automatic emergency handling for ETCD failures. When ETCD fails, Alluxio tolerates the failure for a period (usually 24 hours) so that normal I/O operations can continue. During this window, the administrator is expected to resolve the ETCD issue.

An ETCD failure can typically be resolved by restarting the ETCD pod. In special cases, if the ETCD node cannot be restarted via Kubernetes, follow these steps to rebuild and restore the ETCD service:

  1. Shut down the cluster: kubectl delete -f alluxio-cluster.yaml

  2. Delete the original ETCD PVs and PVCs

  3. Back up the original ETCD PV directories in case any data needs to be recovered later

  4. Check the ETCD data folder of the ETCD PV on all physical machines and delete its contents

  5. Create three new ETCD PVs. If the PVs are created manually by the administrator, it's recommended to add a nodeAffinity section to each PV so that each PV is bound to one ETCD node and one physical machine, as in the fragment below

apiVersion: v1
kind: PersistentVolume  # nodeAffinity is a field of PersistentVolume, not PersistentVolumeClaim
spec:
  # fragment showing only the nodeAffinity section; combine with the capacity,
  # accessModes, and storage source required by your environment
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - ip-10-0-4-212.ec2.internal
  6. Restart the cluster: kubectl create -f alluxio-cluster.yaml

  7. If the UFS mount points are not defined in UnderFileSystem resources, re-execute the alluxio mount add command to re-mount the UFS
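
A condensed sketch of the first few steps, assuming the file and label names used elsewhere in this guide; the PVC names and data paths below are hypothetical and must be adjusted to your environment:

# 1. shut down the cluster
$ kubectl delete -f alluxio-cluster.yaml
# 2. delete the original ETCD PVCs (and the PVs they were bound to); the PVC names here are hypothetical
$ kubectl delete pvc data-alluxio-etcd-0 data-alluxio-etcd-1 data-alluxio-etcd-2
# 3/4. on each physical machine, back up and then clear the ETCD data directory of the PV (hypothetical path)
$ cp -a /mnt/etcd-pv/data /mnt/etcd-pv/data.bak && rm -rf /mnt/etcd-pv/data/*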

Worker Failure

Alluxio's optimization features allow it to handle worker failures without interrupting I/O operations. Generally, a small number of worker failures will not affect Alluxio's functionality.

When a Worker fails, Kubernetes will restart the worker pod automatically without affecting the cached data.
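
To confirm the automatic recovery, the worker pods can be watched until they return to the ready state, reusing the worker label selector from the readiness checks above:

# watch the worker pods; RESTARTS will increase and READY should return to 1/1
$ kubectl get pod -l app.kubernetes.io/component=worker -w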

FUSE Failure

If the Fuse container crashes, the internal mechanisms of the cluster will automatically restart and recover it. The mount points within the container will also be automatically restored once the Fuse container starts up again.

If the Fuse container becomes unresponsive for any reason and you wish to restore it to a normal state, you can directly execute kubectl delete pod alluxio-fuse-xxx to delete the Fuse container and wait for the automatic recovery mechanism to complete its work.
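
For example, reusing the FUSE pod name from the log examples above (substitute the name of the stuck pod):

# delete the unresponsive fuse pod and watch it get recreated
$ kubectl delete pod alluxio-fuse-ip-10-0-5-202.ec2.internal-4e4c24a40cba
$ kubectl get pod -l 'app.kubernetes.io/component in (fuse, csi-fuse)' -w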

Coordinator Failure

The coordinator handles the load job and management services. It persists job metadata to the metastore, which can be recovered upon restart. If the persisted data gets corrupted, the unfinished jobs are lost and need to be resubmitted.

When the coordinator fails, Kubernetes will restart the coordinator pod automatically.
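
The coordinator's readiness can be checked the same way as the other components; the component label value below is an assumption based on the labeling convention of the workers and FUSE pods:

# check that the coordinator pod is ready; the label value is assumed
$ kubectl get pod -l app.kubernetes.io/component=coordinator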
