The general health of the system is the first thing to check when something unusual happens or after applying a change. This guidebook describes how to determine whether the system is healthy.
Metrics and Dashboard
Metric values and error thresholds can vary greatly across hardware specifications and data workload patterns. The general approach to diagnosis is to compare the key metrics week-over-week or day-over-day and look for significant differences.
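For example, assuming Prometheus is used as the metrics store, a week-over-week comparison of a worker's request rate (using the alluxio_data_access_bytes_count metric described in the next section) can be sketched with the offset modifier. A ratio close to 1 means the load is roughly in line with the same time last week:

irate(alluxio_data_access_bytes_count[5m]) / irate(alluxio_data_access_bytes_count[5m] offset 1w)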
Liveliness
For workers, the alluxio_data_access_bytes_count metric counts the read/write requests received by the worker. In Prometheus, we can query irate(alluxio_data_access_bytes_count[5m]) to calculate the requests per second (RPS) of the worker. The RPS should be steady. If the RPS rises sharply, the worker may be at risk of capacity problems.
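To track the total RPS of each worker rather than the individual per-label series, the rates can be summed by instance. This is a sketch assuming the same job="worker" and instance labels used in the cache hit rate queries below; adjust the labels to your Prometheus scrape configuration:

sum by (instance) (irate(alluxio_data_access_bytes_count{job="worker"}[5m]))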
UFS Data Flow
The alluxio_ufs_data_access metric records the read/write data flow from workers to the UFS.
The alluxio_ufs_error metric records the error codes of API calls for each type of UFS. If this metric increases, there is likely a problem with access to the UFS. You can use the error_code label to filter out expected errors. For example, a "no such file" response from S3 appears as a 404 error code. Note that the error code for the same type of error can vary between different types of UFS.
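As a sketch, the following query tracks the UFS error rate per worker while excluding the expected 404 ("no such file") responses mentioned above. Which codes are benign depends on the UFS type and the workload, so adjust the filter accordingly:

sum by (instance, error_code) (irate(alluxio_ufs_error{error_code!="404"}[5m]))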
Cache Hit Rate
For workers, the alluxio_cached_data_read_bytes_total and alluxio_ufs_data_access_bytes_total metrics can be combined to calculate the cache hit percentage. To calculate the cache hit rate per second, use the irate function in Prometheus, then use sum to aggregate away the unused labels.
The cache hit rate of a single worker will be:
sum by (instance) (irate(alluxio_cached_data_read_bytes_total{job="worker"}[5m]))
  / (sum by (instance) (irate(alluxio_cached_data_read_bytes_total{job="worker"}[5m]))
     + sum by (instance) (irate(alluxio_ufs_data_access_bytes_total{job="worker"}[5m])))
The overall cache hit rate is already integrated into the dashboard.
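A reasonable cluster-wide form of the same calculation, under the assumption that the dashboard panel simply aggregates across all workers, drops the by (instance) grouping from the per-worker expression above:

sum(irate(alluxio_cached_data_read_bytes_total{job="worker"}[5m]))
  / (sum(irate(alluxio_cached_data_read_bytes_total{job="worker"}[5m]))
     + sum(irate(alluxio_ufs_data_access_bytes_total{job="worker"}[5m])))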
Worker Readiness
# check the readiness of the workers in kubernetes
# ensure that the value for READY is 100%. a "Running" STATUS may not always mean the worker is healthy
$ kubectl get pod -l app.kubernetes.io/component=worker
NAME                              READY   STATUS    RESTARTS   AGE
alluxio-worker-59476bf8c5-lg4sc   1/1     Running   0          46h
alluxio-worker-59476bf8c5-vg6lc   1/1     Running   0          46h

# or use the following one-liner to get the readiness percentage
# if there are multiple alluxio clusters, use `app.kubernetes.io/instance=alluxio` to specify the cluster
$ kubectl get pod -l app.kubernetes.io/component=worker -o jsonpath='{range .items[*]}{.status.containerStatuses[0].ready}{"\n"}{end}' | awk 'BEGIN{t=0}{s+=1;if($1=="true")t+=1}END{print t,"ready /",s,"expected =",t/s*100,"%"}'
2 ready / 2 expected = 100 %
ETCD Readiness
# check the readiness of the etcd cluster
# use `app.kubernetes.io/instance=alluxio` to select the etcd cluster integrated with alluxio
$ kubectl get pod -l 'app.kubernetes.io/component=etcd,app.kubernetes.io/instance=alluxio'
NAME             READY   STATUS    RESTARTS   AGE
alluxio-etcd-0   1/1     Running   0          46h
alluxio-etcd-1   1/1     Running   0          46h
alluxio-etcd-2   1/1     Running   0          46h

# use the following one-liner to get the readiness percentage
$ kubectl get pod -l 'app.kubernetes.io/component=etcd,app.kubernetes.io/instance=alluxio' -o jsonpath='{range .items[*]}{.status.containerStatuses[0].ready}{"\n"}{end}' | awk 'BEGIN{t=0}{s+=1;if($1=="true")t+=1}END{print t,"ready /",s,"expected =",t/s*100,"%"}'
3 ready / 3 expected = 100 %

# check if the etcd can handle the normal I/O operation
# the output will be the registration of the alluxio workers. DO NOT change or write values in this way
$ kubectl exec -it alluxio-etcd-0 -c etcd -- bash -c 'ETCDCTL_API=3 etcdctl get --keys-only --prefix /ServiceDiscovery'
/ServiceDiscovery/default-alluxio/worker-1b4bad64-f195-46a8-be5f-27825a8100a4
/ServiceDiscovery/default-alluxio/worker-49ca0d6f-7a0a-482f-bb17-5550bd602a02
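Beyond pod readiness, etcdctl can also report whether each ETCD member considers itself healthy. This is a minimal sketch assuming the same pod and container names as above; add authentication flags if your ETCD deployment requires them:

# ask every member of the etcd cluster to report its health
$ kubectl exec -it alluxio-etcd-0 -c etcd -- bash -c 'ETCDCTL_API=3 etcdctl endpoint health --cluster'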
UFS Readiness
$ ./bin/alluxio exec ufsTest --path s3://your_bucket/test_path
Running test: createAtomicTest...
Passed the test! time: 5205ms
Running test: createEmptyTest...
Passed the test! time: 4076ms
Running test: createNoParentTest...
Passed the test! time: 0ms
Running test: createParentTest...
Passed the test! time: 4082ms
...
Running test: listStatusS3RootTest...
Passed the test! time: 543ms
Running test: createFileLessThanOnePartTest...
Passed the test! time: 3551ms
Running test: createAndAbortMultipartFileTest...
Passed the test! time: 6227ms
Tests completed with 0 failed.

# This example shows that the client will start two threads to write and then read a 512MB file to UFS, and print the test results
$ ./bin/alluxio exec ufsIOTest --path s3://test_bucket/test_path --io-size 512m --threads 2
{
  "readSpeedStat": {
    "mTotalDurationSeconds": 483.992,
    "mTotalSizeBytes": 1073741824,
    "mMaxSpeedMbps": 2.0614405926641703,
    "mMinSpeedMbps": 1.0578687251028942,
    "mAvgSpeedMbps": 1.5596546588835323,
    "mClusterAvgSpeedMbps": 2.1157374502057884,
    "mStdDev": 0.7096324729606261,
    "className": "alluxio.stress.worker.IOTaskSummary$SpeedStat"
  },
  "writeSpeedStat": {
    "mTotalDurationSeconds": 172.136,
    "mTotalSizeBytes": 1073741824,
    "mMaxSpeedMbps": 3.5236227246137433,
    "mMinSpeedMbps": 2.974392340939722,
    "mAvgSpeedMbps": 3.2490075327767327,
    "mClusterAvgSpeedMbps": 5.948784681879444,
    "mStdDev": 0.3883645287295896,
    "className": "alluxio.stress.worker.IOTaskSummary$SpeedStat"
  },
  "errors": [],
  "baseParameters": {
    "mCluster": false,
    "mClusterLimit": 0,
    "mClusterStartDelay": "10s",
    "mJavaOpts": [],
    "mProfileAgent": "",
    "mBenchTimeout": "20m",
    "mId": "local-task-0",
    "mIndex": "local-task-0",
    "mDistributed": false,
    "mStartMs": -1,
    "mInProcess": true,
    "mHelp": false
  },
  "points": [
    { "mMode": "WRITE", "mDurationSeconds": 145.305, "mDataSizeBytes": 536870912, "className": "alluxio.stress.worker.IOTaskResult$Point" },
    { "mMode": "WRITE", "mDurationSeconds": 172.136, "mDataSizeBytes": 536870912, "className": "alluxio.stress.worker.IOTaskResult$Point" },
    { "mMode": "READ", "mDurationSeconds": 248.37, "mDataSizeBytes": 536870912, "className": "alluxio.stress.worker.IOTaskResult$Point" },
    { "mMode": "READ", "mDurationSeconds": 483.992, "mDataSizeBytes": 536870912, "className": "alluxio.stress.worker.IOTaskResult$Point" }
  ],
  "parameters": {
    "className": "alluxio.stress.worker.UfsIOParameters",
    "mThreads": 2,
    "mDataSize": "512m",
    "mPath": "s3://test_bucket/test_path",
    "mUseUfsConf": false,
    "mConf": {}
  },
  "className": "alluxio.stress.worker.IOTaskSummary"
}
Logs
Alluxio Processes
# use `kubectl logs <pod-name>` to get all logs of a pod
# on bare metal, the logs are located in `logs/<component>.log`, e.g.: `logs/worker.log`
# use `grep` to filter the specific log level
# when an exception is caught, there will be additional information after the WARN/ERROR log,
# use `-A` to print more lines
$ kubectl logs alluxio-worker-59476bf8c5-lg4sc | grep -A 1 'WARN\|ERROR'
2024-07-04 17:29:53,499 ERROR HdfsUfsStatusIterator - Failed to list the path hdfs://localhost:9000/
java.net.ConnectException: Call From myhost/192.168.1.10 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused

# use `kubectl logs -p` to get logs from a previous failed container
$ kubectl logs -p alluxio-worker-59476bf8c5-lg4sc | grep -A 1 'WARN\|ERROR'
2024-07-05 14:22:28,290 [QuotaCoordinatorTimerThread] ERROR quota.QuotaCoordinator (QuotaCoordinator.java:lambda$new$0) - Failed to run quota janitor job
java.io.IOException: Failed to poll quota usage from worker WorkerInfo{id=7128818659502632567, identity=worker-7128818659502632567, address=WorkerNetAddress{host=sunbowen, containerHost=, rpcPort=64750, dataPort=64751, webPort=64753, domainSocketPath=, secureRpcPort=0, httpServerPort=0}, lastContactSec=283, state=LIVE, capacityBytes=1073741824, usedBytes=0, startTimeMs=1720160253072, capacityBytesOnTiers={MEM=1073741824}, usedBytesOnTiers={MEM=0}, version=3.x-7.0.0-SNAPSHOT, revision=fca83a4688187055d7abfd3a7d710b83a6b62ac6}
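To get a quick sense of how noisy a worker has been recently, the same filter can be combined with kubectl's --since flag and grep -c. A sketch using the worker pod from the example above:

# count the ERROR lines emitted by the worker in the last hour
$ kubectl logs alluxio-worker-59476bf8c5-lg4sc --since=1h | grep -c 'ERROR'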
Quick Recovery
ETCD Failure
Alluxio has implemented automatic emergency handling for ETCD failures. When ETCD fails, Alluxio will tolerate the failure for a period of time (usually 24 hours) so that normal I/O operations can continue. During this time, the administrator is expected to resolve the ETCD issue.
An ETCD failure can typically be resolved by restarting the ETCD pod. In special cases where the ETCD node cannot be restarted via Kubernetes, follow these steps to rebuild and restore the ETCD service:
Shut down the cluster: kubectl delete -f alluxio-cluster.yaml
Delete the original ETCD PVs and PVCs
Back up the original ETCD PV directories in case any data needs to be recovered later
On every physical machine, check the ETCD data folder of the ETCD PV and delete its contents
Create three new ETCD PVs. If the PVs are created manually by the administrator, it is recommended to add a nodeAffinity section to each PV so that each PV corresponds to one ETCD node and one physical machine (see the sketch after this list)
Restart the cluster: kubectl create -f alluxio-cluster.yaml
If the UFS mount points are not defined in UnderFileSystem resources, re-execute the alluxio mount add command to re-mount the UFS
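Below is a minimal sketch of a manually created ETCD PV with a nodeAffinity section, applied via a shell heredoc. The name, capacity, storage class, data path, and hostname are placeholders to adapt to your environment; create one such PV per ETCD replica, each pinned to a different physical machine:

$ kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: alluxio-etcd-pv-0            # placeholder name, one PV per etcd replica
spec:
  capacity:
    storage: 8Gi                     # match the size requested by the etcd PVC
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage    # placeholder, must match the storage class used by the etcd PVCs
  local:
    path: /mnt/alluxio/etcd          # the emptied etcd data folder on that machine
  nodeAffinity:                      # pin this PV to exactly one physical machine
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - etcd-node-1        # placeholder hostname
EOF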
Worker Failure
Alluxio can handle worker failures without interrupting I/O operations thanks to its built-in fault-tolerance mechanisms. In general, a small number of worker failures will not affect Alluxio's functionality.
When a worker fails, Kubernetes restarts the worker pod automatically without affecting the cached data.
Coordinator Failure
The coordinator handles load jobs and management services. It persists job metadata to the metastore, from which the metadata can be recovered upon restart. If the persisted data is corrupted, unfinished jobs are lost and need to be resubmitted.
When the coordinator fails, Kubernetes will restart the coordinator pod automatically.