# Troubleshooting

This guide provides a structured approach to troubleshooting your Alluxio cluster on Kubernetes. It covers everything from initial health checks to detailed diagnostics and recovery procedures for common issues.

## 1. Initial Health Checks

When you encounter an issue, start with these high-level checks to quickly assess the overall health of your Alluxio cluster and its dependencies.

### Checking Component Status

Verify that all Alluxio and etcd pods are running and in a `READY` state. A `Running` status is not sufficient; the `READY` column should show that all containers in the pod are healthy.

Check the readiness of Alluxio coordinator pods:

```shell
kubectl -n alx-ns get pod -l app.kubernetes.io/component=coordinator
```

Check the readiness of Alluxio worker pods:

```shell
kubectl -n alx-ns get pod -l app.kubernetes.io/component=worker
```

```console
NAME                                      READY   STATUS    RESTARTS   AGE
alluxio-cluster-worker-59476bf8c5-lg4sc   1/1     Running   0          46h
alluxio-cluster-worker-59476bf8c5-vg6lc   1/1     Running   0          46h
```

Check the readiness of Alluxio FUSE pods (both DaemonSet and CSI):

```shell
kubectl -n alx-ns get pod -l 'app.kubernetes.io/component in (fuse, csi-fuse)'
```

```console
NAME                                           READY   STATUS    RESTARTS   AGE
alluxio-cluster-fuse-acee53e8f0a9-3gjbrdekk0   1/1     Running   0          57m
```

Check the readiness of the integrated etcd cluster:

```shell
kubectl -n alx-ns get pod -l 'app.kubernetes.io/component=etcd,app.kubernetes.io/instance=alluxio-cluster'
```

```console
NAME                     READY   STATUS    RESTARTS   AGE
alluxio-cluster-etcd-0   1/1     Running   0          46h
alluxio-cluster-etcd-1   1/1     Running   0          46h
alluxio-cluster-etcd-2   1/1     Running   0          46h
```

You can also use this one-liner to get a readiness percentage for a specific component, shown here for workers:

```shell
kubectl -n alx-ns get pod -l app.kubernetes.io/component=worker -o jsonpath='{range .items[*]}{.status.containerStatuses[0].ready}{"\n"}{end}' | awk 'BEGIN{t=0}{s+=1;if($1=="true")t+=1}END{print t,"ready /",s,"expected =",t/s*100,"%"}'
```

```console
2 ready / 2 expected = 100 %
```

### Verifying UFS Connectivity

Ensure that Alluxio can communicate with the underlying storage system (UFS).

Run the `ufsTest` to check basic UFS operations:

```shell
./bin/alluxio exec ufsTest --path s3://your_bucket/test_path
```

```console
Running test: createAtomicTest...
Passed the test! time: 5205ms
...
Tests completed with 0 failed.
```

Run the `ufsIOTest` to check UFS read/write throughput:

This example writes and reads a 512MB file with two threads:

```shell
./bin/alluxio exec ufsIOTest --path s3://test_bucket/test_path --io-size 512m --threads 2
```

```console
{
  "readSpeedStat" : { ... },
  "writeSpeedStat" : { ... },
  "errors" : [ ],
  ...
}
```

A successful test with no errors indicates that the UFS is reachable and configured correctly.

### Monitoring Key Metrics via Dashboard

The Grafana dashboard provides the quickest way to spot anomalies. Focus on these key areas:

* **Liveness**: Look at the requests-per-second (RPS) for workers (`irate(alluxio_data_access_bytes_count[5m])`) and FUSE (`alluxio_fuse_result`). A sudden, unexpected spike or drop can indicate a problem.
* **UFS Data Flow**: Monitor the `alluxio_ufs_data_access` and `alluxio_ufs_error` metrics. An increase in errors is a clear sign of UFS connectivity or permission issues.
* **Cache Hit Rate**: A sudden drop in the overall cache hit rate can indicate that workers are unhealthy or that the data access pattern has changed unexpectedly.

## 2. Gathering Detailed Diagnostic Information

If initial health checks don't reveal the issue, you'll need to dig deeper by inspecting logs and collecting a full diagnostic snapshot.

### Inspecting Logs

#### Alluxio Process Logs

Check the logs for specific error messages.

Get all logs from a specific pod (e.g., a worker):

```shell
kubectl -n alx-ns logs alluxio-cluster-worker-59476bf8c5-lg4sc
```

Filter for WARN or ERROR messages and show the line after the match:

```shell
kubectl -n alx-ns logs alluxio-cluster-fuse-acee53e8f0a9-3gjbrdekk0 | grep -A 1 'WARN\|ERROR'
```

```console
2024-07-04 17:29:53,499 ERROR HdfsUfsStatusIterator - Failed to list the path hdfs://localhost:9000/
java.net.ConnectException: Call From myhost/192.168.1.10 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
```
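The `-A 1` flag prints the line after each match, which is what captures the exception line that follows an `ERROR` entry. Its effect can be seen on a few fabricated log lines:

```shell
# Three sample log lines; only the ERROR line and the line after it survive the filter
printf '%s\n' \
  '2024-07-04 17:29:52,100 INFO  Worker - heartbeat ok' \
  '2024-07-04 17:29:53,499 ERROR HdfsUfsStatusIterator - Failed to list the path' \
  'java.net.ConnectException: Connection refused' | \
  grep -A 1 'WARN\|ERROR'
```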

Check logs from a previously failed container:

```shell
kubectl -n alx-ns logs -p alluxio-cluster-worker-59476bf8c5-lg4sc
```

#### Kubernetes CSI Driver Logs

If you suspect issues with FUSE pod mounting, check the logs from the Alluxio CSI node plugin running on the same Kubernetes node as your application pod.

```shell
# 1. Get the node name where your application or FUSE pod is running
PODNS=alx-ns POD=alluxio-cluster-fuse-acee53e8f0a9-3gjbrdekk0
NODE_NAME=$(kubectl get pod -o jsonpath='{.spec.nodeName}' -n ${PODNS} ${POD})

# 2. Find the Alluxio CSI node plugin pod on that node
CSI_POD_NAME=$(kubectl -n alluxio-operator get pod -l app.kubernetes.io/component=csi-nodeplugin --field-selector spec.nodeName=${NODE_NAME} -o jsonpath='{..metadata.name}')

# 3. Get the logs from the csi-nodeserver container
kubectl -n alluxio-operator logs -c csi-nodeserver ${CSI_POD_NAME}
```

### Generating a Diagnostic Snapshot

For complex issues, the `doctor` tool gathers a comprehensive snapshot of your cluster's state, which is invaluable for offline analysis or for sharing with support.

The snapshot includes:

* Configuration files
* Hardware specifications of Kubernetes nodes
* Data from etcd (mounts, quotas, etc.)
* Logs from all Alluxio components
* Metrics over a specified time range
* Job service history
* Meta information

#### Prerequisites

Ensure the `alluxio-doctor-controller` is running in the operator's namespace. If it's not present, you may need to upgrade the Alluxio Operator.

```shell
kubectl -n alluxio-operator get pod -l app.kubernetes.io/component=doctor-controller
```

```console
NAME                                             READY   STATUS    RESTARTS   AGE
alluxio-doctor-controller-cc49c56b6-wlw8k        1/1     Running   0          19s
```

#### Collecting the Snapshot

By default, a `Doctor` resource is created alongside your cluster and performs a daily snapshot. You can also trigger a one-time collection or customize the schedule.

To trigger a one-time collection, create a YAML file (`collect-now.yaml`):

```yaml
apiVersion: k8s-operator.alluxio.com/v1
kind: CollectInfo
metadata:
  name: one-time-snapshot
  namespace: alx-ns
spec:
  scheduled:
    enabled: false # A single run is triggered when enabled is false
```

Then apply it:

```shell
kubectl apply -f collect-now.yaml
```

#### Accessing and Downloading the Snapshot

The collected snapshots (as `.tar.gz` files) are stored in a volume mounted to the doctor-controller pod.

```shell
# 1. Get the doctor controller pod name
DOCTOR_NAME=$(kubectl -n alluxio-operator get pod -l app.kubernetes.io/component=doctor-controller -o jsonpath="{.items[0].metadata.name}")

# 2. List the snapshots inside the doctor controller
kubectl -n alluxio-operator exec -it ${DOCTOR_NAME} -- ls /data/doctor

# 3. Copy the snapshots to your local machine
kubectl -n alluxio-operator cp ${DOCTOR_NAME}:/data/doctor ./doctor
```

#### Snapshot Upload

By default, the diagnostic package collected by Doctor is stored only in the storage attached to the `doctor-controller` within the cluster. An optional feature can additionally upload these diagnostic results to a dedicated S3 bucket maintained and analyzed by Alluxio.

You can enable this feature if you would like the Alluxio team to assist with analyzing your cluster's health.

How to enable:

1. Contact the Alluxio support team to request activation of this feature.
2. We will provide you with a dedicated `awsKey` and `awsSecret`.
3. Configure these credentials in the `spec.upload` field.

Once configured, the collected results are uploaded securely. The Alluxio team can then access these reports to help with data analysis, issue tracking, and proactive prevention of potential cluster problems.

#### Detailed Configuration

```yaml
apiVersion: k8s-operator.alluxio.com/v1
kind: CollectInfo
metadata:
  name: example-doctor
  # Must be in the same namespace as the Alluxio cluster
  namespace: alx-ns
spec:
  # Configure scheduled collection
  scheduled:
    # false: run once immediately; true: enable scheduled execution
    enabled: false
    # Only effective when enabled: true
    cron: "0 0 * * *"
    # Only effective when enabled: true
    timeZone: "Asia/Shanghai"
    # Only effective when enabled: true, retention period for collected results
    expiration: "720h"

  # Information types to collect
  type:
    - all

  # Log collection configuration
  logs:
    # Collect logs from the past 1 hour (3600 seconds)
    # Note: If left empty, defaults to collecting logs from the past 1 day (86400 seconds)
    sinceSeconds: 3600
    # tail: 1000 # Alternatively, use tail to collect the last 1000 lines
    # sinceTime: "2025-11-12T06:00:00Z" # Alternatively, use sinceTime to collect logs after a specific time point

  # Metrics collection configuration
  metrics:
    # Collect metrics from the past 24 hours
    duration: 24h
    # Sampling interval of 5 minutes
    step: 5m

  # (Optional) Upload configuration
  # upload:
  #   account: test
  #   productionId: xxx # Optional
  #   awsKey: <alluxio-provided-key>
  #   awsSecret: <alluxio-provided-secret>
```

#### Field Details

`spec.scheduled`

Used to configure scheduled collection tasks.

* `enabled` (boolean):
  * `false` (default): Runs the collection once immediately after the CRD is applied (`kubectl apply`).
  * `true`: Enables scheduled collection, which will run cyclically according to the `cron` expression.
* `cron` (string): Only effective when `enabled: true`. Defines the Cron expression for the task execution schedule.
  * Example: `"0 0 * * *"` means run at midnight every day.
  * Syntax reference: [Kubernetes Cron Syntax](https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#schedule-syntax)
* `timeZone` (string): Only effective when `enabled: true`. Defines the time zone for the `cron` expression.
  * Example: `"Asia/Shanghai"`
  * Time zone reference: [Kubernetes Time Zones](https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#time-zones)
* `expiration` (string): Only effective when `enabled: true`. Defines the retention period for the results of each scheduled collection, which will be automatically cleaned up after expiration.
  * Example: `"720h"` (retains for 30 days)
  * Format reference: [Go Duration Format](https://golang.org/pkg/time/#ParseDuration)

`spec.type`

Defines the types of information to collect.

* `all` (default): If the `type` field is not specified, or is set to `all`, all the following information will be collected.
* Specify one or more types:
  * `config`: Alluxio configuration information.
  * `hardware`: Node hardware information.
  * `etcd`: Information stored in etcd.
  * `job-history`: Job history records.
  * `logs`: Logs from components like Coordinator, Worker, FUSE, etc.
  * `metrics`: Prometheus metrics from Alluxio components.
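
For example, a hypothetical spec that collects only logs and metrics would list just those two types:

```yaml
spec:
  type:
    - logs
    - metrics
```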

`spec.logs`

Defines the scope of log collection, based on Kubernetes `PodLogOptions`. Reference: [Kubernetes PodLogOptions](https://pkg.go.dev/k8s.io/api/core/v1#PodLogOptions)

* **Default behavior:** If the `logs` field is **left completely empty** or not configured, the Operator will default to collecting logs from the **past 1 day (86400 seconds)**.
* `tail` :
  * Collect the last N lines of each container's log.
  * Example: `1000` (collects the last 1000 lines).
* `sinceSeconds` :
  * Collect logs from N seconds ago to the present.
  * Example: `3600` (collects the past 1 hour).
* `sinceTime` :
  * Collect logs after a specific absolute time point (RFC3339 format).
  * Example: `"2025-11-12T06:00:00Z"` (collects logs after 6:00 AM UTC on November 12th).
* **Collection Rules:**
  1. `tail` has the highest priority. When `tail` is set, `sinceSeconds` and `sinceTime` are ignored.
  2. When `tail` is not set, either `sinceSeconds` or `sinceTime` takes effect, but only one of the two may be set.
  3. If none of these fields are set, logs from the past 1 day (86400 seconds) are collected.
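
As a sketch of rule 1, a `logs` block that sets `tail` causes any `sinceSeconds` or `sinceTime` value alongside it to be ignored:

```yaml
spec:
  logs:
    tail: 1000             # takes priority
    # sinceSeconds: 3600   # ignored while tail is set
```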

`spec.metrics`

Defines the scope for Prometheus metrics collection.

* `duration` (string): The duration to look back from "now" for collection.
  * Example: `24h` (collects metrics from the past 24 hours).
* `step` (string): The time interval for collection (sampling precision).
  * Example: `1m` (samples once per minute).
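
Together, `duration` and `step` determine how many data points each series contributes to the snapshot. For the example values above (24-hour window, 5-minute step), a quick calculation:

```shell
# samples per series = duration / step = (24 * 60 minutes) / 5 minutes
echo $(( 24 * 60 / 5 ))   # prints 288
```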

`spec.upload` (Optional)

Provides a way to automatically upload the collected results (compressed package) to an AWS S3 bucket provided by Alluxio.

* **Default behavior:** If the `upload` block is not configured, the collected results are **not** uploaded automatically. Retrieve them manually from the doctor-controller pod, as described in "Accessing and Downloading the Snapshot" above.
* `account` (string): Required. Account information provided by Alluxio.
* `productionId` (string): Optional. A product ID used to identify the customer environment.
* `awsKey` (string): Required. AWS Access Key provided by Alluxio.
* `awsSecret` (string): Required. AWS Secret Key provided by Alluxio.

## 3. Common Issues and Recovery Procedures

Here are step-by-step guides for recovering from common component failures.

### Coordinator Failure

The coordinator runs the Alluxio job service, which is responsible for managing asynchronous jobs like distributed loads for preloading cache. It persists its job history and can recover its state upon restart. Kubernetes will automatically restart a failed coordinator pod. If the job history is corrupted, any unfinished jobs may be lost and will need to be resubmitted.

### Worker Failure

Alluxio is designed to be resilient to worker failures. If a worker pod fails, Kubernetes will restart it automatically. Data stored in cache on that worker will be lost, but this will not cause I/O operations to fail (though it may temporarily decrease performance as data is re-fetched).

### FUSE Failure

If a FUSE pod crashes or becomes unresponsive, it will be automatically restarted by its controller (either a DaemonSet or the CSI driver). If a FUSE pod is hung, you can force a restart:

```shell
# Manually delete the pod to trigger a restart
kubectl -n alx-ns delete pod <fuse-pod-name>
```

### ETCD Failure

Alluxio has a grace period (typically 24 hours) to tolerate an etcd failure without disrupting I/O. If the integrated etcd cluster fails and cannot be recovered by a simple pod restart, you may need to rebuild it.

**Warning: This is a destructive operation and should only be performed as a last resort.**

1. **Shut down the Alluxio cluster:** `kubectl delete -f alluxio-cluster.yaml`
2. **Delete the original etcd PVCs:** `kubectl -n alx-ns delete pvc -l app.kubernetes.io/component=etcd`
3. **Clear etcd data on nodes:** Manually log into each Kubernetes node that hosted an etcd pod and delete the contents of the host path directory used by the etcd PV.
4. **Recreate the cluster:** `kubectl create -f alluxio-cluster.yaml`. The operator will provision a new, empty etcd cluster.
5. **Re-mount UFS paths:** If you were not using the `UnderFileSystem` CRD to manage mounts, you will need to manually re-add them using `alluxio fs mount`.

