# Diagnostic Snapshot

To diagnose an issue with Alluxio, a "Diagnostic Snapshot" containing various pieces of cluster information is required. It is obtained by running the `collectinfo` function.

The results of `collectinfo` contain the following types of information:

* config: The configuration files in Alluxio's `conf/` directory
* hardware: CPU, memory, and other hardware specifications of the Kubernetes nodes running the coordinator, worker, fuse, and operator components.
* etcd: Information stored in etcd within the Alluxio cluster, including mount, quota, priority, TTL, workers, and license information.
* logs: Logs from coordinator, worker, fuse, and operator components. Supports tailing logs to show a specified number of lines from the end.
* metrics: Allows setting duration and step to define the time range and sampling interval for metrics.
* job history: The historical records of load, free, and copy jobs within the Alluxio cluster, including detailed job information and status.
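
Which of these types are gathered is controlled by the `type` field of the `CollectInfo` resource (see [Detailed Configuration](#detailed-configuration)). As a sketch, a resource that collects only configuration files and logs might look like the following; the resource name here is illustrative:

```yaml
apiVersion: k8s-operator.alluxio.com/v1
kind: CollectInfo
metadata:
  name: example-collectinfo
  namespace: alx-ns
spec:
  # collect only these two types instead of "all"
  type:
    - config
    - logs
  logs:
    # keep only the last 200 lines of each log file
    tail: 200
```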

## Prerequisites

First, ensure that the `operator` has started successfully and that the `alluxio-collectinfo-controller` is running. The output below shows the operator pods, including a running `alluxio-collectinfo-controller`. If the `alluxio-collectinfo-controller` pod does not exist, the current version of the `operator` does not support the `collectinfo` feature; please upgrade the `operator`.

```console
kubectl -n alluxio-operator get pod
NAME                                             READY   STATUS    RESTARTS   AGE
alluxio-cluster-controller-8656d54bc-x6ms6       1/1     Running   0          19s
alluxio-collectinfo-controller-cc49c56b6-wlw8k   1/1     Running   0          19s
alluxio-csi-controller-84df9646fd-4d5b8          2/2     Running   0          19s
alluxio-csi-nodeplugin-fcp7b                     2/2     Running   0          19s
alluxio-csi-nodeplugin-t59ch                     2/2     Running   0          19s
alluxio-csi-nodeplugin-vbq2q                     2/2     Running   0          19s
alluxio-ufs-controller-57fbdf8d5c-2f79l          1/1     Running   0          19s
```

Ensure that the Alluxio cluster has started successfully. The output below shows that all components of the Alluxio cluster are running.

```console
kubectl -n alx-ns get pod
NAME                                          READY   STATUS    RESTARTS   AGE
alluxio-cluster-coordinator-0                 1/1     Running   0          2m17s
alluxio-cluster-etcd-0                        1/1     Running   0          2m17s
alluxio-cluster-etcd-1                        1/1     Running   0          2m17s
alluxio-cluster-etcd-2                        1/1     Running   0          2m17s
alluxio-cluster-grafana-9fd587b4f-mnczs       1/1     Running   0          2m17s
alluxio-cluster-prometheus-6b55c568b8-sfp96   1/1     Running   0          2m17s
alluxio-cluster-worker-779d87567f-95wls       1/1     Running   0          2m17s
alluxio-cluster-worker-779d87567f-sgh4b       1/1     Running   0          2m17s
```

## Collecting Information

The `collectinfo` tool offers two collection methods: scheduled collection and one-time collection.

* Scheduled collection allows you to set the collection interval, such as daily, weekly, monthly, etc.
* One-time collection triggers an immediate collection task.

Once an Alluxio cluster is created, a scheduled collection is automatically generated for collecting cluster information.

By default, the scheduled collection runs daily. You can check the progress of the collection with the command below.\
The `LASTSCHEDULETIME` field indicates the next scheduled time for the collection task,\
while the `LASTSUCCESSFULTIME` field shows the time of the most recent successful collection.

```console
kubectl -n alx-ns get collectinfo
NAME                                  LASTSCHEDULETIME       LASTSUCCESSFULTIME     AGE
alluxio-cluster-alluxio-collectinfo                                                 0s
alluxio-cluster-alluxio-collectinfo                                                 0s
alluxio-cluster-alluxio-collectinfo   2025-01-23T00:00:00Z                          0s
alluxio-cluster-alluxio-collectinfo   2025-01-24T00:00:00Z                          2m
alluxio-cluster-alluxio-collectinfo   2025-01-24T00:00:00Z   2025-01-23T00:00:43Z   1d
alluxio-cluster-alluxio-collectinfo   2025-01-25T00:00:00Z   2025-01-23T00:00:43Z   1d
alluxio-cluster-alluxio-collectinfo   2025-01-25T00:00:00Z   2025-01-24T00:00:44Z   2d
alluxio-cluster-alluxio-collectinfo   2025-01-26T00:00:00Z   2025-01-24T00:00:44Z   2d
alluxio-cluster-alluxio-collectinfo   2025-01-26T00:00:00Z   2025-01-25T00:00:45Z   3d
```

After a collection is completed, the collected results will be saved to the `coordinator` pod to persist data.\
The collection results will be deleted based on the value of the `expiration` field (see [Detailed Configuration](#detailed-configuration) for more info).\
By default, the `expiration` value is set to `720h` or 30 days, meaning the collected results will be deleted after 30 days.

You can access the collected results by entering the `coordinator` pod. Use the following command to access the `coordinator` pod:

```console
# set an environment variable to save the name of the coordinator pod
COORDINATOR_NAME=$(kubectl -n alx-ns get pod -l app.kubernetes.io/name=alluxio,app.kubernetes.io/component=coordinator -o jsonpath="{.items[0].metadata.name}")

kubectl -n alx-ns exec -it ${COORDINATOR_NAME} -- bash
alluxio@alluxio-cluster-coordinator-0:/$

# move to the collectinfo directory
alluxio@alluxio-cluster-coordinator-0:~$ cd /mnt/alluxio/metastore/collectinfo/

# view the collection results
alluxio@alluxio-cluster-coordinator-0:/mnt/alluxio/metastore/collectinfo$ ls
alluxio-cluster-alluxio-collectinfo_alx-ns_2025-01-23-00-00-00.tar.gz  alluxio-cluster-alluxio-collectinfo_alx-ns_2025-01-25-00-00-00.tar.gz
alluxio-cluster-alluxio-collectinfo_alx-ns_2025-01-24-00-00-00.tar.gz
```

The collection results are stored in a `tar.gz` file. The file name includes the collection task name, the Alluxio cluster's namespace, and the collection time.
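
As an illustration, a result file name can be split into those three parts with standard bash string handling; the file name below follows the pattern shown in the listing above:

```shell
#!/usr/bin/env bash
# Split a collectinfo archive name of the form
#   <task-name>_<namespace>_<YYYY-MM-DD-HH-MM-SS>.tar.gz
# into its three parts using bash parameter expansion.
file="alluxio-cluster-alluxio-collectinfo_alx-ns_2025-01-23-00-00-00.tar.gz"

base="${file%.tar.gz}"        # strip the .tar.gz suffix
timestamp="${base##*_}"       # text after the last underscore
rest="${base%_*}"             # everything before the timestamp
namespace="${rest##*_}"       # text after the last remaining underscore
task="${rest%_*}"             # the collection task name

echo "task=${task}"
echo "namespace=${namespace}"
echo "timestamp=${timestamp}"
```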

If you want to modify the collection content or schedule, you can do the following.

1. Delete the existing `collectinfo` resource

```console
# delete collectinfo
kubectl -n alx-ns delete collectinfo <COLLECTINFO_NAME>
```

2. Create a new scheduled collection YAML

Assuming the Alluxio cluster is in the `alx-ns` namespace, create `collectinfo.yaml` with the following contents for a daily scheduled collection. The cron expression `"0 0 * * *"` represents a daily execution at midnight.\
You can refer to the [Cron schedule syntax](https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#schedule-syntax) for more information on how to build a cron expression.

```yaml
apiVersion: k8s-operator.alluxio.com/v1
kind: CollectInfo
metadata:
  name: example-collectinfo
  namespace: alx-ns
spec:
  scheduled:
    enabled: true
    cron: "0 0 * * *"
```

The [detailed configuration](#detailed-configuration) lists the possible fields and their description.

3. Apply the new scheduled collection

```console
kubectl apply -f collectinfo.yaml
```
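
To trigger a one-time collection instead, apply the same kind of resource with scheduling disabled; per the [Detailed Configuration](#detailed-configuration), setting `scheduled.enabled` to `false` triggers an immediate collection task. The resource name below is illustrative:

```yaml
apiVersion: k8s-operator.alluxio.com/v1
kind: CollectInfo
metadata:
  name: example-onetime-collectinfo
  namespace: alx-ns
spec:
  scheduled:
    # false disables scheduling and triggers an immediate collection
    enabled: false
```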

## Downloading Results

Use `kubectl cp` to copy the collected information to your local machine.

```console
# Set an environment variable to save the name of the coordinator pod
COORDINATOR_NAME=$(kubectl -n alx-ns get pod -l app.kubernetes.io/name=alluxio,app.kubernetes.io/component=coordinator -o jsonpath="{.items[0].metadata.name}")

# Copy the collection results from the coordinator pod to a local directory
kubectl -n alx-ns cp ${COORDINATOR_NAME}:/mnt/alluxio/metastore/collectinfo/ collectinfo-output
```
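
After copying, the archives can be unpacked locally with plain `tar`. The sketch below creates a sample archive to stand in for the files fetched by `kubectl cp` (the archive name is illustrative), then unpacks every snapshot into its own directory:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Demonstration setup: create a sample archive in place of the files
# fetched by `kubectl cp` above (the file name is illustrative).
mkdir -p collectinfo-output
echo "sample" > config.txt
tar -czf collectinfo-output/example-collectinfo_alx-ns_2025-01-23-00-00-00.tar.gz config.txt
rm config.txt

# Unpack each snapshot into a directory named after the archive
# (without the .tar.gz suffix).
for archive in collectinfo-output/*.tar.gz; do
  dir="${archive%.tar.gz}"
  mkdir -p "${dir}"
  tar -xzf "${archive}" -C "${dir}"
done
```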

## Detailed Configuration

```yaml
apiVersion: k8s-operator.alluxio.com/v1
kind: CollectInfo
metadata:
  name: example-collectinfo
  namespace: alx-ns
spec:
  scheduled:
    # Whether to enable scheduled collection. Set true to enable, which will collect information at the specified cron schedule.
    # Set false to disable scheduled collection, which will trigger an immediate collection task.
    enabled: false
    # Cron expression used to define the scheduled collection time.
    # The following example schedules a collection task to run at midnight every day.
    # For detailed syntax of cron expressions, refer to: https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#schedule-syntax
    cron: "0 0 * * *"
    # Time zone used for scheduling the collection task.
    # The following example sets the collection task to use the Shanghai time zone.
    # For a list of valid time zone values, refer to: https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#time-zones
    timezone: "Etc/UTC"
    # Expiration time for the collected data. Collection results older than the specified expiration time will be deleted.
    # The following example keeps the collection results for 720 hours (30 days). Results older than 720 hours will be deleted.
    # For more details on time durations, refer to: https://golang.org/pkg/time/#ParseDuration
    expiration: "720h"
  # Information collection types, including config, hardware, etcd, job-history, logs, metrics
  # If not specified or set to "all", all information is collected
  # To specify multiple types:
  # type:
  #   - config
  #   - hardware
  type:
    - all
  logs:
    # The number of log lines to collect from the end of each log file, e.g., 100 means collecting the latest 100 lines
    tail: 100
  # Metrics information: "duration" indicates the collection duration, and "step" indicates the sampling interval
  # The example below collects all metrics from the past 24 hours, with a one-minute interval between samples
  metrics:
    # The duration of metrics collection, e.g., 24h means collecting metrics from now back to 24 hours ago
    duration: 24h
    # The interval of metrics collection, e.g., 1m means collecting metrics every minute
    step: 1m
```
