# Collecting Cluster Information

## Collecting cluster information

First, ensure that the operator has started successfully and that the `collectinfo` controller is running. The following output lists the operator pods, including a running `collectinfo` controller. If no `collectinfo` controller pod exists, the installed operator version does not support the `collectinfo` feature, and you need to upgrade the operator.

```console
kubectl get pod -n alluxio-operator
NAME                                             READY   STATUS    RESTARTS   AGE
alluxio-cluster-controller-8656d54bc-x6ms6       1/1     Running   0          19s
alluxio-collectinfo-controller-cc49c56b6-wlw8k   1/1     Running   0          19s
alluxio-csi-controller-84df9646fd-4d5b8          2/2     Running   0          19s
alluxio-csi-nodeplugin-fcp7b                     2/2     Running   0          19s
alluxio-csi-nodeplugin-t59ch                     2/2     Running   0          19s
alluxio-csi-nodeplugin-vbq2q                     2/2     Running   0          19s
alluxio-ufs-controller-57fbdf8d5c-2f79l          1/1     Running   0          19s
```
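The check above can also be scripted. The following is a minimal sketch, assuming the controller pod carries the `app.kubernetes.io/component=collectinfo-controller` label used by the download commands later in this guide. The function name `collectinfo_controller_ready` and the 60-second default timeout are illustrative choices, not part of the operator.

```shell
OPERATOR_NAMESPACE="alluxio-operator"
# Assumption: same label selector as the download commands in this guide.
COLLECTINFO_LABEL="app.kubernetes.io/component=collectinfo-controller"

# Succeeds only if a collectinfo controller pod exists and becomes Ready
# within the timeout (default 60s, an arbitrary choice).
collectinfo_controller_ready() {
  timeout="${1:-60s}"
  # Fail early with a clear message if no pod matches the label.
  pods=$(kubectl get pod -n "$OPERATOR_NAMESPACE" -l "$COLLECTINFO_LABEL" -o name)
  if [ -z "$pods" ]; then
    echo "no collectinfo controller pod found; upgrade the operator" >&2
    return 1
  fi
  kubectl wait pod -n "$OPERATOR_NAMESPACE" -l "$COLLECTINFO_LABEL" \
    --for=condition=Ready --timeout="$timeout"
}
```

Calling `collectinfo_controller_ready 30s` returns a non-zero exit status if the controller is missing or not Ready, which makes it usable as a guard at the top of a collection script.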

Next, ensure that the Alluxio cluster has started successfully. This guide assumes the Alluxio cluster is in the `default` namespace. The following output shows all components of the Alluxio cluster running.

```console
kubectl get pod
NAME                                          READY   STATUS    RESTARTS   AGE
alluxio-coordinator-0                         1/1     Running   0          2m17s
alluxio-etcd-0                                1/1     Running   0          2m17s
alluxio-etcd-1                                1/1     Running   0          2m17s
alluxio-etcd-2                                1/1     Running   0          2m17s
alluxio-monitor-grafana-9fd587b4f-mnczs       1/1     Running   0          2m17s
alluxio-monitor-prometheus-6b55c568b8-sfp96   1/1     Running   0          2m17s
alluxio-worker-779d87567f-95wls               1/1     Running   0          2m17s
alluxio-worker-779d87567f-sgh4b               1/1     Running   0          2m17s
```

### Collecting Information

Create a simple YAML file to collect information using default values (for a complete configuration, refer to [Detailed Configuration](#detailed-configuration)).

Assuming the Alluxio cluster is in the `default` namespace, create `collectinfo.yaml` with the following contents.

```yaml
apiVersion: k8s-operator.alluxio.com/v1
kind: CollectInfo
metadata:
  name: example-collectinfo
spec:
  alluxio:
    namespace: "default"
```

Apply the file to create the `collectinfo` resource and start collecting information.

```console
kubectl apply -f collectinfo.yaml
```

You can check the progress of information collection by viewing the status of the `collectinfo` resource. In the following output, all five types of information were collected successfully.

```console
kubectl get collectinfo
NAME                  COMPLETED   FAILED   STATE       AGE
example-collectinfo   5/5         0/5      Completed   6m16s
```
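For scripted use, the status can be polled until collection finishes. The sketch below reads the `STATE` column from the table output shown above and only distinguishes the two states that appear in this guide (`Completed` and `Failed`). The function names, the 5-second interval, and the 60-attempt cap are illustrative assumptions.

```shell
# Print the STATE column from one row of `kubectl get collectinfo` output.
# Columns, as shown in this guide: NAME COMPLETED FAILED STATE AGE.
collectinfo_state() {
  awk '{ print $4 }'
}

# Poll until the named collectinfo reaches Completed or Failed.
# Returns 0 on Completed, 1 on Failed, 2 if the attempt cap is reached.
wait_for_collectinfo() {
  name="$1"
  attempts=0
  while [ "$attempts" -lt 60 ]; do
    state=$(kubectl get collectinfo "$name" --no-headers | collectinfo_state)
    case "$state" in
      Completed) echo "$name completed"; return 0 ;;
      Failed)    echo "$name failed" >&2; return 1 ;;
    esac
    attempts=$((attempts + 1))
    sleep 5
  done
  echo "timed out waiting for $name" >&2
  return 2
}
```

For example, `wait_for_collectinfo example-collectinfo` blocks until the resource created above finishes.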

The `collectinfo` resource creates one job per information type in the `alluxio-operator` namespace to collect information about the Alluxio cluster. By default, all information is collected, so there are five jobs: config, hardware, license, logs, and metrics.

```console
kubectl get job -n alluxio-operator
NAME                               COMPLETIONS   DURATION   AGE
example-collectinfo-config-job     1/1           4s         4m10s
example-collectinfo-hardware-job   1/1           5s         4m10s
example-collectinfo-license-job    1/1           10s        4m10s
example-collectinfo-logs-job       1/1           5s         4m10s
example-collectinfo-metrics-job    1/1           4s         4m10s
```

#### Collecting information failed

The following shows a failed collection in which four of the five types of information could not be collected.

```console
kubectl get collectinfo
NAME                  COMPLETED   FAILED   STATE    AGE
example-collectinfo   1/5         4/5      Failed   52s
```

Check the job information of the `collectinfo`. You can see that only the hardware job of `collectinfo` succeeded, while the other jobs failed.

```console
kubectl get job -n alluxio-operator
NAME                               COMPLETIONS   DURATION   AGE
example-collectinfo-config-job     0/1           4m18s      4m18s
example-collectinfo-hardware-job   1/1           5s         4m18s
example-collectinfo-license-job    0/1           4m18s      4m18s
example-collectinfo-logs-job       0/1           4m18s      4m18s
example-collectinfo-metrics-job    0/1           4m18s      4m18s
```
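To see why a job failed, its pod logs can be inspected with standard `kubectl` commands. The sketch below filters the job table shown above for jobs whose `COMPLETIONS` count has not reached its target, then prints the last lines of each job's logs. It assumes the `<collectinfo name>-<type>-job` naming pattern visible in the output above; the function names and the 50-line tail are illustrative.

```shell
# Print the names of jobs belonging to the given collectinfo whose
# COMPLETIONS column (e.g. 0/1) has not reached its target.
# Reads `kubectl get job --no-headers` rows on stdin:
# NAME COMPLETIONS DURATION AGE.
incomplete_jobs() {
  awk -v prefix="$1-" '
    index($1, prefix) == 1 {
      split($2, c, "/")
      if (c[1] != c[2]) print $1
    }'
}

# Show the last 50 log lines of every incomplete job of a collectinfo.
show_incomplete_job_logs() {
  name="$1"
  kubectl get job -n alluxio-operator --no-headers | incomplete_jobs "$name" |
    while read -r job; do
      echo "==== $job ===="
      kubectl logs -n alluxio-operator "job/$job" --tail=50 </dev/null
    done
}
```

Running `show_incomplete_job_logs example-collectinfo` against the failed example above would print logs for the config, license, logs, and metrics jobs.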

You can download the collection results whether the `collectinfo` operation succeeded or failed.

If any collection fails, the results include an `error.log` file to help with debugging.
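Once the results are on your local machine (see Downloading Results), any `error.log` files can be printed in one pass. This is a minimal sketch: this guide does not specify where `error.log` sits inside the results, so the whole directory tree is searched. The function name and the default directory `output` (the target used in the `kubectl cp` example) are assumptions.

```shell
# Print every error.log found under a downloaded results directory.
# Assumption: the location of error.log inside the results is unknown,
# so the whole tree is searched.
show_collectinfo_errors() {
  results_dir="${1:-output}"
  find "$results_dir" -type f -name error.log | while read -r f; do
    echo "==== $f ===="
    cat "$f"
  done
}
```

The function prints nothing when no `error.log` exists under the directory.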

### Downloading Results

There are two ways to download the collected results: `kubectl cp` and `kubectl port-forward`.

Results contain the following types of information:

* config: The configuration files in Alluxio's `conf/` directory, such as `alluxio-site.properties` and `alluxio-env.sh`.
* hardware: CPU and memory details for each Kubernetes node, and hardware specifications for the coordinator, worker, fuse, and operator components.
* license: The license information of the Alluxio cluster, including the type, productionId, and licenseVersion, as well as the vCPU, memory, and storage in use.
* logs: Logs from the coordinator, worker, fuse, etcd, and operator components. Supports tailing to collect only a specified number of lines from the end of the logs.
* metrics: All metrics of the Alluxio cluster. The duration and step settings define the time range and the sampling interval.

#### kubectl cp

Use `kubectl cp` to copy the collected information to your local machine.

```shell
# Store the name of the collectinfo controller pod in a shell variable
COLLECTINFO_CONTROLLER_NAME=$(kubectl get pod -n alluxio-operator -l app.kubernetes.io/component=collectinfo-controller -o jsonpath="{.items[0].metadata.name}")
# Replace <COLLECTINFO_NAME> with the name of your collectinfo, e.g. example-collectinfo
kubectl cp alluxio-operator/${COLLECTINFO_CONTROLLER_NAME}:/tmp/output/<COLLECTINFO_NAME> output -n alluxio-operator
```

#### kubectl port-forward

Use `kubectl port-forward` to map port 80 of the `collectinfo` controller to port 28080 on your local machine. The command runs in the foreground, so keep it running while you download.

```shell
# Store the name of the collectinfo controller pod in a shell variable
COLLECTINFO_CONTROLLER_NAME=$(kubectl get pod -n alluxio-operator -l app.kubernetes.io/component=collectinfo-controller -o jsonpath="{.items[0].metadata.name}")
kubectl port-forward -n alluxio-operator ${COLLECTINFO_CONTROLLER_NAME} 28080:80
```

In another terminal, use `curl` to download the collected information. Replace `<COLLECTINFO_NAME>` with the name of your `collectinfo`.

```shell
curl -H "Collectinfo-Name: <COLLECTINFO_NAME>" http://127.0.0.1:28080/download -o output.tar
```

Extract the downloaded file.

```shell
tar -xvf output.tar
```
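Before extracting, it can help to confirm that the download is a readable tar archive, since a failed request may leave an error response in `output.tar` instead. The following sketch lists the archive contents first and extracts into a dedicated directory only if the listing succeeds. The function name and the `collectinfo-output` directory are hypothetical.

```shell
# List an archive's contents, then extract it into its own directory.
# Fails without extracting if the file is not a readable tar archive.
extract_collectinfo_archive() {
  archive="${1:-output.tar}"
  dest="${2:-collectinfo-output}"   # hypothetical directory name
  if ! tar -tf "$archive"; then
    echo "$archive is not a valid tar archive" >&2
    return 1
  fi
  mkdir -p "$dest"
  tar -xf "$archive" -C "$dest"
}
```

Adding curl's `--fail` option to the download command is a complementary safeguard: it makes `curl` exit with a non-zero status if the server responds with an HTTP error status.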

### Detailed Configuration

```yaml
apiVersion: k8s-operator.alluxio.com/v1
kind: CollectInfo
metadata:
  name: example-collectinfo
spec:
  alluxio:
    # The namespace where the Alluxio cluster is located
    namespace: "default"
  # Information collection types, including config, hardware, license, logs, metrics
  # If not specified or set to "all", all information is collected
  # To specify multiple types:
  # type:
  #   - config
  #   - hardware
  type:
    - all
  # The number of times to retry a failed collection job
  backoffLimit: 2
  logs:
    # The number of log lines to collect from the end, e.g., 100 collects the latest 100 lines
    tail: 100
  # Metrics settings: "duration" sets the time range and "step" sets the sampling interval
  # The example below collects all metrics from the past two hours, sampled at one-minute intervals
  metrics:
    # The time range to collect, e.g., 2h covers the past two hours up to now
    duration: 2h
    # The sampling interval, e.g., 1m means one data point per minute
    step: 1m
  # The image used for executing the collection task, defaulting to the Alluxio operator's image
  # Can be left unspecified to use the default value
  image: "<ALLUXIO_OPERATOR_IMAGE>"
  imagePullPolicy: "Always"
  # Resource requests and limits for the collection jobs
  resources:
    requests:
      memory: "512Mi"
      cpu: "250m"
    limits:
      memory: "1Gi"
      cpu: "500m"
```
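As an illustration of combining these fields, the sketch below writes a manifest that collects only logs and metrics. All field names come from the configuration above; the resource name `logs-metrics-collectinfo`, the file name, and the chosen values are hypothetical.

```shell
# Write a CollectInfo manifest that gathers only logs and metrics.
# The resource name and the values are hypothetical examples.
write_logs_metrics_collectinfo() {
  manifest="${1:-collectinfo-logs-metrics.yaml}"
  cat > "$manifest" <<'EOF'
apiVersion: k8s-operator.alluxio.com/v1
kind: CollectInfo
metadata:
  name: logs-metrics-collectinfo
spec:
  alluxio:
    namespace: "default"
  type:
    - logs
    - metrics
  logs:
    tail: 500
  metrics:
    duration: 1h
    step: 1m
EOF
}
```

After calling `write_logs_metrics_collectinfo`, apply the result with `kubectl apply -f collectinfo-logs-metrics.yaml`. Fields left out of the manifest fall back to their defaults.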
