# Collecting Cluster Information

First, ensure that the operator has started successfully and that the `collectinfo` controller is running. The following output lists the operator pods, including a running `collectinfo` controller. If there is no `collectinfo` controller pod, the installed operator version does not support the `collectinfo` feature; upgrade the operator to a version that does.

```console
kubectl get pod -n alluxio-operator
NAME                                             READY   STATUS    RESTARTS   AGE
alluxio-cluster-controller-8656d54bc-x6ms6       1/1     Running   0          19s
alluxio-collectinfo-controller-cc49c56b6-wlw8k   1/1     Running   0          19s
alluxio-csi-controller-84df9646fd-4d5b8          2/2     Running   0          19s
alluxio-csi-nodeplugin-fcp7b                     2/2     Running   0          19s
alluxio-csi-nodeplugin-t59ch                     2/2     Running   0          19s
alluxio-csi-nodeplugin-vbq2q                     2/2     Running   0          19s
alluxio-ufs-controller-57fbdf8d5c-2f79l          1/1     Running   0          19s
```
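The check above can be scripted. The following is a minimal sketch that assumes the controller pod name starts with `alluxio-collectinfo-controller-` and that `STATUS` is the third column, as in the example output; `has_running_collectinfo_controller` is a hypothetical helper, not part of the operator.

```shell
# Hypothetical helper: reads `kubectl get pod -n alluxio-operator` output on stdin
# and succeeds only if a collectinfo controller pod is listed as Running.
# Assumption: the pod name starts with "alluxio-collectinfo-controller-" and
# STATUS is the 3rd column, as in the output above.
has_running_collectinfo_controller() {
  awk 'NR > 1 && $1 ~ /^alluxio-collectinfo-controller-/ && $3 == "Running" { found = 1 }
       END { exit (found ? 0 : 1) }'
}

# Usage:
# if kubectl get pod -n alluxio-operator | has_running_collectinfo_controller; then
#   echo "collectinfo controller is running"
# else
#   echo "collectinfo controller not found or not running; check the operator version"
# fi
```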

Next, ensure that the Alluxio cluster has started successfully. The examples on this page assume the cluster is in the `default` namespace. In the following output, all components of the Alluxio cluster are running.

```console
kubectl get pod -n default
NAME                                          READY   STATUS    RESTARTS   AGE
alluxio-coordinator-0                         1/1     Running   0          2m17s
alluxio-etcd-0                                1/1     Running   0          2m17s
alluxio-etcd-1                                1/1     Running   0          2m17s
alluxio-etcd-2                                1/1     Running   0          2m17s
alluxio-monitor-grafana-9fd587b4f-mnczs       1/1     Running   0          2m17s
alluxio-monitor-prometheus-6b55c568b8-sfp96   1/1     Running   0          2m17s
alluxio-worker-779d87567f-95wls               1/1     Running   0          2m17s
alluxio-worker-779d87567f-sgh4b               1/1     Running   0          2m17s
```
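A similar sketch checks that every pod in the cluster namespace is `Running` with all of its containers ready. `all_pods_ready` is a hypothetical helper that parses the default `kubectl get pod` table, so it assumes the column layout shown above.

```shell
# Hypothetical helper: reads `kubectl get pod` output on stdin and succeeds only if
# at least one pod is listed and every pod is Running with READY "n/n".
# Prints each pod that is not ready.
# Assumption: READY is the 2nd column and STATUS the 3rd, as in the output above.
all_pods_ready() {
  awk 'NR > 1 {
         split($2, r, "/")
         if ($3 != "Running" || r[1] != r[2]) { print "not ready: " $1; bad = 1 }
       }
       END {
         if (NR < 2) { print "no pods found"; exit 1 }
         exit (bad ? 1 : 0)
       }'
}

# Usage:
# kubectl get pod -n default | all_pods_ready && echo "Alluxio cluster is ready"
```

For a blocking check that does not parse the table, `kubectl wait --for=condition=Ready pod --all -n default --timeout=300s` waits until every pod in the namespace reports Ready.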

### Collecting Information

Create a simple YAML file to collect information using default values (for a complete configuration, refer to [Detailed Configuration](#detailed-configuration)).

Assuming the Alluxio cluster is in the `default` namespace, create `collectinfo.yaml` with the following contents.

```yaml
apiVersion: k8s-operator.alluxio.com/v1
kind: CollectInfo
metadata:
  name: example-collectinfo
spec:
  alluxio:
    namespace: "default"
```

Apply the file to create the `collectinfo` resource and start collecting information.

```console
kubectl apply -f collectinfo.yaml
```
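To collect from a cluster in another namespace, or to collect only some information types, the manifest can be generated instead of edited by hand. This sketch assumes the `type` list accepts the values described under [Detailed Configuration](#detailed-configuration); `collectinfo_manifest` is a hypothetical helper.

```shell
# Hypothetical helper: prints a CollectInfo manifest to stdout.
# Usage: collectinfo_manifest <name> <namespace> [type...]
# With no types, "type" is omitted, which collects all information.
# Assumption: valid types are those listed under Detailed Configuration
# (config, hardware, license, logs, metrics).
collectinfo_manifest() {
  if [ "$#" -lt 2 ]; then
    echo "usage: collectinfo_manifest <name> <namespace> [type...]" >&2
    return 1
  fi
  local name=$1 namespace=$2 t
  shift 2
  cat <<EOF
apiVersion: k8s-operator.alluxio.com/v1
kind: CollectInfo
metadata:
  name: ${name}
spec:
  alluxio:
    namespace: "${namespace}"
EOF
  if [ "$#" -gt 0 ]; then
    echo "  type:"
    for t in "$@"; do
      echo "    - ${t}"
    done
  fi
}
```

For example, `collectinfo_manifest example-collectinfo default > collectinfo.yaml` reproduces the file above, and `collectinfo_manifest logs-only default logs | kubectl apply -f -` creates a resource that collects only logs.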

Check the progress of the collection by viewing the status of the `collectinfo` resource. In the following output, all five types of information were collected successfully.

```console
kubectl get collectinfo
NAME                  COMPLETED   FAILED   STATE       AGE
example-collectinfo   5/5         0/5      Completed   6m16s
```
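Rather than rerunning `kubectl get collectinfo` by hand, a script can poll until the collection finishes. This sketch assumes `STATE` is the fourth column and that `Completed` and `Failed` are the terminal states, as in the outputs on this page; `wait_for_collectinfo` is a hypothetical helper.

```shell
# Hypothetical helper: polls a CollectInfo resource until it finishes.
# Usage: wait_for_collectinfo <name> [timeout-seconds] [interval-seconds]
# Prints the final state; returns 0 for Completed, 1 for Failed, 2 on timeout.
# Assumption: STATE is the 4th column of `kubectl get collectinfo`, and
# Completed/Failed are the terminal states, as in the outputs on this page.
wait_for_collectinfo() {
  local name=$1 timeout=${2:-600} interval=${3:-5}
  local waited=0 state=""
  while [ "$waited" -lt "$timeout" ]; do
    state=$(kubectl get collectinfo "$name" --no-headers 2>/dev/null | awk '{ print $4 }') || true
    case "$state" in
      Completed) echo "$state"; return 0 ;;
      Failed) echo "$state"; return 1 ;;
    esac
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "timed out after ${timeout}s (last state: ${state:-unknown})" >&2
  return 2
}

# Usage:
# wait_for_collectinfo example-collectinfo 600 5
```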

The `collectinfo` resource creates one job per information type in the `alluxio-operator` namespace. By default, all information is collected, so five jobs are created: config, hardware, license, logs, and metrics. In the following output, all five jobs have completed.

```console
kubectl get job -n alluxio-operator
NAME                               COMPLETIONS   DURATION   AGE
example-collectinfo-config-job     1/1           4s         4m10s
example-collectinfo-hardware-job   1/1           5s         4m10s
example-collectinfo-license-job    1/1           10s        4m10s
example-collectinfo-logs-job       1/1           5s         4m10s
example-collectinfo-metrics-job    1/1           4s         4m10s
```
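To see which of these jobs finished, each job's `.status.succeeded` count can be read directly. This sketch assumes job names follow the `<collectinfo-name>-<type>-job` pattern shown above; `collectinfo_job_summary` is a hypothetical helper that prints `missing` when a job cannot be found.

```shell
# Hypothetical helper: prints "<type> <succeeded-count>" for each collection job.
# Assumption: jobs are named "<collectinfo-name>-<type>-job" in the
# alluxio-operator namespace, as in the output above.
collectinfo_job_summary() {
  local name=$1 type succeeded
  for type in config hardware license logs metrics; do
    if ! succeeded=$(kubectl get job -n alluxio-operator "${name}-${type}-job" \
        -o jsonpath='{.status.succeeded}' 2>/dev/null); then
      echo "${type} missing"
      continue
    fi
    # Kubernetes omits .status.succeeded while it is zero, so empty means 0.
    echo "${type} ${succeeded:-0}"
  done
}

# Usage:
# collectinfo_job_summary example-collectinfo
```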

#### Collecting information failed

The following output shows a failed collection, in which four of the five information types could not be collected.

```console
kubectl get collectinfo
NAME                  COMPLETED   FAILED   STATE    AGE
example-collectinfo   1/5         4/5      Failed   52s
```

List the jobs created by the `collectinfo` resource. Only the hardware job succeeded; the other four jobs did not complete.

```console
kubectl get job -n alluxio-operator
NAME                               COMPLETIONS   DURATION   AGE
example-collectinfo-config-job     0/1           4m18s      4m18s
example-collectinfo-hardware-job   1/1           5s         4m18s
example-collectinfo-license-job    0/1           4m18s      4m18s
example-collectinfo-logs-job       0/1           4m18s      4m18s
example-collectinfo-metrics-job    0/1           4m18s      4m18s
```
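To debug the failed jobs, list the incomplete ones and inspect their pod logs. `incomplete_collectinfo_jobs` is a hypothetical helper that assumes the `COMPLETIONS` column layout shown above; `kubectl logs job/<name>` prints the logs of one of the job's pods.

```shell
# Hypothetical helper: reads `kubectl get job -n alluxio-operator` output on stdin
# and prints the jobs of the given CollectInfo whose COMPLETIONS are not complete.
# Assumption: job names start with "<collectinfo-name>-" and COMPLETIONS is the
# 2nd column in "succeeded/desired" form, as in the output above.
incomplete_collectinfo_jobs() {
  awk -v prefix="$1-" 'NR > 1 && index($1, prefix) == 1 {
    split($2, c, "/")
    if (c[1] != c[2]) print $1
  }'
}

# Usage: print the last 20 log lines of each incomplete job.
# kubectl get job -n alluxio-operator |
#   incomplete_collectinfo_jobs example-collectinfo |
#   while read -r job; do
#     echo "== ${job}"
#     kubectl logs -n alluxio-operator "job/${job}" --tail=20
#   done
```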

You can download the collection results whether the `collectinfo` operation succeeded or failed.

If any part of the collection failed, the results include an `error.log` file to help with debugging.

### Downloading Results

There are two ways to download the results: `kubectl cp` and `kubectl port-forward`.

Results contain the following types of information:

* config: The configuration files in Alluxio's `conf/` directory, such as `alluxio-site.properties` and `alluxio-env.sh`.
* hardware: CPU and memory details for each Kubernetes node, and hardware specifications for the coordinator, worker, FUSE, and operator components.
* license: The license information of the Alluxio cluster, including the license type, `productionId`, and `licenseVersion`, as well as the vCPU, memory, and storage currently in use.
* logs: Logs from the coordinator, worker, FUSE, etcd, and operator components. The number of lines collected from the end of each log can be limited with `logs.tail`.
* metrics: All metrics over a configurable time range and sampling interval, set with `metrics.duration` and `metrics.step`.

#### kubectl cp

Use `kubectl cp` to copy the collected information to your local machine.

```shell
# Store the name of the collectinfo controller pod in a shell variable
COLLECTINFO_CONTROLLER_NAME=$(kubectl get pod -n alluxio-operator -l app.kubernetes.io/component=collectinfo-controller -o jsonpath="{.items[0].metadata.name}")
# Replace <COLLECTINFO_NAME> with the name of the collectinfo resource, e.g., example-collectinfo
kubectl cp "alluxio-operator/${COLLECTINFO_CONTROLLER_NAME}:/tmp/output/<COLLECTINFO_NAME>" output
```

#### kubectl port-forward

Use `kubectl port-forward` to forward local port 28080 to port 80 of the `collectinfo` controller pod.

```shell
# Store the name of the collectinfo controller pod in a shell variable
COLLECTINFO_CONTROLLER_NAME=$(kubectl get pod -n alluxio-operator -l app.kubernetes.io/component=collectinfo-controller -o jsonpath="{.items[0].metadata.name}")
kubectl port-forward -n alluxio-operator ${COLLECTINFO_CONTROLLER_NAME} 28080:80
```

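Keep the port-forward running, and in another terminal use `curl` to download the collected information. Replace `<COLLECTINFO_NAME>` with the name of the `collectinfo` resource.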

```shell
curl -H "Collectinfo-Name: <COLLECTINFO_NAME>" http://127.0.0.1:28080/download -o output.tar
```
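A slightly more defensive download makes `curl` fail on HTTP errors and checks that the response is a readable tar archive before it is extracted. `download_collectinfo` is a hypothetical helper; it assumes the port-forward above is still running on local port 28080.

```shell
# Hypothetical helper: downloads the results of one CollectInfo through a running
# `kubectl port-forward` and verifies the archive before reporting success.
# Usage: download_collectinfo <name> [output-file] [local-port]
# Assumption: the port-forward maps local port 28080 to the controller's port 80.
download_collectinfo() {
  local name=$1 out=${2:-output.tar} port=${3:-28080}
  # --fail makes curl exit with an error on HTTP responses of 400 or above
  if ! curl --fail --silent --show-error \
      -H "Collectinfo-Name: ${name}" \
      -o "$out" "http://127.0.0.1:${port}/download"; then
    echo "download failed for ${name}" >&2
    return 1
  fi
  # tar -t only lists the contents, and fails if the file is not a tar archive
  if ! tar -tf "$out" >/dev/null 2>&1; then
    echo "${out} is not a readable tar archive" >&2
    return 1
  fi
  echo "saved ${out}"
}

# Usage:
# download_collectinfo example-collectinfo output.tar
```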

Extract the downloaded file.

```shell
tar -xvf output.tar
```
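Whichever download method you used, the results can then be checked for the `error.log` file that is included when a collection fails. Its exact location inside the results is not stated on this page, so this sketch searches the whole directory tree; `report_collectinfo_errors` is a hypothetical helper.

```shell
# Hypothetical helper: prints every error.log under a results directory.
# Returns 0 if none is found and 1 if at least one is found, so it can gate scripts.
# Assumption: the location of error.log inside the results is not fixed, so the
# whole tree is searched.
report_collectinfo_errors() {
  local dir=${1:-.} logs f
  logs=$(find "$dir" -type f -name error.log)
  if [ -z "$logs" ]; then
    echo "no error.log found under ${dir}"
    return 0
  fi
  while IFS= read -r f; do
    echo "== ${f}"
    cat "$f"
  done <<< "$logs"
  return 1
}

# Usage (after `kubectl cp`, the results are in the local "output" directory):
# report_collectinfo_errors output
```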

### Detailed Configuration

```yaml
apiVersion: k8s-operator.alluxio.com/v1
kind: CollectInfo
metadata:
  name: example-collectinfo
spec:
  alluxio:
    # The namespace where the Alluxio cluster is located
    namespace: "default"
  # Information collection types, including config, hardware, license, logs, metrics
  # If not specified or set to "all", all information is collected
  # To specify multiple types:
  # type:
  #   - config
  #   - hardware
  type:
    - all
  # The number of retries. If a collection job fails, it will retry the specified number of times
  backoffLimit: 2
  logs:
    # The number of lines to collect from the end of each log, e.g., 100 collects the last 100 lines
    tail: 100
  # Metrics collection window: "duration" sets how far back to collect, and "step" sets the sampling interval
  # The example below collects all metrics from the past two hours, sampled at one-minute intervals
  metrics:
    # How far back from now to collect metrics, e.g., 2h collects the past two hours
    duration: 2h
    # The sampling interval, e.g., 1m samples metrics every minute
    step: 1m
  # The image used to run the collection jobs. If omitted, the Alluxio operator's image is used
  image: "<ALLUXIO_OPERATOR_IMAGE>"
  imagePullPolicy: "Always"
  # Resource requests and limits for the collection jobs
  resources:
    requests:
      memory: "512Mi"
      cpu: "250m"
    limits:
      memory: "1Gi"
      cpu: "500m"
```

