
# Install on Kubernetes

This page shows how to install Alluxio on Kubernetes using an [Operator](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/), a Kubernetes extension for managing applications.

## Preparation

Please see [Resource Prerequisites and Compatibility](/ee-ai-en/ai-3.6/start/prerequisites.md) for resource planning recommendations.

This guide assumes that the required container images, both for Alluxio and for third-party components, are accessible to the Kubernetes cluster. See [handling images](/ee-ai-en/ai-3.6/start/install/handling-images.md) for instructions on how to extract and upload the Alluxio images from a provided tarball package, and for which third-party images to upload if the Kubernetes cluster does not have access to public image repositories.

### Extract the helm chart for operator

Download the operator Helm chart tarball to a machine that can deploy to a running Kubernetes cluster, then extract it:

```console
# the command will extract the files to the directory alluxio-operator/
$ tar zxf alluxio-operator-3.2.1-helmchart.tgz
```

The extracted `alluxio-operator` directory contains the Helm chart files responsible for deploying the operator.

## Deployment

### Deploy Alluxio operator

Create the `alluxio-operator/alluxio-operator.yaml` file to specify the image and version used for deploying the operator. The following example shows how to specify the `operator` image and version:

```yaml
global:
  image: <PRIVATE_REGISTRY>/alluxio-operator
  imageTag: 3.2.1
```

Move to the `alluxio-operator` directory and execute the following command to deploy the operator:

```console
$ cd alluxio-operator
# the last parameter is the path to the Helm chart directory; "." means the current directory
$ helm install operator -f alluxio-operator.yaml .
NAME: operator
LAST DEPLOYED: Wed May 15 17:32:34 2024
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None

# verify if the operator is running as expected
$ kubectl -n alluxio-operator get pod
NAME                                              READY   STATUS    RESTARTS   AGE
alluxio-cluster-controller-5647cc664d-lrx84       1/1     Running   0          14s
alluxio-collectinfo-controller-667b746fd6-hfzqk   1/1     Running   0          14s
alluxio-csi-controller-7bd66df6cf-7kh6k           2/2     Running   0          14s
alluxio-csi-nodeplugin-9cc9v                      2/2     Running   0          14s
alluxio-csi-nodeplugin-fgs5z                      2/2     Running   0          14s
alluxio-csi-nodeplugin-v22q6                      2/2     Running   0          14s
alluxio-ufs-controller-5f6d7c4d66-drjgm           1/1     Running   0          14s
```

> Deploying the Alluxio operator requires pulling dependent images from the public image repository. If the deployment of `alluxio-operator` fails because your network cannot reach the public image repository, refer to [Configuring alluxio-operator image](/ee-ai-en/ai-3.6/start/install/handling-images.md#unable-to-access-public-image-registry).
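Instead of re-running `kubectl get pod` by hand, a script can block until the operator pods are ready. The following is a minimal sketch, assuming the operator pods run in the `alluxio-operator` namespace as in the output above. The function name `wait_for_operator` and the 300-second default timeout are illustrative choices, not part of the Alluxio tooling.

```shell
# Hypothetical helper: block until every pod in the operator namespace is Ready.
# Usage: wait_for_operator [namespace] [timeout]
wait_for_operator() {
  local ns="${1:-alluxio-operator}"   # namespace taken from the example output above
  local timeout="${2:-300s}"          # illustrative default; tune for your cluster
  # `kubectl wait` returns non-zero if the condition is not met before the timeout
  if kubectl -n "$ns" wait --for=condition=Ready pod --all --timeout="$timeout"; then
    echo "operator pods are ready in namespace $ns"
  else
    echo "operator pods did not become ready within $timeout" >&2
    return 1
  fi
}
```

Calling `wait_for_operator` right after `helm install` lets a deployment script stop early if the operator never comes up.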

### Deploy Alluxio

Create the `alluxio-operator/alluxio-cluster.yaml` file to deploy the Alluxio cluster. The file below shows the minimal configuration, which is recommended for testing scenarios.

```yaml
apiVersion: k8s-operator.alluxio.com/v1
kind: AlluxioCluster
metadata:
  name: alluxio-cluster
  namespace: alx-ns
spec:
  image: <PRIVATE_REGISTRY>/alluxio-enterprise
  imageTag: AI-3.6-12.0.2
  properties:

  worker:
    count: 2
    pagestore:
      size: 100Gi
      reservedSize: 10Gi
```

The minimal configuration provided above can help you quickly deploy the Alluxio cluster for testing and validation. In a production environment where restarts are anticipated, we recommend deploying the Alluxio cluster using labels and selectors, as well as persisting information on PVCs.

Select a group of Kubernetes nodes to run the Alluxio cluster, and label the nodes accordingly:

```console
$ kubectl label nodes <node-name> alluxio-role=coordinator
$ kubectl label nodes <node-name> alluxio-role=worker
```
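Before deploying, it can help to confirm that the labels were applied to enough nodes. The following is a minimal sketch, assuming the `alluxio-role` label key used above; the helper name `count_nodes_with_role` is hypothetical.

```shell
# Hypothetical helper: print how many nodes carry the label alluxio-role=<role>.
count_nodes_with_role() {
  local role="$1"
  # -l filters by label selector; --no-headers leaves one line per matching node.
  # awk counts the non-empty lines and prints 0 when nothing matches.
  kubectl get nodes -l "alluxio-role=${role}" --no-headers 2>/dev/null |
    awk 'NF { n++ } END { print n + 0 }'
}
```

For example, comparing `$(count_nodes_with_role worker)` against the `worker.count` value in the cluster configuration gives an early signal when there are fewer labeled nodes than requested workers.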

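The following configuration is a starting template for production scenarios. It adds the `nodeSelector` fields, which pin the coordinator and workers to the labeled nodes, and the `metastore` field, which persists coordinator metadata on a PVC.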

```yaml
apiVersion: k8s-operator.alluxio.com/v1
kind: AlluxioCluster
metadata:
  name: alluxio-cluster
  namespace: alx-ns
spec:
  image: <PRIVATE_REGISTRY>/alluxio-enterprise
  imageTag: AI-3.6-12.0.2
  properties:

  coordinator:
    nodeSelector:
      alluxio-role: coordinator
    metastore:
      type: persistentVolumeClaim
      storageClass: "gp2"
      size: 4Gi
    
  worker:
    nodeSelector:
      alluxio-role: worker
    count: 2
    pagestore:
      size: 100Gi
      reservedSize: 10Gi
```

Move to the `alluxio-operator` directory and execute the following commands to deploy the Alluxio cluster:

```console
$ cd alluxio-operator
$ kubectl create namespace alx-ns
$ kubectl create -f alluxio-cluster.yaml
alluxiocluster.k8s-operator.alluxio.com/alluxio-cluster created

# the cluster will be starting
$ kubectl -n alx-ns get pod
NAME                                          READY   STATUS              RESTARTS   AGE
alluxio-cluster-coordinator-0                 0/1     Init:0/1            0          7s
alluxio-cluster-etcd-0                        0/1     ContainerCreating   0          7s
alluxio-cluster-etcd-1                        0/1     ContainerCreating   0          7s
alluxio-cluster-etcd-2                        0/1     ContainerCreating   0          7s
alluxio-cluster-grafana-847fd46f4b-84wgg      0/1     Running             0          7s
alluxio-cluster-prometheus-778547fd75-rh6r6   1/1     Running             0          7s
alluxio-cluster-worker-76c846bfb6-2jkmr       0/1     Init:0/2            0          7s
alluxio-cluster-worker-76c846bfb6-nqldm       0/1     Init:0/2            0          7s

# check the status of the cluster
$ kubectl -n alx-ns get alluxiocluster
NAME              CLUSTERPHASE   AGE
alluxio-cluster   Ready          2m18s

# and check the running pods after the cluster is ready
$ kubectl -n alx-ns get pod
NAME                                          READY   STATUS    RESTARTS   AGE
alluxio-cluster-coordinator-0                 1/1     Running   0          2m3s
alluxio-cluster-etcd-0                        1/1     Running   0          2m3s
alluxio-cluster-etcd-1                        1/1     Running   0          2m3s
alluxio-cluster-etcd-2                        1/1     Running   0          2m3s
alluxio-cluster-grafana-7b9477d66-mmcc5       1/1     Running   0          2m3s
alluxio-cluster-prometheus-78dbb89994-xxr4c   1/1     Running   0          2m3s
alluxio-cluster-worker-85fd45db46-c7n9p       1/1     Running   0          2m3s
alluxio-cluster-worker-85fd45db46-sqv2c       1/1     Running   0          2m3s
```

In Alluxio 3.x, the coordinator is a stateless control component that serves as the interface to the whole cluster and handles jobs such as distributed load.

> If some components in the cluster do not reach the `Running` state, use `kubectl describe pod` to view detailed information and identify the issue. For specific issues encountered during deployment, refer to the [FAQ](#faq) section.

> The Alluxio cluster also includes etcd and monitoring components. If their images cannot be pulled from the public image registry and etcd or monitoring fails to start, refer to [Configuring Alluxio Cluster Image](/ee-ai-en/ai-3.6/start/install/handling-images.md#unable-to-access-public-image-registry).
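The cluster takes a couple of minutes to start, so a script usually needs to poll the `CLUSTERPHASE` column until it reports `Ready`. The following is a minimal sketch that parses the table output shown above instead of assuming a field path inside the custom resource. The helper names, the attempt count, and the polling interval are all hypothetical.

```shell
# Hypothetical helper: print the CLUSTERPHASE column for one AlluxioCluster.
cluster_phase() {
  local ns="$1"
  local name="$2"
  # Columns in the table output are NAME, CLUSTERPHASE, AGE
  kubectl -n "$ns" get alluxiocluster "$name" --no-headers 2>/dev/null | awk '{ print $2 }'
}

# Hypothetical helper: poll until the phase is Ready or the attempts run out.
# Usage: wait_for_cluster_ready <namespace> <name> [attempts] [interval-seconds]
wait_for_cluster_ready() {
  local ns="$1"
  local name="$2"
  local attempts="${3:-60}"
  local interval="${4:-5}"
  local i=1
  local phase
  while [ "$i" -le "$attempts" ]; do
    phase="$(cluster_phase "$ns" "$name")"
    if [ "$phase" = "Ready" ]; then
      echo "cluster $name is Ready"
      return 0
    fi
    echo "attempt $i/$attempts: phase is '${phase:-unknown}', retrying in ${interval}s" >&2
    sleep "$interval"
    i=$((i + 1))
  done
  return 1
}
```

With the names used on this page, the call would be `wait_for_cluster_ready alx-ns alluxio-cluster`.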

### Mount storage to Alluxio

Alluxio supports integration with various underlying storage systems, including [S3](/ee-ai-en/ai-3.6/ufs/s3.md), [HDFS](/ee-ai-en/ai-3.6/ufs/hdfs.md), [OSS](/ee-ai-en/ai-3.6/ufs/aliyun-oss.md), [COS](/ee-ai-en/ai-3.6/ufs/cos.md), and [TOS](/ee-ai-en/ai-3.6/ufs/tos.md).

With the operator, you can mount underlying storage by creating `UnderFileSystem` resources. Each `UnderFileSystem` corresponds to one mount point in Alluxio. For how the Alluxio namespace relates to the under file system namespaces, see [Alluxio Namespace and Under File System Namespaces](/ee-ai-en/ai-3.6/overview/namespace.md).

Create the `alluxio-operator/ufs.yaml` file to specify the UFS configuration. The following example shows how to mount an S3 bucket to Alluxio.

```yaml
apiVersion: k8s-operator.alluxio.com/v1
kind: UnderFileSystem
metadata:
  name: alluxio-s3
  namespace: alx-ns
spec:
  alluxioCluster: alluxio-cluster
  path: s3://<S3_BUCKET>/<S3_DIRECTORY>
  mountPath: /s3
  mountOptions:
    s3a.accessKeyId: <S3_ACCESS_KEY_ID>
    s3a.secretKey: <S3_SECRET_KEY>
    alluxio.underfs.s3.region: <S3_REGION>
```

Find more details about mounting S3 to Alluxio in [Amazon AWS S3](/ee-ai-en/ai-3.6/ufs/s3.md).
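Because `mountOptions` carries credentials, you may prefer to generate the manifest from environment variables instead of hard-coding keys in a file. The following is a minimal sketch that reuses the fields from the example above; the function name `render_ufs_yaml` and the environment variable names are hypothetical.

```shell
# Hypothetical helper: print an UnderFileSystem manifest built from environment
# variables, using the same fields as the example above.
render_ufs_yaml() {
  # ${VAR:?message} aborts with an error if a required variable is unset or empty
  : "${S3_BUCKET:?set S3_BUCKET}" "${S3_DIRECTORY:?set S3_DIRECTORY}"
  : "${S3_ACCESS_KEY_ID:?set S3_ACCESS_KEY_ID}" "${S3_SECRET_KEY:?set S3_SECRET_KEY}"
  : "${S3_REGION:?set S3_REGION}"
  cat <<EOF
apiVersion: k8s-operator.alluxio.com/v1
kind: UnderFileSystem
metadata:
  name: alluxio-s3
  namespace: alx-ns
spec:
  alluxioCluster: alluxio-cluster
  path: s3://${S3_BUCKET}/${S3_DIRECTORY}
  mountPath: /s3
  mountOptions:
    s3a.accessKeyId: ${S3_ACCESS_KEY_ID}
    s3a.secretKey: ${S3_SECRET_KEY}
    alluxio.underfs.s3.region: ${S3_REGION}
EOF
}
```

For example, `render_ufs_yaml | kubectl create -f -` would create the resource without writing the credentials to disk, while `render_ufs_yaml > ufs.yaml` produces the file used in the next step.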

#### Executing the mount

First, ensure that the Alluxio cluster is up and running with a `Ready` or `WaitingForReady` status.

```console
# check the status of the cluster
$ kubectl -n alx-ns get alluxiocluster
NAME              CLUSTERPHASE   AGE
alluxio-cluster   Ready          2m18s
```

Execute the following commands to create the `UnderFileSystem` resource and mount it to the Alluxio namespace:

```console
$ cd alluxio-operator
$ kubectl create -f ufs.yaml
underfilesystem.k8s-operator.alluxio.com/alluxio-s3 created

# verify the status of the storage
$ kubectl -n alx-ns get ufs
NAME         PHASE   AGE
alluxio-s3   Ready   46s

# also check the mount table via Alluxio command line
$ kubectl -n alx-ns exec -it alluxio-cluster-coordinator-0 -- alluxio mount list 2>/dev/null
Listing all mount points
s3://my-bucket/path/to/mount  on  /s3/ properties={s3a.secretKey=xxx, alluxio.underfs.s3.region=us-east-1, s3a.accessKeyId=xxx}
```
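The mount table check can also be scripted. The following is a minimal sketch that looks for a mount path in the output of `alluxio mount list`, assuming the line format shown in the example output above (`<ufs path>  on  <mount path>/ properties=...`); the helper name `is_mounted` is hypothetical.

```shell
# Hypothetical helper: succeed if the Alluxio mount table lists the given mount path.
# Usage: is_mounted <namespace> <coordinator-pod> <mount-path>
is_mounted() {
  local ns="$1"
  local pod="$2"
  local mount_path="$3"
  # Same command as above, without -it so it also works from non-interactive scripts.
  # The mount path appears with a trailing slash, so normalize the argument first.
  kubectl -n "$ns" exec "$pod" -- alluxio mount list 2>/dev/null |
    awk -v p="${mount_path%/}/" '$2 == "on" && $3 == p { found = 1 } END { exit !found }'
}
```

With the names used on this page, `is_mounted alx-ns alluxio-cluster-coordinator-0 /s3` returns a zero exit status once the S3 mount is visible.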

## Monitoring

The Alluxio cluster enables monitoring by default, and you can view Alluxio metrics in Grafana. For details, see the Kubernetes Operator section of [Monitoring and Metrics](/ee-ai-en/ai-3.6/start/monitoring-and-metrics.md#kubernetes-operator).

## Data Access Acceleration

In the steps above, you deployed the Alluxio cluster and mounted the under file system to Alluxio. Training tasks that read data through Alluxio can achieve higher training speed and GPU utilization. Alluxio provides three main ways for applications to access data:

* **FUSE based POSIX API:** Please refer to [FUSE based POSIX API](/ee-ai-en/ai-3.6/data-access/fuse-based-posix-api.md).
* **S3 API:** Please refer to [S3 API](/ee-ai-en/ai-3.6/data-access/s3-api.md).
* **FSSpec Python API:** Please refer to [FSSpec Python API](/ee-ai-en/ai-3.6/data-access/fsspec.md).

## FAQ

### etcd pod stuck in pending status

If the etcd pods remain in the `Pending` state, as the three pods in the example below do, use `kubectl describe pod` and `kubectl describe pvc` to find the cause:

```console
# Check the status of the pods
$ kubectl -n alx-ns get pod

NAME                                          READY   STATUS     RESTARTS   AGE
alluxio-cluster-coordinator-0                 0/1     Init:1/2   0          73s
alluxio-cluster-etcd-0                        0/1     Pending    0          73s
alluxio-cluster-etcd-1                        0/1     Pending    0          73s
alluxio-cluster-etcd-2                        0/1     Pending    0          73s
alluxio-cluster-grafana-79db8c7dd9-lsq2l      1/1     Running    0          73s
alluxio-cluster-prometheus-7c6cbc4b4c-9nk25   1/1     Running    0          73s
alluxio-cluster-worker-8c79d5fd4-2c994        0/1     Init:1/2   0          73s
alluxio-cluster-worker-8c79d5fd4-jrchj        0/1     Init:1/2   0          73s

# Check detailed information about the etcd pod
$ kubectl -n alx-ns describe pod alluxio-cluster-etcd-0

Events:
  Type     Reason            Age    From               Message
  ----     ------            ----   ----               -------
  Warning  FailedScheduling  3m57s  default-scheduler  0/3 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling., .

# Check the PVC status in the cluster
# If the etcd PVCs are stuck in the Pending state, investigate further
# (the alluxio-cluster-fuse PVC being in the Pending state is normal).
$ kubectl -n alx-ns get pvc

NAME                          STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
alluxio-cluster-fuse          Pending                                      alx-ns-alluxio-cluster-fuse   5m31s
data-alluxio-cluster-etcd-0   Pending                                                                    3h41m
data-alluxio-cluster-etcd-1   Pending                                                                    3h41m
data-alluxio-cluster-etcd-2   Pending                                                                    3h41m

# Check the PVC description
$ kubectl -n alx-ns describe pvc data-alluxio-cluster-etcd-0

Events:
  Type    Reason         Age                      From                         Message
  ----    ------         ----                     ----                         -------
  Normal  FailedBinding  4m16s (x889 over 3h44m)  persistentvolume-controller  no persistent volumes available for this claim and no storage class is set
```

Based on the error message, the etcd pods are stuck in the Pending state because no storage class is set. You can resolve this issue by specifying the storage class for etcd in the `alluxio-operator/alluxio-cluster.yaml` file:

```yaml
  etcd:
    persistence:
      storageClass: <STORAGE_CLASS>
      size: 
```
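To choose a value for `<STORAGE_CLASS>`, it helps to list the storage classes available in the cluster together with their provisioners. The following is a minimal sketch that parses the table output of `kubectl get storageclass`; the helper name `list_storage_classes` is hypothetical.

```shell
# Hypothetical helper: print "<name> <provisioner>" per storage class and
# mark the cluster default.
list_storage_classes() {
  # In the table output, the default class is shown as "<name> (default)",
  # which shifts the provisioner from column 2 to column 3.
  kubectl get storageclass --no-headers 2>/dev/null |
    awk '{ if ($2 == "(default)") print $1, $3, "[default]"; else print $1, $2 }'
}
```

A class whose provisioner is `kubernetes.io/no-provisioner` does not support dynamic provisioning, so PVCs that use it stay pending until an administrator creates the volumes manually.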

Then delete the Alluxio cluster and the etcd PVCs, and recreate the Alluxio cluster so that the new storage class takes effect:

```console
# Delete the Alluxio cluster
$ kubectl delete -f alluxio-operator/alluxio-cluster.yaml

# Delete the etcd PVC
$ kubectl -n alx-ns delete pvc data-alluxio-cluster-etcd-0
$ kubectl -n alx-ns delete pvc data-alluxio-cluster-etcd-1
$ kubectl -n alx-ns delete pvc data-alluxio-cluster-etcd-2

# Recreate the Alluxio cluster
$ kubectl create -f alluxio-operator/alluxio-cluster.yaml
```

A related issue occurs when the etcd PVC specifies a storage class, but both the etcd pod and the PVC remain in the `Pending` state. In the PVC details below, for example, the storage class specified for the etcd PVC does not support dynamic provisioning, so the cluster administrator must create the storage volume manually.

```console
# Check the PVC description
$ kubectl -n alx-ns describe pvc data-alluxio-cluster-etcd-0

Events:
  Type    Reason                Age               From                         Message
  ----    ------                ----              ----                         -------
  Normal  WaitForFirstConsumer  25s               persistentvolume-controller  waiting for first consumer to be created before binding
  Normal  ExternalProvisioning  8s (x3 over 25s)  persistentvolume-controller  Waiting for a volume to be created either by the external provisioner 'none' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
```

For similar issues where etcd pods remain in the Pending state, you can use the above method for troubleshooting.

### alluxio-cluster-fuse PVC in pending status

After creating the cluster, you might notice that `alluxio-cluster-fuse` is in the `Pending` status. This is normal. The PVC will automatically bind to a PV and its status will change to `Bound` when it is used by a client pod.

