# Kubernetes Installation

This documentation shows how to install Alluxio on Kubernetes via [Operator](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/).

## Overview

### Artifacts

You will receive download links for three artifacts, plus a license:

| Artifact       | Filename                                                  | Purpose                                                 |
| -------------- | --------------------------------------------------------- | ------------------------------------------------------- |
| Helm chart     | `alluxio-operator-3.5.2-helmchart.tgz`                    | Deploys the Operator onto Kubernetes                    |
| Operator image | `alluxio-operator-3.5.2-linux-amd64-docker.tar`           | Container image for the Operator pod                    |
| Alluxio image  | `alluxio-enterprise-AI-3.8-15.1.2-linux-amd64-docker.tar` | Container image for Alluxio worker and coordinator pods |
| License        |                                                           | Required to activate the cluster                        |

> **Version**: The version strings in these filenames are examples. The download link you receive will contain the exact version your sales representative provisioned. **Platform**: Use `-linux-amd64-docker.tar` for x86 nodes or `-linux-arm64-docker.tar` for ARM nodes.

The Docker images are **never pulled from a public registry** — they are loaded from the `.tar` files and pushed to your private registry before deployment.

```
Helm chart (.tgz)
  └─► deploys ─► Operator pod  (alluxio-operator image)
                    └─► watches AlluxioCluster CRD
                           └─► creates ─► Alluxio pods  (alluxio-enterprise image)
```

### Kubernetes Components

A deployed Alluxio cluster consists of:

* **Operator** — Manages the lifecycle of Alluxio clusters. Installed once per Kubernetes cluster.
* **Coordinator** — Handles background operations (data loading, freeing). 1 replica.
* **Workers** — Cache data and serve reads via S3 API or FUSE. Scale horizontally for more cache capacity.
* **ETCD** — Service discovery and mount table storage. 3 replicas recommended for quorum.
* **Monitoring** (optional) — Prometheus and Grafana. Enabled by default but can be disabled.

## Before You Start

Run these checks before starting (\~2 minutes). Skipping this step is the most common cause of deployment failures.

* [ ] **Docker** is installed on the machine running these steps — required to load and push images in Step 0
* [ ] **Kubernetes 1.24 or higher** (the default StatefulSet worker type requires `rollingUpdate.maxUnavailable`, which was added in 1.24):

  ```shell
  kubectl version
  ```
* [ ] **Helm 3.10.0 or higher:**

  ```shell
  helm version
  ```
* [ ] **RBAC permissions** — your kubeconfig must be able to create CRDs, ServiceAccounts, ClusterRoles, and ClusterRoleBindings. A quick spot-check:
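
  ```shell
  # Cluster-scoped resources the operator install will create
  kubectl auth can-i create customresourcedefinitions
  kubectl auth can-i create clusterrolebindings
  ```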
* [ ] **kubectl** is configured and pointing to the correct cluster
* [ ] **Nodes available** for each Alluxio component:
  * Coordinator: at least 1 node
  * Workers: at least 1 node
  * ETCD: at least 1 node for testing, 3 nodes for quorum in production
* [ ] **StorageClass** exists and supports dynamic provisioning (required for ETCD and coordinator metastore PVCs):

  ```shell
  kubectl get storageclass
  ```

  Platform differences may affect StorageClass and FUSE mount behavior. See [Appendix E: Platform-Specific Notes](#e.-platform-specific-notes) for EKS, GKE, and kind requirements.
* [ ] **Worker cache storage planned** — the [page store](https://documentation.alluxio.io/ee-ai-en/start/pages/iRTxT4smG58AwFmCihOx#id-5.-worker-storage-the-page-store) defaults to `hostPath: /mnt/alluxio/pagestore` on the node's filesystem. For multi-disk nodes or persistent cache across pod restarts, review [Worker Page Store](#worker-page-store) in Production Recommendations before deployment.
* [ ] **Network policies** allow the required ports between Alluxio pods and between application pods and workers. See [Prerequisites → Networking](/ee-ai-en/start/prerequisites.md#networking) for the full port list.
* [ ] **All three artifacts downloaded** from the email link: Helm chart `.tgz`, operator image `.tar`, Alluxio image `.tar`
* [ ] **Private registry** is available and your local Docker is authenticated to push to it:
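
  ```shell
  docker login <PRIVATE_REGISTRY>
  ```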
* [ ] **Alluxio license string** is available
* [ ] **S3 bucket + credentials (if using S3)** are available

For resource sizing (CPU, RAM, cache disk per component), see [Prerequisites → Resource Sizing](/ee-ai-en/start/prerequisites.md#resource-sizing).

{% hint style="warning" %}
**EKS 1.23+**: The in-tree EBS driver was removed. Install the [AWS EBS CSI driver](https://docs.aws.amazon.com/eks/latest/userguide/ebs-csi.html) add-on **before** deploying Alluxio. Without it, etcd PVCs remain `Pending` with no clear error message.
{% endhint %}

## Installation Steps

### 0. Push Alluxio Images to Your Private Registry

> **Skip this step** if the Alluxio images are already present in your private registry.

Alluxio images are delivered as `.tar` files and must be loaded and pushed to your private registry before the Helm chart can deploy them.

**Load the images into your local Docker:**

```shell
docker load -i alluxio-operator-3.5.2-linux-amd64-docker.tar
docker load -i alluxio-enterprise-AI-3.8-15.1.2-linux-amd64-docker.tar
```

**✅ Success:** `docker images` shows both images:

```console
REPOSITORY                    TAG
alluxio/operator              3.5.2
alluxio/alluxio-enterprise    AI-3.8-15.1.2
```

**Retag and push to your private registry:**

```shell
docker tag alluxio/operator:3.5.2 <PRIVATE_REGISTRY>/alluxio-operator:3.5.2
docker tag alluxio/alluxio-enterprise:AI-3.8-15.1.2 <PRIVATE_REGISTRY>/alluxio-enterprise:AI-3.8-15.1.2

docker push <PRIVATE_REGISTRY>/alluxio-operator:3.5.2
docker push <PRIVATE_REGISTRY>/alluxio-enterprise:AI-3.8-15.1.2
```

**✅ Success:** Both images are now in your private registry. Note down the full image paths — you will use them in Steps 1 and 4.

> If your Kubernetes cluster cannot reach public registries (air-gapped), the third-party images (etcd, CSI) must also be relocated. See [Appendix A: Air-Gapped Deployment](#a.-air-gapped-deployment).

### 1. Prepare Helm Chart

Extract the Helm chart:

```shell
tar zxf alluxio-operator-3.5.2-helmchart.tgz
# creates ./alluxio-operator/
```

Create `alluxio-operator.yaml` (outside the chart directory) to specify the operator image from your private registry:

```yaml
global:
  image: <PRIVATE_REGISTRY>/alluxio-operator
  imageTag: 3.5.2
```

### 2. Create Namespace

```shell
kubectl create namespace alx-ns
```

### 3. Deploy Operator

```shell
cd alluxio-operator
helm -n alluxio-operator install operator -f ../alluxio-operator.yaml --create-namespace .
```

**✅ Success:** Helm prints `STATUS: deployed` immediately after the command completes:

```console
NAME: operator
STATUS: deployed
REVISION: 1
```

Then verify all pods are running:

```shell
kubectl -n alluxio-operator get pod
```

**✅ Success:** All operator pods show `READY 1/1` or `2/2`, `STATUS = Running`, and `RESTARTS = 0`.

Example output:

```console
NAME                                              READY   STATUS    RESTARTS   AGE
alluxio-cluster-controller-5db4f967f5-hg4v4       1/1     Running   0          30s
alluxio-console-5c7ff88b88-s4x4c                  1/1     Running   0          30s
alluxio-csi-controller-77d8d7bd56-27tqh           2/2     Running   0          30s
alluxio-csi-nodeplugin-2pzrq                      2/2     Running   0          31s
alluxio-csi-nodeplugin-94mjw                      2/2     Running   0          31s
alluxio-csi-nodeplugin-tgs86                      2/2     Running   0          31s
alluxio-doctor-controller-64847d459c-pczft        1/1     Running   0          30s
alluxio-license-controller-6c76b59f9b-lfkmz       1/1     Running   0          30s
alluxio-ufs-controller-5c8bdf48c9-tnw48           1/1     Running   0          30s
```

> If pods fail with image pull errors on etcd or CSI images, see [Appendix A: Air-Gapped Deployment](#a.-air-gapped-deployment).

### 4. Deploy Cluster

{% hint style="warning" %}
**Configure the hash ring before deploying.** The hash ring determines how data is distributed across workers. Changing it after deployment is a destructive operation — all cached data will be lost. If you have heterogeneous workers or specific capacity requirements, review [Hash Ring — Pre-Deployment Configuration](https://documentation.alluxio.io/ee-ai-en/start/pages/paqXm2SRU7Uo5W1x23MT#id-1.-pre-deployment-configuration) before proceeding.
{% endhint %}

Create a minimal `alluxio-cluster.yaml`:

```yaml
apiVersion: k8s-operator.alluxio.com/v1
kind: AlluxioCluster
metadata:
  name: alluxio-cluster
  namespace: alx-ns
spec:
  image: <PRIVATE_REGISTRY>/alluxio-enterprise
  imageTag: AI-3.8-15.1.2
  properties:
    alluxio.license: <YOUR_LICENSE>   # cluster license — if you have a deployment license, omit this and see Appendix C
  worker:
    count: 1
    pagestore:
      # Defaults to hostPath: /mnt/alluxio/pagestore on the node's filesystem.
      # For multi-disk nodes or PVC, see Production Recommendations → Worker Page Store.
      size: 90Gi       # Cache capacity per worker. Must not exceed available disk on the worker node.
      reservedSize: 9Gi
  etcd:
    replicaCount: 3
```

Deploy:

```shell
kubectl apply -f alluxio-cluster.yaml
```

**✅ Success:** Startup typically takes **2–3 minutes**. The first deployment may take longer if the Alluxio image (\~1.8 GB) needs to be pulled from the registry. To watch progress in real time:

```shell
kubectl -n alx-ns get pod --watch
```

Once all pods are running, `kubectl -n alx-ns get alluxiocluster` shows `CLUSTERPHASE = Ready`:

```console
NAME              CLUSTERPHASE   AGE
alluxio-cluster   Ready          2m18s
```

and `kubectl -n alx-ns get pod` shows all pods `READY` with `STATUS = Running`:

```console
NAME                                          READY   STATUS    RESTARTS   AGE
alluxio-cluster-coordinator-0                 1/1     Running   0          2m3s
alluxio-cluster-etcd-0                        1/1     Running   0          2m3s
alluxio-cluster-etcd-1                        1/1     Running   0          2m3s
alluxio-cluster-etcd-2                        1/1     Running   0          2m3s
alluxio-cluster-worker-0                      1/1     Running   0          2m3s
alluxio-cluster-grafana-585d767c84-p8wcp      1/1     Running   0          2m3s
alluxio-cluster-prometheus-6f697b6db8-sbvvg   1/1     Running   0          2m3s
```

> If any component fails to start, see [Appendix D: Troubleshooting](#d.-troubleshooting).

To access the Grafana dashboard and import the Alluxio dashboard template, see [Monitoring](/ee-ai-en/administration/monitoring-alluxio.md).

### 5. Mount Storage

Create `ufs.yaml` (S3 example; for other storage systems, see [Underlying Storage](/ee-ai-en/ufs.md)):

```yaml
apiVersion: k8s-operator.alluxio.com/v1
kind: UnderFileSystem
metadata:
  name: alluxio-s3
  namespace: alx-ns
spec:
  alluxioCluster: alluxio-cluster
  path: s3://<S3_BUCKET>/<S3_DIRECTORY>
  mountPath: /s3
  mountOptions:
    s3a.accessKeyId: <S3_ACCESS_KEY_ID>
    s3a.secretKey: <S3_SECRET_KEY>
    alluxio.underfs.s3.region: <S3_REGION>
```

Apply:

```shell
kubectl apply -f ufs.yaml
```

**✅ Success:** `kubectl -n alx-ns get ufs` shows `PHASE = Ready`.

Example output:

```console
NAME         PHASE   AGE
alluxio-s3   Ready   32s
```

### 6. Verify Cluster

```shell
kubectl -n alx-ns exec -i alluxio-cluster-coordinator-0 -- alluxio mount list
```

**✅ Success:** Output displays your mount point (e.g., `s3://my-bucket/... on /s3/`).

Example output:

```console
Defaulted container "alluxio-coordinator" out of: alluxio-coordinator, path-permission (init), wait-etcd (init)
Listing all mount points
s3://<S3_BUCKET>/  on  /s3/ properties={alluxio.underfs.s3.region=<S3_REGION>}
```

### 7. Verify Data Access

```shell
kubectl -n alx-ns exec -i alluxio-cluster-coordinator-0 -- alluxio fs ls /
```

**✅ Success:** Returns a directory listing without errors.

**Next: connect your application** to the cluster by picking an access method:

* [**POSIX API (FUSE)**](/ee-ai-en/data-access/fuse-based-posix-api.md#quick-start) — mount Alluxio as a standard filesystem.
* [**S3 API**](/ee-ai-en/data-access/s3-api.md#quick-start) — HTTP endpoint compatible with S3 clients.
* [**Python FSSpec API**](/ee-ai-en/data-access/fsspec.md) — for Python-based ML frameworks.

## Uninstall

To remove the Alluxio deployment from your cluster, run the following commands in order:

**1. Delete the UFS mount and cluster:**

```shell
kubectl delete -f ufs.yaml
kubectl delete -f alluxio-cluster.yaml
```

**2. Uninstall the operator:**

```shell
cd alluxio-operator
helm -n alluxio-operator uninstall operator
```

**3. Delete the namespaces:**

```shell
kubectl delete namespace alx-ns
kubectl delete namespace alluxio-operator
```

**✅ Success:** `kubectl get namespace` no longer shows `alx-ns` or `alluxio-operator`, and `kubectl get alluxiocluster -A` returns `No resources found`.

## Next Steps: Production Setup

The configuration above is suitable for evaluation. For production deployments (node pinning, resource tuning, persistent metastore, license, hash-ring pre-configuration, heterogeneous workers, etc.), see [Production Setup](https://documentation.alluxio.io/ee-ai-en/start/pages/Zi86aQPzWMrc2orY9tBY#id-1.-production-setup).

***

## Appendix

Use the table below to find the relevant appendix section for your scenario:

| Scenario                                                 | Sections                                                                                                             |
| -------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------- |
| **Air-gapped** (cluster cannot reach public registries)  | [A. Air-Gapped Deployment](#a.-air-gapped-deployment), [B.5. Image Pull Secrets](#b.5.-configure-image-pull-secrets) |
| **External or custom ETCD**                               | [B.3. External ETCD](#b.3.-use-external-etcd), [B.4. Customize ETCD](#b.4.-customize-etcd-configuration)             |
| **Production licensing**                                  | [C. License Management](#c.-license-management)                                                                      |
| **Something went wrong**                                  | [D. Troubleshooting](#d.-troubleshooting)                                                                            |
| **Platform-specific issues** (EKS, GKE, kind)             | [E. Platform-Specific Notes](#e.-platform-specific-notes)                                                            |

### A. Air-Gapped Deployment

#### Pre-flight: Verify Node-to-Registry Connectivity

Before deploying, verify that your worker nodes can pull images from your private registry. Catching this early avoids `ImagePullBackOff` failures mid-deployment.

SSH into each worker node and run:

```shell
crictl pull <PRIVATE_REGISTRY>/alluxio-enterprise:<TAG>
```

`crictl` is available on any node running containerd or CRI-O. If the pull fails, resolve registry connectivity or authentication on the node before proceeding.

#### Symptom: ImagePullBackOff After Deploy

**Symptom**: After deploying the operator or cluster, some pods are stuck in `ImagePullBackOff`. Your Kubernetes cluster cannot reach public registries to pull third-party component images (CSI, etcd, monitoring).

Alluxio images are already in your private registry from [Step 0: Push Alluxio Images to Your Private Registry](#0.-push-alluxio-images-to-your-private-registry). The remaining images to relocate depend on your operator version. Identify them by inspecting the stuck pods:

```shell
kubectl get pods -A | grep -E "ImagePullBackOff|ErrImagePull"
kubectl describe pod <pod-name> -n <namespace> | grep "image:"
```

For each image that cannot be pulled: pull it from a machine with public internet access, retag it for your private registry, and push it. Then update `alluxio-operator.yaml` or `alluxio-cluster.yaml` to point to your private registry for that component.
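
A sketch of that loop for one image, using this guide's placeholder convention (the image path comes from the `kubectl describe` output above):

```shell
docker pull <PUBLIC_REGISTRY>/<IMAGE>:<TAG>        # run on a machine with internet access
docker tag <PUBLIC_REGISTRY>/<IMAGE>:<TAG> <PRIVATE_REGISTRY>/<IMAGE>:<TAG>
docker push <PRIVATE_REGISTRY>/<IMAGE>:<TAG>
```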

The CSI images (part of the operator) can be overridden in `alluxio-operator.yaml`:

```yaml
alluxio-csi:
  controllerPlugin:
    provisioner:
      image: <PRIVATE_REGISTRY>/csi-provisioner:<TAG>
  nodePlugin:
    driverRegistrar:
      image: <PRIVATE_REGISTRY>/csi-node-driver-registrar:<TAG>
```

The etcd image (part of the cluster) can be overridden in `alluxio-cluster.yaml`:

```yaml
spec:
  etcd:
    image:
      registry: <PRIVATE_REGISTRY>
      repository: <PRIVATE_REPOSITORY>/etcd
      tag: <TAG>
```

### B. Advanced Configuration

This section describes common configurations to adapt to different scenarios.

#### B.1. Configuring Alluxio Properties

To modify Alluxio's configuration, edit the `.spec.properties` field in the `alluxio-cluster.yaml` file. These properties are appended to the `alluxio-site.properties` file inside the Alluxio pods.
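
For example, the shape looks like the following, where the placeholders stand for any valid `alluxio-site.properties` entry:

```yaml
apiVersion: k8s-operator.alluxio.com/v1
kind: AlluxioCluster
spec:
  properties:
    <PROPERTY_KEY>: <PROPERTY_VALUE>   # appended to alluxio-site.properties in each pod
```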

#### B.2. Mount Custom ConfigMaps or Secrets

You can mount custom `ConfigMap` or `Secret` files into your Alluxio pods. This is useful for providing configuration files like `core-site.xml` or credentials.

**Example: Mount a Secret**

1. Create the secret from a local file:

   ```shell
   kubectl -n alx-ns create secret generic my-secret --from-file=/path/to/my-file
   ```
2. Specify the secret to load and the mount path in `alluxio-cluster.yaml`:

   ```yaml
   apiVersion: k8s-operator.alluxio.com/v1
   kind: AlluxioCluster
   spec:
     secrets:
       worker:
         my-secret: /opt/alluxio/secret
       coordinator:
         my-secret: /opt/alluxio/secret
   ```

   The file `my-file` will be available at `/opt/alluxio/secret/my-file` on the pods.

#### B.3. Use External ETCD

If you have an external ETCD cluster, you can configure Alluxio to use it instead of the one deployed by the operator.

```yaml
apiVersion: k8s-operator.alluxio.com/v1
kind: AlluxioCluster
spec:
  etcd:
    enabled: false
  properties:
    alluxio.etcd.endpoints: http://external-etcd:2379
    # If using TLS for ETCD, add the following:
    # alluxio.etcd.tls.enabled: "true"
```

#### B.4. Customize ETCD Configuration

The fields under `spec.etcd` follow the [Bitnami ETCD helm chart](https://github.com/bitnami/charts/blob/main/bitnami/etcd/values.yaml). For example, to set node affinity for etcd pods, the `affinity` field can be used as described in the [Kubernetes documentation](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#node-affinity).

```yaml
apiVersion: k8s-operator.alluxio.com/v1
kind: AlluxioCluster
spec:
  etcd:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values:
              - antarctica-east1
              - antarctica-west1
```

#### B.5. Configure Image Pull Secrets

If your container images are stored in a private registry that requires authentication, you need to create a Kubernetes Secret to store your registry credentials.

This secret must be created in the namespace where you plan to install Alluxio.

```shell
kubectl create secret docker-registry <SECRET_NAME> \
  --docker-server=<REGISTRY_SERVER> \
  --docker-username=<USERNAME> \
  --docker-password=<PASSWORD> \
  -n <NAMESPACE>
```
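
To have the Alluxio pods use the secret, reference it in `alluxio-cluster.yaml`. The field below follows the common Kubernetes `imagePullSecrets` convention — treat it as an assumption and confirm the exact key against your operator version:

```yaml
spec:
  imagePullSecrets:        # assumed field name — verify against your operator's CRD
    - <SECRET_NAME>
```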

### C. License Management

Alluxio requires a license provided by your Alluxio sales representative. There are two license types:

* **Cluster license** — scoped to a single cluster. Set inline via the `alluxio.license` property in `alluxio-cluster.yaml`. Suitable for both evaluation and single-cluster production deployments.
* **Deployment license** — covers multiple clusters, where each cluster has its own independent capacity constraints within the license. Applied as a separate `License` Kubernetes resource. Use this when managing more than one Alluxio cluster.

#### C.1. Cluster License

Add the license string directly to `alluxio-cluster.yaml`:

```yaml
apiVersion: k8s-operator.alluxio.com/v1
kind: AlluxioCluster
spec:
  properties:
    alluxio.license: <YOUR_CLUSTER_LICENSE>
```

#### C.2. Deployment License

A deployment license is applied as a separate `License` Kubernetes resource and can cover multiple clusters. Each cluster's cache capacity is measured independently against its own constraints defined within the deployment license — it is not a shared pool.

**Step 1: Deploy the cluster without a license**

Follow Step 4 (Deploy Cluster) of the main guide, but omit the `alluxio.license` property from `alluxio-cluster.yaml`. The pods start but remain in `Init` state until the license is applied.

**Step 2: Apply the license**

Create `alluxio-license.yaml`. The `name` and `namespace` in the `clusters` list must match the `AlluxioCluster` metadata.

```yaml
apiVersion: k8s-operator.alluxio.com/v1
kind: License
metadata:
  name: alluxio-license
  namespace: alx-ns
spec:
  clusters:
  - name: alluxio-cluster
    namespace: alx-ns
  licenseString: <YOUR_DEPLOYMENT_LICENSE>
```

```shell
kubectl create -f alluxio-license.yaml
```

The pods detect the license and transition to `Running`.
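
You can watch the transition with:

```shell
kubectl -n alx-ns get pod --watch
```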

{% hint style="warning" %}
Only list clusters that are already running in the `clusters` field. If the operator cannot find a listed cluster, the license operation fails for all clusters in the list.
{% endhint %}

#### C.3. Updating a Deployment License

To update an existing deployment license, update the `licenseString` in your `alluxio-license.yaml`, then delete and recreate the resource:

```shell
kubectl delete -f alluxio-license.yaml
kubectl create -f alluxio-license.yaml
```

#### C.4. Checking License Status

You can check the license details and utilization from within the Alluxio coordinator pod.

```shell
# Get a shell into the coordinator pod
kubectl -n alx-ns exec -it alluxio-cluster-coordinator-0 -- /bin/bash

# View license details (expiration, capacity)
alluxio license show

# View current utilization
alluxio license status
```

### D. Troubleshooting

#### D.1. etcd pods stuck in Pending due to storage issues

If `etcd` pods are `Pending`, it is often due to storage issues. Use `kubectl describe pod <etcd-pod-name>` to check events.
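
For example:

```shell
kubectl -n alx-ns describe pod alluxio-cluster-etcd-0
# Check the "Events:" section at the end of the output
```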

**Symptom**: Event message shows `pod has unbound immediate PersistentVolumeClaims`.

**Cause**: No `storageClass` is set for the PVC, or no PV is available.

**Solution**: Specify a `storageClass` in `alluxio-cluster.yaml`:

```yaml
spec:
  etcd:
    persistence:
      storageClass: <YOUR_STORAGE_CLASS>
      size: 10Gi # Example size
```

Then, delete the old cluster and PVCs before recreating the cluster.
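
A sketch of that sequence (the PVC label selector is an assumption — confirm the actual names with `kubectl -n alx-ns get pvc` first):

```shell
kubectl delete -f alluxio-cluster.yaml
kubectl -n alx-ns delete pvc -l app.kubernetes.io/name=etcd   # assumed label; or delete the listed PVCs by name
kubectl apply -f alluxio-cluster.yaml
```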

**Symptom**: Event message shows `waiting for first consumer`.

**Cause**: The `storageClass` does not support dynamic provisioning, and a volume must be manually created by an administrator.

**Solution**: Either use a dynamic provisioner or manually create a PersistentVolume that satisfies the claim.

#### D.2. etcd pods stuck in Pending due to anti-affinity (fewer than 3 nodes)

**Symptom**: etcd pods are `Pending` with event message `0/N nodes are available: N node(s) didn't match pod anti-affinity rules`.

**Cause**: The operator deploys etcd with `requiredDuringSchedulingIgnoredDuringExecution` anti-affinity by hostname. With `etcd.replicaCount: 3` (the default), Kubernetes requires 3 distinct nodes. If your cluster has fewer than 3 nodes, etcd pods cannot be scheduled.

**Solution**: For dev/test clusters with fewer than 3 nodes, reduce the replica count:

```yaml
spec:
  etcd:
    replicaCount: 1   # For single-node dev/test only; use 3 for production
```

> Do not use `replicaCount: 1` in production — a single etcd instance has no quorum and is not fault-tolerant.

#### D.3. alluxio-cluster-fuse PVC in pending status

The `alluxio-cluster-fuse` PVC remaining in a `Pending` state is normal. It will automatically bind to a volume and become `Bound` once a client application pod starts using it.

#### D.4. Worker pod stuck in CrashLoopBackOff

**Symptom**: Worker pod repeatedly crashes and restarts.

Start by checking the worker logs:

```shell
kubectl -n alx-ns logs <worker-pod-name> --previous
```

Common causes include:

* **Pagestore quota exceeds disk space** — log shows `quota (NNN) exceeds the total disk space`. This commonly occurs because cloud providers advertise disk size in GB (base-10) while Kubernetes interprets `Gi` as GiB (base-2), so a "100 GB" disk offers only about 93 GiB. Fix: reduce `pagestore.size` to \~90% of actual available space (`df -h /mnt/alluxio`) and `reservedSize` to \~10% of `size`, as sketched after this list.
* **License expired or invalid** — log shows a license error. Fix: apply a new license. See [Appendix C: License Management](#c.-license-management).
* **OOM killed** — log shows `Exit Code 137` or `OutOfMemoryError`. Fix: increase container memory limits and adjust `-Xmx` / `-XX:MaxDirectMemorySize`. See [Worker Configuration — Resource and JVM Tuning](https://documentation.alluxio.io/ee-ai-en/start/pages/MF8PfNU0P7q8PD9eh6xW#id-2.-resource-and-jvm-tuning).
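
For the pagestore case, the corresponding adjustment in `alluxio-cluster.yaml` reuses the fields from Step 4 (values are illustrative — derive them from `df -h` on your worker nodes):

```yaml
spec:
  worker:
    pagestore:
      size: 90Gi        # ~90% of the disk's actual usable capacity
      reservedSize: 9Gi # ~10% of size
```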

### E. Platform-Specific Notes

The main installation steps are platform-agnostic. This section documents known differences for specific Kubernetes environments.

#### Amazon EKS

**EBS CSI driver required for EKS 1.23+**: The in-tree EBS volume driver was removed in Kubernetes 1.23. On EKS 1.23+, install the [AWS EBS CSI driver](https://docs.aws.amazon.com/eks/latest/userguide/ebs-csi.html) add-on, or PVC provisioning will silently fail even if a StorageClass is listed.

Verify:

```shell
kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-ebs-csi-driver --no-headers | head -1
```

If no output, the driver is not installed.
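
If it is missing, one way to install it is via the managed EKS add-on (IAM prerequisites apply; see the linked AWS guide):

```shell
aws eks create-addon --cluster-name <CLUSTER_NAME> --addon-name aws-ebs-csi-driver
```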

#### Google GKE

**Read-only `/mnt` path**: GKE nodes have a read-only root filesystem. Multiple Alluxio components default to hostPaths under `/mnt/alluxio/`, causing worker pods to fail with:

```console
MountVolume.SetUp failed for volume "pagestore-0" : mkdir /mnt/alluxio: read-only file system
MountVolume.SetUp failed for volume "logs" : mkdir /mnt/alluxio: read-only file system
MountVolume.SetUp failed for volume "metastore" : mkdir /mnt/alluxio: read-only file system
```

**Workaround**: Redirect all hostPaths to a writable base directory (e.g., `/home/alluxio/`) in `alluxio-cluster.yaml`:

```yaml
spec:
  coordinator:
    metastore:
      hostPath: /home/alluxio/metastore
    log:
      hostPath: /home/alluxio/logs
  worker:
    pagestore:
      hostPath: /home/alluxio/pagestore
    metastore:
      hostPath: /home/alluxio/metastore
    log:
      hostPath: /home/alluxio/logs
  fuse:
    log:
      hostPath: /home/alluxio/logs/fuse
    hostPathForMigration: /home/alluxio/migration
  gateway:
    log:
      hostPath: /home/alluxio/logs/gateway
  prometheus:
    persistence:
      hostPath: /home/alluxio/prometheus
```

Additionally, configure worker identity persistence to prevent workers from registering as new instances after each restart (leaving stale OFFLINE workers behind). For the full explanation of why this matters and the impact on the hash ring, see [Restarting a Worker](/ee-ai-en/administration/managing-ring.md#restarting-a-worker).

```yaml
spec:
  worker:
    useExternalId: false
    systemInfo:
      hostPath: /home/alluxio/system-info
```

#### kind (Local Development)

**Image loading**: `kind load docker-image` can fail for multi-platform images with digest errors. Use the following workaround:

```shell
docker save <IMAGE> | docker exec -i <KIND_CONTAINER> ctr --namespace=k8s.io images import --snapshotter=overlayfs -
```

To find the kind container name: `docker ps | grep kindest`.

## Related Documentation

* [How Alluxio Works](/ee-ai-en/how-alluxio-works.md) — Architecture overview, consistent hashing, and failover behavior
* [Cluster Management](/ee-ai-en/administration/managing-alluxio.md) — Post-deployment operations: scaling, hash ring tuning, worker lifecycle, and UFS mount management
* [Monitoring](/ee-ai-en/administration/monitoring-alluxio.md) — Grafana access, dashboard import, alert rules, and Datadog integration


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://documentation.alluxio.io/ee-ai-en/start/installing-on-kubernetes.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
