Kubernetes Installation

This documentation shows how to install Alluxio on Kubernetes via the Operator.

Overview

Artifacts

You will receive download links for three artifacts, along with a license:

Artifact        Filename                                                  Purpose
Helm chart      alluxio-operator-3.5.2-helmchart.tgz                      Deploys the Operator onto Kubernetes
Operator image  alluxio-operator-3.5.2-linux-amd64-docker.tar             Container image for the Operator pod
Alluxio image   alluxio-enterprise-AI-3.8-15.1.2-linux-amd64-docker.tar   Container image for Alluxio worker and coordinator pods
License         -                                                         Required to activate the cluster

Version: The version strings in these filenames are examples. The download link you receive will contain the exact version your sales representative provisioned.

Platform: Use -linux-amd64-docker.tar for x86 nodes or -linux-arm64-docker.tar for ARM nodes.

The Docker images are never pulled from a public registry — they are loaded from the .tar files and pushed to your private registry before deployment.

Helm chart (.tgz)
  └─► deploys ─► Operator pod  (alluxio-operator image)
                    └─► watches AlluxioCluster CRD
                           └─► creates ─► Alluxio pods  (alluxio-enterprise image)

Kubernetes Components

A deployed Alluxio cluster consists of:

  • Operator — Manages the lifecycle of Alluxio clusters. Installed once per Kubernetes cluster.

  • Coordinator — Handles background operations (data loading, freeing). 1 replica.

  • Workers — Cache data and serve reads via S3 API or FUSE. Scale horizontally for more cache capacity.

  • ETCD — Service discovery and mount table storage. 3 replicas recommended for quorum.

  • Monitoring (optional) — Prometheus and Grafana. Enabled by default.

Before You Start

Run these checks before starting (~2 minutes). Skipping this step is the most common cause of deployment failures.

For resource sizing (CPU, RAM, cache disk per component), see Prerequisites → Resource Sizing.


Installation Steps

0. Push Alluxio Images to Your Private Registry

Skip this step if the Alluxio images are already present in your private registry.

Alluxio images are delivered as .tar files and must be loaded and pushed to your private registry before the Helm chart can deploy them.

Load the images into your local Docker:
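A minimal sketch of the load step, assuming Docker is installed on the machine that holds the downloads and that the filenames match the examples in the Artifacts table (yours will differ by version and platform). The load_images helper is a hypothetical wrapper, not part of any Alluxio tooling; it only checks that each archive exists before calling docker load.

```shell
# Example filenames from the Artifacts table; replace with the files you downloaded.
OPERATOR_TAR="alluxio-operator-3.5.2-linux-amd64-docker.tar"
ALLUXIO_TAR="alluxio-enterprise-AI-3.8-15.1.2-linux-amd64-docker.tar"

# load_images FILE...   (hypothetical helper)
# Loads each image archive into the local Docker daemon and stops at the first problem.
load_images() {
  for tarball in "$@"; do
    if [ ! -f "$tarball" ]; then
      echo "missing: $tarball" >&2
      return 1
    fi
    docker load -i "$tarball" || return 1
  done
}

# Usage:
#   load_images "$OPERATOR_TAR" "$ALLUXIO_TAR" && docker images
```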

✅ Success: docker images shows both images:

Retag and push to your private registry:
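The retag-and-push step could look like the sketch below. The registry path and the source image names are assumptions: use the repository and tag that docker images actually reports after loading, and your own registry address. target_ref and retag_and_push are hypothetical helpers.

```shell
# Hypothetical registry path; replace with your own.
REGISTRY="registry.example.com/alluxio"

# target_ref REGISTRY SOURCE_IMAGE
# Prints REGISTRY/<name>:<tag>, keeping only the last path component of the source.
target_ref() {
  printf '%s/%s\n' "$1" "${2##*/}"
}

# retag_and_push SOURCE_IMAGE
# Tags the locally loaded image for the private registry and pushes it.
retag_and_push() {
  dst="$(target_ref "$REGISTRY" "$1")"
  docker tag "$1" "$dst" && docker push "$dst" && echo "pushed: $dst"
}

# Usage (the source name is a placeholder; copy the real one from `docker images`):
#   retag_and_push "alluxio/alluxio-operator:3.5.2"
```

The "pushed:" lines give you the full image paths to note down.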

✅ Success: Both images are now in your private registry. Note down the full image paths — you will use them in Steps 1 and 4.

If your Kubernetes cluster cannot reach public registries (air-gapped), third-party images (etcd, CSI) must also be relocated to your private registry. See Appendix A: Air-Gapped Deployment.

1. Prepare Helm Chart

Extract the Helm chart:

Create alluxio-operator.yaml (outside the chart directory) to specify the operator image from your private registry:

2. Create Namespace

3. Deploy Operator

✅ Success: Helm prints STATUS: deployed immediately after the command completes:

Then verify all pods are running:

✅ Success: All operator pods show READY 1/1 or 2/2, STATUS = Running, and RESTARTS = 0.
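The success criteria above can be checked mechanically. The sketch below is a hypothetical filter, not an Alluxio tool: it reads the output of kubectl get pod --no-headers and fails if any pod is not fully ready, not Running, or has restarted. It assumes the operator pods live in the alluxio-operator namespace, which is the namespace named in the Uninstall section.

```shell
# pods_healthy   (hypothetical helper)
# Reads `kubectl get pod --no-headers` output on stdin.
# Columns: NAME READY STATUS RESTARTS AGE
pods_healthy() {
  awk '
    { split($2, ready, "/") }
    ready[1] != ready[2] || $3 != "Running" || $4 != "0" {
      print "not healthy: " $1
      bad = 1
    }
    END { exit (bad || NR == 0) }   # an empty pod list also counts as a failure
  '
}

# Usage:
#   kubectl -n alluxio-operator get pod --no-headers | pods_healthy && echo "operator is up"
```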

Example output:

If pods fail with image pull errors on etcd or CSI images, see Appendix A: Air-Gapped Deployment.

4. Deploy Cluster


Create a minimal alluxio-cluster.yaml:

Deploy:

✅ Success: Startup typically takes 2–3 minutes. The first deployment may take longer if the Alluxio image (~1.8 GB) needs to be pulled from the registry. To watch progress in real time:

Once all pods are running, kubectl -n alx-ns get alluxiocluster shows CLUSTERPHASE = Ready, and kubectl -n alx-ns get pod shows every pod as Ready with STATUS = Running.

If any component fails to start, see Appendix D: Troubleshooting.

To access the Grafana dashboard and import the Alluxio dashboard template, see Monitoring.

5. Mount Storage

Create ufs.yaml (S3 example; for other storage systems, see Underlying Storage):

Apply:

✅ Success: kubectl -n alx-ns get ufs shows PHASE = Ready.

Example output:

6. Verify Cluster

✅ Success: Output displays your mount point (e.g., s3://my-bucket/... on /s3/).

Example output:

7. Verify Data Access

✅ Success: Returns a directory listing without errors.

Next: connect your application to the cluster by picking an access method:

Uninstall

To remove the Alluxio deployment from your cluster, run the following commands in order:

1. Delete the UFS mount and cluster:

2. Uninstall the operator:

3. Delete the namespaces:

✅ Success: kubectl get namespace no longer shows alx-ns or alluxio-operator, and kubectl get alluxiocluster -A returns No resources found.

Next Steps: Production Setup

The configuration above is suitable for evaluation. For production deployments (node pinning, resource tuning, persistent metastore, license, hash-ring pre-configuration, heterogeneous workers, etc.), see Production Setup.


Appendix

Use the table below to find the relevant appendix section for your scenario:

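Scenario                                             Sections
Air-gapped (cluster cannot reach public registries)  A. Air-Gapped Deployment
External or custom ETCD                              B.3. Use External ETCD, B.4. Customize ETCD configuration
Production licensing                                 C. License Management
Something went wrong                                 D. Troubleshooting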

A. Air-Gapped Deployment

Pre-flight: Verify Node-to-Registry Connectivity

Before deploying, verify that your worker nodes can pull images from your private registry. Catching this early avoids ImagePullBackOff failures mid-deployment.

SSH into each worker node and run:
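A sketch of the per-node check, assuming crictl is on the node's PATH. The image path in the usage line is a placeholder for the one you pushed in Step 0, and preflight_pull is a hypothetical helper that reports every image it could not pull.

```shell
# Override CRICTL (for example CRICTL="sudo crictl") if crictl needs root on your nodes.
CRICTL="${CRICTL:-crictl}"

# preflight_pull IMAGE...   (hypothetical helper)
# Tries to pull each image through the node's container runtime.
# Returns non-zero if any pull fails.
preflight_pull() {
  failed=0
  for image in "$@"; do
    if $CRICTL pull "$image" >/dev/null; then
      echo "OK   $image"
    else
      echo "FAIL $image"
      failed=1
    fi
  done
  return "$failed"
}

# Usage (placeholder image path):
#   preflight_pull registry.example.com/alluxio/alluxio-operator:3.5.2
```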

crictl is available on any node running containerd or CRI-O. If the pull fails, resolve registry connectivity or authentication on the node before proceeding.

Symptom: ImagePullBackOff After Deploy

Symptom: After deploying the operator or cluster, some pods are stuck in ImagePullBackOff. Your Kubernetes cluster cannot reach public registries to pull third-party component images (CSI, etcd, monitoring).

Alluxio images are already in your private registry from Step 0 (Push Alluxio Images to Your Private Registry). The remaining images to relocate depend on your specific operator version. Identify them by inspecting the stuck pods:
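One way to collect the images that failed to pull is sketched below, using kubectl's JSONPath output. Both functions are hypothetical helpers, and the approach is only one option: running kubectl describe pod on each stuck pod gives the same information. The sketch covers regular containers only; init containers are reported under a separate status field.

```shell
# container_states NAMESPACE
# Prints "<image><TAB><waiting reason>" for every container in the namespace.
# The reason is empty for containers that are not waiting.
container_states() {
  kubectl -n "$1" get pods -o jsonpath='{range .items[*]}{range .status.containerStatuses[*]}{.image}{"\t"}{.state.waiting.reason}{"\n"}{end}{end}'
}

# unpullable_images
# Reads container_states output on stdin and prints each image that failed to pull, once.
unpullable_images() {
  awk -F '\t' '$2 == "ImagePullBackOff" || $2 == "ErrImagePull" { print $1 }' | sort -u
}

# Usage:
#   container_states alx-ns | unpullable_images
#   container_states alluxio-operator | unpullable_images
```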

For each image that cannot be pulled: pull it from a machine with public internet access, retag it for your private registry, and push it. Then update alluxio-operator.yaml or alluxio-cluster.yaml to point to your private registry for that component.

The CSI images (part of the operator) can be overridden in alluxio-operator.yaml:

The etcd image (part of the cluster) can be overridden in alluxio-cluster.yaml:

B. Advanced Configuration

This section describes common configurations to adapt to different scenarios.

B.1. Configuring Alluxio Properties

To modify Alluxio's configuration, edit the .spec.properties field in the alluxio-cluster.yaml file. These properties are appended to the alluxio-site.properties file inside the Alluxio pods.

B.2. Mount Custom ConfigMaps or Secrets

You can mount custom ConfigMap or Secret files into your Alluxio pods. This is useful for providing configuration files like core-site.xml or credentials.

Example: Mount a Secret

  1. Create the secret from a local file:

  2. Specify the secret to load and the mount path in alluxio-cluster.yaml:

    The file my-file will be available at /opt/alluxio/secret/my-file on the pods.
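Step 1 above can be done with kubectl create secret generic. The sketch below assumes the alx-ns namespace used elsewhere in this guide; the secret name my-secret is a placeholder and create_file_secret is a hypothetical wrapper. With --from-file, the key inside the secret defaults to the file's base name, which is why the mounted file appears as my-file.

```shell
# create_file_secret NAMESPACE SECRET_NAME FILE   (hypothetical helper)
# Creates a generic Secret whose single key is the base name of FILE.
create_file_secret() {
  if [ ! -f "$3" ]; then
    echo "no such file: $3" >&2
    return 1
  fi
  kubectl -n "$1" create secret generic "$2" --from-file="$3"
}

# Usage:
#   create_file_secret alx-ns my-secret /path/to/my-file
```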

B.3. Use External ETCD

If you have an external ETCD cluster, you can configure Alluxio to use it instead of the one deployed by the operator.

B.4. Customize ETCD configuration

The fields under spec.etcd follow the Bitnami ETCD Helm chart. For example, to set node affinity for etcd pods, use the affinity field as described in the Kubernetes documentation.

B.5. Configure Image Pull Secrets

If your container images are stored in a private registry that requires authentication, you need to create a Kubernetes Secret to store your registry credentials.

This secret must be created in the namespace where you plan to install Alluxio.
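A sketch of creating the registry credential with kubectl create secret docker-registry. The secret name, server, and credentials in the usage line are placeholders, and create_pull_secret is a hypothetical wrapper. How the secret is then referenced from alluxio-operator.yaml or alluxio-cluster.yaml is not shown here.

```shell
# create_pull_secret NAMESPACE SECRET_NAME SERVER USERNAME PASSWORD   (hypothetical helper)
# Note: a password passed as an argument is visible in the process list and shell history;
# prefer reading it from a protected variable or file.
create_pull_secret() {
  kubectl -n "$1" create secret docker-registry "$2" \
    --docker-server="$3" \
    --docker-username="$4" \
    --docker-password="$5"
}

# Usage (all values are placeholders):
#   create_pull_secret alx-ns my-registry-cred registry.example.com "$REG_USER" "$REG_PASS"
```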

C. License Management

Alluxio requires a license provided by your Alluxio sales representative. There are two license types:

  • Cluster license — scoped to a single cluster. Set inline via the alluxio.license property in alluxio-cluster.yaml. Suitable for both evaluation and single-cluster production deployments.

  • Deployment license — covers multiple clusters, where each cluster has its own independent capacity constraints within the license. Applied as a separate License Kubernetes resource. Use this when managing more than one Alluxio cluster.

C.1. Cluster License

Add the license string directly to alluxio-cluster.yaml:

C.2. Deployment License

A deployment license is applied as a separate License Kubernetes resource and can cover multiple clusters. Each cluster's cache capacity is measured independently against its own constraints defined within the deployment license — it is not a shared pool.

Step 1: Deploy the cluster without a license

Follow Step 4 (Deploy Cluster) of the main guide, but omit the alluxio.license property from alluxio-cluster.yaml. The pods start but remain in Init state until the license is applied.

Step 2: Apply the license

Create alluxio-license.yaml. The name and namespace in the clusters list must match the AlluxioCluster metadata.

The pods detect the license and transition to Running.


C.3. Updating a Deployment License

To update an existing deployment license, update the licenseString in your alluxio-license.yaml and re-apply it:

C.4. Checking License Status

You can check the license details and utilization from within the Alluxio coordinator pod.

D. Troubleshooting

D.1. etcd pod stuck in pending status

If etcd pods are Pending, it is often due to storage issues. Use kubectl describe pod <etcd-pod-name> to check events.

Symptom: Event message shows pod has unbound immediate PersistentVolumeClaims.

Cause: No storageClass is set for the PVC, or no PV is available.

Solution: Specify a storageClass in alluxio-cluster.yaml:

Then, delete the old cluster and PVCs before recreating the cluster.

Symptom: Event message shows waiting for first consumer.

Cause: The storageClass does not support dynamic provisioning, and a volume must be manually created by an administrator.

Solution: Either use a dynamic provisioner or manually create a PersistentVolume that satisfies the claim.

D.2. etcd pods stuck in Pending due to anti-affinity (fewer than 3 nodes)

Symptom: etcd pods are Pending with event message 0/N nodes are available: N node(s) didn't match pod anti-affinity rules.

Cause: The operator deploys etcd with requiredDuringSchedulingIgnoredDuringExecution anti-affinity by hostname. With etcd.replicaCount: 3 (the default), Kubernetes requires 3 distinct nodes. If your cluster has fewer than 3 nodes, etcd pods cannot be scheduled.

Solution: For dev/test clusters with fewer than 3 nodes, reduce the replica count:

Do not use replicaCount: 1 in production: a single etcd member cannot tolerate any failure, so losing it takes service discovery and the mount table offline.
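The reasoning behind the replica counts follows from etcd's majority rule: a cluster of n members needs n/2 + 1 (integer division) of them available to make progress. The helpers below are illustrative arithmetic only, not Alluxio or etcd tooling.

```shell
# etcd_quorum MEMBERS -> members required for a majority
etcd_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# etcd_fault_tolerance MEMBERS -> members that can fail while a majority remains
etcd_fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 5; do
  echo "members=$n quorum=$(etcd_quorum "$n") tolerated_failures=$(etcd_fault_tolerance "$n")"
done
# members=1 quorum=1 tolerated_failures=0
# members=2 quorum=2 tolerated_failures=0
# members=3 quorum=2 tolerated_failures=1
# members=5 quorum=3 tolerated_failures=2
```

Two members tolerate no more failures than one, which is why 3 is the smallest count that survives the loss of a node.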

D.3. alluxio-cluster-fuse PVC in pending status

The alluxio-cluster-fuse PVC remaining in a Pending state is normal. It will automatically bind to a volume and become Bound once a client application pod starts using it.

D.4. Worker pod stuck in CrashLoopBackOff

Symptom: Worker pod repeatedly crashes and restarts.

Start by checking the worker logs:
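A sketch of the log check, assuming the alx-ns namespace from the main guide. The pod name is a placeholder, and both functions are hypothetical wrappers around standard kubectl commands. The --previous flag matters here: it shows the logs of the container instance that crashed, not the one that just restarted.

```shell
# worker_crash_logs NAMESPACE POD
# Prints the last 200 log lines of the previous (crashed) container instance.
worker_crash_logs() {
  kubectl -n "$1" logs "$2" --previous --tail=200
}

# last_exit_code NAMESPACE POD
# Prints the exit code of each container's last termination (137 usually indicates an OOM kill).
last_exit_code() {
  kubectl -n "$1" get pod "$2" \
    -o jsonpath='{.status.containerStatuses[*].lastState.terminated.exitCode}'
}

# Usage (placeholder pod name; list the real ones with `kubectl -n alx-ns get pod`):
#   worker_crash_logs alx-ns <worker-pod-name>
#   last_exit_code alx-ns <worker-pod-name>
```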

Common causes include:

  • Pagestore quota exceeds disk space — log shows quota (NNN) exceeds the total disk space. This commonly occurs because cloud providers advertise disk size in GB (base-10), while Kubernetes interprets Gi as GiB (base-2). Fix: reduce pagestore.size to ~90% of actual available space (df -h /mnt/alluxio) and reservedSize to ~10% of size.

  • License expired or invalid — log shows a license error. Fix: apply a new license. See Appendix C: License Management.

  • OOM killed — log shows Exit Code 137 or OutOfMemoryError. Fix: increase container memory limits and adjust -Xmx / -XX:MaxDirectMemorySize. See Worker Configuration — Resource and JVM Tuning.
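For the pagestore quota case, the GB versus GiB gap can be worked out directly. The sketch below takes the available space in bytes and applies the ~90% and ~10% guidance from the first bullet; pagestore_sizes is a hypothetical helper and the result is a starting point, not an official sizing formula.

```shell
# pagestore_sizes AVAILABLE_BYTES   (hypothetical helper)
# Prints suggested pagestore.size and reservedSize values in Gi (base-2).
pagestore_sizes() {
  gib=$(( $1 / 1073741824 ))        # bytes -> whole GiB (1 GiB = 2^30 bytes)
  size=$(( gib * 90 / 100 ))        # ~90% of the available space
  reserved=$(( size * 10 / 100 ))   # ~10% of size
  echo "size=${size}Gi reservedSize=${reserved}Gi"
}

# A disk advertised as 1000 GB (base-10) holds only about 931 GiB:
pagestore_sizes 1000000000000
# size=837Gi reservedSize=83Gi
```

On a node with GNU coreutils, the available bytes can be read with df -B1 --output=avail /mnt/alluxio | tail -n 1 and passed to the function.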

E. Platform-Specific Notes

The main installation steps are platform-agnostic. This section documents known differences for specific Kubernetes environments.

Amazon EKS

EBS CSI driver required for EKS 1.23+: Starting with Kubernetes 1.23, CSI migration for EBS is enabled by default, so volume operations that the in-tree EBS plugin used to handle are delegated to the CSI driver. On EKS 1.23+, install the AWS EBS CSI driver add-on, or PVC provisioning will silently fail even if a StorageClass is listed.

Verify:
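One way to run this check is to look for the driver's CSIDriver object, which the AWS EBS CSI driver registers under the name ebs.csi.aws.com. This is a sketch and may differ from the command the guide originally showed; ebs_csi_driver is a hypothetical helper.

```shell
# ebs_csi_driver
# Prints the CSIDriver entry for the AWS EBS CSI driver, or nothing if it is absent.
ebs_csi_driver() {
  kubectl get csidriver --no-headers 2>/dev/null | grep -F 'ebs.csi.aws.com'
}

# Usage:
#   ebs_csi_driver
```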

If no output, the driver is not installed.

Google GKE

Read-only /mnt path: GKE nodes have a read-only root filesystem. Multiple Alluxio components default to hostPaths under /mnt/alluxio/, causing worker pods to fail with:

Workaround: Redirect all hostPaths to a writable base directory (e.g., /home/alluxio/) in alluxio-cluster.yaml:

Additionally, configure worker identity persistence to prevent workers from registering as new instances after each restart (leaving stale OFFLINE workers behind). For the full explanation of why this matters and the impact on the hash ring, see Restarting a Worker.

kind (Local Development)

Image loading: kind load docker-image can fail for multi-platform images with digest errors. Use the following workaround:

To find the kind container name: docker ps | grep kindest.

  • How Alluxio Works — Architecture overview, consistent hashing, and failover behavior

  • Cluster Management — Post-deployment operations: scaling, hash ring tuning, worker lifecycle, and UFS mount management

  • Monitoring — Grafana access, dashboard import, alert rules, and Datadog integration
