Kubernetes Installation

This documentation shows how to install Alluxio on Kubernetes via the Alluxio Operator.

Overview

Artifacts

You will receive download links for three artifacts, plus a license:

| Artifact | Filename | Purpose |
| --- | --- | --- |
| Helm chart | alluxio-operator-3.5.2-helmchart.tgz | Deploys the Operator onto Kubernetes |
| Operator image | alluxio-operator-3.5.2-linux-amd64-docker.tar | Container image for the Operator pod |
| Alluxio image | alluxio-enterprise-AI-3.8-15.1.2-linux-amd64-docker.tar | Container image for Alluxio worker and coordinator pods |
| License | | Required to activate the cluster |

Platform: Use -linux-amd64-docker.tar for x86 nodes or -linux-arm64-docker.tar for ARM nodes.

The Docker images are never pulled from a public registry — they are loaded from the .tar files and pushed to your private registry before deployment.

Helm chart (.tgz)
  └─► deploys ─► Operator pod  (alluxio-operator image)
                    └─► watches AlluxioCluster CRD
                           └─► creates ─► Alluxio pods  (alluxio-enterprise image)

Kubernetes Components

A deployed Alluxio cluster consists of:

  • Operator — Manages the lifecycle of Alluxio clusters. Installed once per Kubernetes cluster.

  • Coordinator — Handles background operations (data loading, freeing). 1 replica.

  • Workers — Cache data and serve reads via S3 API or FUSE. Scale horizontally for more cache capacity.

  • ETCD — Service discovery and mount table storage. 3 replicas recommended for quorum.

  • Monitoring — Prometheus and Grafana. Enabled by default, but optional and can be disabled if you already run your own monitoring stack.

Before You Start

Run these checks before starting (~2 minutes). Skipping this step is the most common cause of deployment failures.

Installation Steps

0. Push Alluxio Images to Your Private Registry

Skip this step if the Alluxio images are already present in your private registry.

Alluxio images are delivered as .tar files and must be loaded and pushed to your private registry before the Helm chart can deploy them.

Load the images into your local Docker:
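A sketch of the load commands, assuming the artifact filenames shown in the table above:

```shell
# Load both delivered tarballs into the local Docker daemon
docker load -i alluxio-operator-3.5.2-linux-amd64-docker.tar
docker load -i alluxio-enterprise-AI-3.8-15.1.2-linux-amd64-docker.tar
```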

✅ Success: docker images shows both images:

Retag and push to your private registry:
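A sketch of the retag-and-push step. The exact source repository and tag are printed by `docker load`; `<PRIVATE_REGISTRY>` is a placeholder for your registry address:

```shell
# Retag each loaded image for your private registry, then push
docker tag alluxio/operator:3.5.2 <PRIVATE_REGISTRY>/alluxio-operator:3.5.2
docker push <PRIVATE_REGISTRY>/alluxio-operator:3.5.2

docker tag alluxio/alluxio-enterprise:AI-3.8-15.1.2 <PRIVATE_REGISTRY>/alluxio-enterprise:AI-3.8-15.1.2
docker push <PRIVATE_REGISTRY>/alluxio-enterprise:AI-3.8-15.1.2
```

Use the repository names reported by `docker images` if they differ from these placeholders.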

✅ Success: Both images are now in your private registry. Note down the full image paths — you will use them in Steps 1 and 4.

If your Kubernetes cluster also cannot reach public registries (air-gapped), third-party images (etcd, CSI) also need to be relocated. See Appendix A: Air-Gapped Deployment.

1. Prepare Helm Chart

Extract the Helm chart:
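Assuming the chart filename from the artifacts table:

```shell
tar -xzf alluxio-operator-3.5.2-helmchart.tgz
```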

This creates the alluxio-operator directory containing the Helm chart.

Create alluxio-operator.yaml (outside the chart directory) to specify the operator image from your private registry:
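A minimal sketch. The top-level `image`/`imageTag` keys are the convention used by the operator chart's values; verify against the chart's `values.yaml`, and replace `<PRIVATE_REGISTRY>` with your registry:

```yaml
image: <PRIVATE_REGISTRY>/alluxio-operator
imageTag: 3.5.2
```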

2. Create Namespace
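Create the namespace that the cluster will be deployed into (this guide uses alx-ns throughout):

```shell
kubectl create namespace alx-ns
```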

3. Deploy Operator
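A sketch of the install command, assuming the chart was extracted to the alluxio-operator directory and the release is named operator:

```shell
helm install operator -f alluxio-operator.yaml alluxio-operator
```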

✅ Success: Helm prints STATUS: deployed immediately after the command completes:

Then verify all pods are running:
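Assuming the operator installs into the alluxio-operator namespace:

```shell
kubectl get pod -n alluxio-operator
```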

✅ Success: All operator pods show READY 1/1 or 2/2, STATUS = Running, and RESTARTS = 0.

Example output:

If pods fail with image pull errors on etcd or CSI images, see Appendix A: Air-Gapped Deployment.

4. Deploy Cluster

Create a minimal alluxio-cluster.yaml:
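A minimal sketch of the AlluxioCluster resource. The `apiVersion`, `kind`, and spec fields follow the operator's CRD conventions and should be checked against the samples shipped with your chart version; `<PRIVATE_REGISTRY>` and `<LICENSE_STRING>` are placeholders:

```yaml
apiVersion: k8s-operator.alluxio.com/v1
kind: AlluxioCluster
metadata:
  name: alluxio
  namespace: alx-ns
spec:
  image: <PRIVATE_REGISTRY>/alluxio-enterprise
  imageTag: AI-3.8-15.1.2
  properties:
    alluxio.license: <LICENSE_STRING>
```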

Deploy:
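```shell
kubectl create -f alluxio-cluster.yaml
```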

✅ Success: Startup typically takes 2–3 minutes. The first deployment may take longer if the Alluxio image (~1.8 GB) needs to be pulled from the registry. To watch progress in real time:
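```shell
# Watch pod status updates as the cluster starts
kubectl -n alx-ns get pod -w
```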

Once all pods are running, kubectl -n alx-ns get alluxiocluster shows CLUSTERPHASE = Ready, and kubectl -n alx-ns get pod shows every pod with STATUS = Running and all containers Ready.

If any component fails to start, see Appendix F: Troubleshooting.

5. Mount Storage

Create ufs.yaml (S3 example; for other storage systems, see Underlying Storage):
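A sketch of the UnderFileSystem resource for an S3 mount. Field names follow the operator's CRD conventions and should be verified against your version; the bucket path and credentials are placeholders:

```yaml
apiVersion: k8s-operator.alluxio.com/v1
kind: UnderFileSystem
metadata:
  name: alluxio-s3
  namespace: alx-ns
spec:
  alluxioCluster: alluxio
  path: s3://<S3_BUCKET>/<S3_DIRECTORY>
  mountPath: /s3
  mountOptions:
    s3a.accessKeyId: <S3_ACCESS_KEY_ID>
    s3a.secretKey: <S3_SECRET_KEY>
```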

Apply:
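```shell
kubectl create -f ufs.yaml
```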

✅ Success: kubectl -n alx-ns get ufs shows PHASE = Ready.

Example out:

6. Verify Cluster
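List the active mounts from inside the coordinator pod (the pod name alluxio-coordinator-0 is an assumption; confirm it with kubectl -n alx-ns get pod):

```shell
kubectl -n alx-ns exec -it alluxio-coordinator-0 -- alluxio mount list
```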

✅ Success: Output displays your mount point (e.g., s3://my-bucket/... on /s3/).

Example output:

7. Verify Data Access
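List the mounted path from inside the coordinator pod (pod name assumed as in the previous step):

```shell
kubectl -n alx-ns exec -it alluxio-coordinator-0 -- alluxio fs ls /s3/
```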

✅ Success: Returns a directory listing without errors.

Alluxio provides several APIs for applications to access data:

  • POSIX API via FUSE — Mount Alluxio as a local filesystem. See FUSE Guide.

  • S3 API — S3-compatible endpoint. See S3 API Guide.

  • Python API via FSSpec — Native Python interface. See FSSpec Guide.

Uninstall

To remove the Alluxio deployment from your cluster, run the following commands in order:

1. Delete the UFS mount and cluster:
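```shell
kubectl -n alx-ns delete -f ufs.yaml
kubectl -n alx-ns delete -f alluxio-cluster.yaml
```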

2. Uninstall the operator:
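Assuming the Helm release was named operator in Step 3:

```shell
helm uninstall operator
```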

3. Delete the namespaces:
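```shell
kubectl delete namespace alx-ns alluxio-operator
```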

✅ Success: kubectl get namespace no longer shows alx-ns or alluxio-operator, and kubectl get alluxiocluster -A returns No resources found.

Recommended Production Configuration

The basic configuration in Step 4 is suitable for evaluation. For production deployments, apply the following additional settings for HA, resource tuning, and persistent metadata.

Label Nodes

A common practice is to assign dedicated nodes to each Alluxio component. This prevents resource contention between components (for example, etcd I/O interfering with worker cache I/O) and gives you predictable placement for capacity planning.
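A sketch of the labeling commands. The label key and values here are examples, not operator requirements; any consistent scheme works as long as the nodeSelector entries in your cluster spec match:

```shell
# Label dedicated nodes for each component (example key/values)
kubectl label node <COORDINATOR_NODE> alluxio-role=coordinator
kubectl label node <ETCD_NODE_1> alluxio-role=etcd
kubectl label node <WORKER_NODE_1> alluxio-role=worker
```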

Worker pods have an anti-affinity rule by default — multiple worker pods will not be scheduled on the same node.

Production alluxio-cluster.yaml

Key differences from the basic configuration:

  • Node selectors: Pin each component to dedicated nodes to prevent resource contention and ensure predictable placement. See the label commands above.

  • Worker count: Scale the number of workers to match your target cache volume and target throughput.

  • ETCD replicas: 3 for quorum-based HA. Deploy on dedicated, stable nodes.

  • Resource limits and JVM options: Explicitly set to prevent OOM. The container memory limit must exceed the sum of -Xmx and -XX:MaxDirectMemorySize.

  • Persistent metastore: Coordinator metadata survives pod restarts.

Other important settings for production deployment:

  • License Management: A cluster license is the simplest way to get started. For production environments, a deployment license is recommended. See Appendix E: License Management for details on both options.

  • Hash Ring Configuration: It is critical to configure the hash ring before deployment, as changes can be destructive. For detailed guidance, see Appendix B: Handling Hash Ring.

  • Heterogeneous Clusters: If your cluster includes workers with different capacities, you must define a specific data distribution strategy. See Appendix C: Handling Heterogeneous Workers for configuration steps.

  • Worker Page Store Sizing: Properly configure the pagestore on your workers. The size parameter sets the cache capacity, while reservedSize allocates space for internal operations, including temporary page writes and file metadata caching. We recommend setting reservedSize to ~10% of size (10–100 GiB) and ensuring the total (size + reservedSize) fits within the worker's storage.

  • Advanced Configuration: For other settings, such as resource and JVM tuning or using an external etcd, refer to Appendix D: Advanced Configuration.


Appendix

Use the table below to find the relevant appendix section for your scenario:

| Scenario | Sections |
| --- | --- |
| Air-gapped (cluster cannot reach public registries) | A |
| Heterogeneous cluster (mixed worker disks) | B, C |
| External or custom ETCD | D.5, D.6 |
| Multi-cluster on shared nodes | D.7 |
| Production licensing | E |
| Something went wrong | F |

A. Air-Gapped Deployment

Symptom: After deploying the operator or cluster, some pods are stuck in ImagePullBackOff. Your Kubernetes cluster cannot reach public registries to pull third-party component images (CSI, etcd, monitoring).

Alluxio images are already in your private registry from Step 0: Push Alluxio Images to Your Private Registry. The remaining images to relocate depend on your operator version. Identify them by inspecting the stuck pods:
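```shell
# Find pods stuck on image pulls, then see which image each one requests
kubectl get pods -A | grep -E "ImagePullBackOff|ErrImagePull"
kubectl -n <NAMESPACE> describe pod <POD_NAME> | grep "Image:"
```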

For each image that cannot be pulled: pull it from a machine with public internet access, retag it for your private registry, and push it. Then update alluxio-operator.yaml or alluxio-cluster.yaml to point to your private registry for that component.

The CSI images (part of the operator) can be overridden in alluxio-operator.yaml:
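A sketch of the override block. The key layout (alluxio-csi subchart with controllerPlugin/nodePlugin image fields) is an assumption; confirm the exact keys in the chart's values.yaml, and use the image tags your operator version expects:

```yaml
alluxio-csi:
  controllerPlugin:
    provisioner:
      image: <PRIVATE_REGISTRY>/csi-provisioner:<TAG>
  nodePlugin:
    driverRegistrar:
      image: <PRIVATE_REGISTRY>/csi-node-driver-registrar:<TAG>
```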

The etcd image (part of the cluster) can be overridden in alluxio-cluster.yaml:
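Since the spec.etcd fields follow the Bitnami etcd chart (see D.6), the image can be pointed at your registry with the chart's registry/repository/tag structure; tag value is a placeholder:

```yaml
spec:
  etcd:
    image:
      registry: <PRIVATE_REGISTRY>
      repository: bitnami/etcd
      tag: <TAG>
```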

B. Handling Hash Ring

The consistent hash ring determines how data is mapped to workers. It is critical to define your hash ring strategy before deploying the cluster, as changing these settings later is a destructive operation that will cause all cached data to be lost.

Key properties to consider, which should be set in alluxio-cluster.yaml under .spec.properties:

  • Hash Ring Mode (alluxio.user.dynamic.consistent.hash.ring.enabled):

    • true (Default): Dynamic mode. Includes only online workers. Best for most environments.

    • false: Static mode. Includes all registered workers, online or offline. Use if you need a stable ring view despite temporary worker unavailability.

  • Virtual Nodes (alluxio.user.worker.selection.policy.consistent.hash.virtual.node.count.per.worker):

    • Default: 2000. Controls load balancing granularity.

  • Worker Capacity (alluxio.user.worker.selection.policy.consistent.hash.provider.impl):

    • DEFAULT (Default): Assumes all workers have equal capacity.

    • CAPACITY: Allocates virtual nodes based on worker storage capacity. Use this for heterogeneous clusters.

For more details, see Hash Ring Management.

C. Handling Heterogeneous Workers

The Alluxio operator allows you to manage heterogeneous worker configurations, which is particularly useful for clusters where nodes have different disk specifications. This feature enables you to define distinct worker groups, each with its own storage settings.

Note: While this provides flexibility, it is crucial to ensure consistency within each worker group. Misconfigurations can lead to unexpected errors. This guide covers the supported use case of configuring workers with different disk setups.

To set up heterogeneous workers, follow these steps:

  1. Group Nodes by Specification: First, identify and group your Kubernetes nodes based on their disk configurations. For example, you might have one group of 10 nodes with a single 1TB disk and another group of 12 nodes with two 800GB disks.

  2. Label the Nodes: Assign unique labels to each group of nodes. This allows you to target specific configurations to the correct machines.

  3. Define Worker Groups and Enable Capacity-Based Hashing: In your alluxio-cluster.yaml, use the .spec.workerGroups field to define each group. Use a nodeSelector to apply the specific configuration to the nodes with the corresponding label.

    For heterogeneous clusters, it is also recommended to configure the hash ring to be capacity-aware. This ensures that workers with more storage capacity are allocated a proportionally larger share of data. You can do this by setting alluxio.user.worker.selection.policy.consistent.hash.provider.impl to CAPACITY.

    The example below shows a complete configuration for a heterogeneous cluster:
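A sketch of such a configuration for the two node groups from step 1. The workerGroups field layout is an assumption to be verified against your operator version, and the node labels (disk: single-1tb / disk: dual-800gb) are hypothetical examples from step 2:

```yaml
apiVersion: k8s-operator.alluxio.com/v1
kind: AlluxioCluster
metadata:
  name: alluxio
  namespace: alx-ns
spec:
  image: <PRIVATE_REGISTRY>/alluxio-enterprise
  imageTag: AI-3.8-15.1.2
  properties:
    # Capacity-aware hash ring for heterogeneous workers
    alluxio.user.worker.selection.policy.consistent.hash.provider.impl: CAPACITY
  workerGroups:
    - worker:
        count: 10
        nodeSelector:
          disk: single-1tb
        pagestore:
          size: 900Gi
          reservedSize: 90Gi
    - worker:
        count: 12
        nodeSelector:
          disk: dual-800gb
        pagestore:
          size: 1440Gi
          reservedSize: 144Gi
```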

D. Advanced Configuration

This section describes common configurations to adapt to different scenarios.

D.1. Configuring Alluxio Properties

To modify Alluxio's configuration, edit the .spec.properties field in the alluxio-cluster.yaml file. These properties are appended to the alluxio-site.properties file inside the Alluxio pods.
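For example, to pin the hash ring mode described in Appendix B (property name taken from that appendix):

```yaml
spec:
  properties:
    alluxio.user.dynamic.consistent.hash.ring.enabled: "true"
```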

D.2. Resource and JVM Tuning

You can configure resource limits and JVM options for each component.
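A sketch of a worker sizing block using the numbers discussed in this section (-Xmx22g, -XX:MaxDirectMemorySize=10g, 36 GiB limit). The resources/jvmOptions field names follow common operator spec conventions and should be checked against your chart's samples:

```yaml
spec:
  worker:
    resources:
      requests:
        memory: 36Gi
      limits:
        memory: 36Gi
    jvmOptions:
      - "-Xmx22g"
      - "-XX:MaxDirectMemorySize=10g"
```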

Memory limit formula: container memory limit ≥ -Xmx + -XX:MaxDirectMemorySize + ~2 GiB of JVM overhead.

For the worker config above (-Xmx22g, -XX:MaxDirectMemorySize=10g): minimum limit is 22 + 10 + 2 = 34 GiB, set to 36 GiB in the example.

If -XX:MaxDirectMemorySize is omitted, the JVM defaults it to the same value as -Xmx, so the container limit typically needs to be 2.5× -Xmx or more.

Diagnosing OOM

If a worker pod is killed due to OOM (exit code 137), use these commands to confirm the cause:
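```shell
# Check the last container state and exit code (137 = OOM-killed)
kubectl -n alx-ns describe pod <WORKER_POD> | grep -A 5 "Last State"

# Search the previous container's logs for Java OOM errors
kubectl -n alx-ns logs <WORKER_POD> --previous | grep -i "OutOfMemoryError"
```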

| Symptom | Root Cause | Fix |
| --- | --- | --- |
| Exit Code 137, no Java error | Container limit exceeded — killed by Linux OOM killer | Increase resources.limits.memory |
| java.lang.OutOfMemoryError: Java heap space | -Xmx too small | Increase -Xmx and raise container limit accordingly |
| java.lang.OutOfMemoryError: Direct buffer memory | -XX:MaxDirectMemorySize too small | Increase -XX:MaxDirectMemorySize and raise container limit accordingly |

D.3. Use PVC for Page Store

To persist worker cache data, specify a PersistentVolumeClaim (PVC) for the page store.
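A sketch of a PVC-backed page store. The pagestore field names (type, storageClass, size) are assumptions to verify against your operator version:

```yaml
spec:
  worker:
    pagestore:
      type: persistentVolumeClaim
      storageClass: <STORAGE_CLASS>
      size: 100Gi
```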

D.4. Mount Custom ConfigMaps or Secrets

You can mount custom ConfigMap or Secret files into your Alluxio pods. This is useful for providing configuration files like core-site.xml or credentials.

Example: Mount a Secret

  1. Create the secret from a local file:

  2. Specify the secret to load and the mount path in alluxio-cluster.yaml:

    The file my-file will be available at /opt/alluxio/secret/my-file on the pods.
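A sketch of both steps. The kubectl command is standard; the spec.secrets layout (component name mapped to a mount path) is an assumption to verify against your operator samples:

```shell
# Step 1: create the secret from a local file named my-file
kubectl -n alx-ns create secret generic my-secret --from-file=my-file
```

```yaml
# Step 2: mount it into worker and coordinator pods at /opt/alluxio/secret
spec:
  secrets:
    worker:
      my-secret: /opt/alluxio/secret
    coordinator:
      my-secret: /opt/alluxio/secret
```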

D.5. Use External ETCD

If you have an external ETCD cluster, you can configure Alluxio to use it instead of the one deployed by the operator.
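A sketch of the configuration: disable the operator-managed etcd and point Alluxio at the external endpoints. The enabled flag and the alluxio.etcd.endpoints property name are assumptions to verify against your version's documentation:

```yaml
spec:
  etcd:
    enabled: false
  properties:
    alluxio.etcd.endpoints: http://<ETCD_HOST_1>:2379,http://<ETCD_HOST_2>:2379,http://<ETCD_HOST_3>:2379
```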

D.6. Customize ETCD configuration

The fields under spec.etcd follow the Bitnami ETCD helm chart. For example, to set node affinity for etcd pods, the affinity field can be used as described in the Kubernetes documentation.

D.7. nodeSelector

The nodeSelector field allows you to control which nodes Kubernetes schedules pods to. For instructions on labeling nodes and applying node selectors in production, see Recommended Production Configuration.

Additional scenario: Multiple clusters

If multiple Alluxio clusters are deployed and different clusters belong to different namespaces, services from different clusters may be scheduled by Kubernetes to the same node, causing deployment failures. You can label different nodes to indicate which cluster the node belongs to:
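The label key/value here (cluster-name=cluster-a) is an example; any consistent scheme works:

```shell
kubectl label node <NODE_NAME> cluster-name=cluster-a
```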

And specify the nodeSelector at the cluster level in your cluster.yaml:
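A sketch matching the example label above:

```yaml
spec:
  nodeSelector:
    cluster-name: cluster-a
```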

D.8. Prepare Namespace

If you want to install Alluxio in a custom namespace (e.g., alluxio-test), creating the namespace is required before installation.
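```shell
kubectl create namespace alluxio-test
```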

D.9. Configure Image Pull Secrets

If your container images are stored in a private registry that requires authentication, you need to create a Kubernetes Secret to store your registry credentials.

This secret must be created in the namespace where you plan to install Alluxio.
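The standard kubectl command, using the placeholders described below:

```shell
kubectl create secret docker-registry <SECRET_NAME> \
  --docker-server=<REGISTRY_SERVER> \
  --docker-username=<USERNAME> \
  --docker-password=<PASSWORD> \
  --namespace <NAMESPACE>
```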

  • <SECRET_NAME>: Name of the secret (e.g., alluxio-image-pull-secret).

  • <REGISTRY_SERVER>: Your private registry server address (e.g., https://index.docker.io/v1/ for Docker Hub).

  • <USERNAME>: Your registry username.

  • <PASSWORD>: Your registry password.

  • <NAMESPACE>: The namespace where Alluxio will be installed.

Once created, you can verify the secret exists:
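```shell
kubectl get secret <SECRET_NAME> -n <NAMESPACE>
```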

E. License Management

Alluxio requires a license provided by your sales representative. There are two types: a cluster license (for single test clusters) and a deployment license (recommended for production).

E.1. Cluster License

A cluster license is set directly in the alluxio-cluster.yaml file. This method is not recommended for production.
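A sketch of the relevant fragment, using the alluxio.license property named later in this appendix; `<LICENSE_STRING>` is the value from your sales representative:

```yaml
spec:
  properties:
    alluxio.license: <LICENSE_STRING>
```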

E.2. Deployment License

A deployment license is the recommended method for production and can cover multiple clusters. It is applied by creating a separate License resource after the cluster has been created.

Step 1: Create the Cluster without a License Deploy the Alluxio cluster as described in Step 4 of the main guide, but do not include the alluxio.license property in alluxio-cluster.yaml. The pods will start but remain in an Init state, waiting for the license.

Step 2: Apply the License Create an alluxio-license.yaml file. The name and namespace in this file must match the metadata of your AlluxioCluster.
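A sketch of the License resource. The kind and field names are assumptions to verify against your operator version; licenseString and the clusters list are referenced elsewhere in this appendix, and metadata matches the AlluxioCluster from the main guide:

```yaml
apiVersion: k8s-operator.alluxio.com/v1
kind: License
metadata:
  name: alluxio        # must match the AlluxioCluster metadata
  namespace: alx-ns
spec:
  licenseString: <LICENSE_STRING>
  clusters:
    - name: alluxio
      namespace: alx-ns
```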

Apply this file with kubectl create -f alluxio-license.yaml. The Alluxio pods will detect the license and transition to Running.

Warning: Only specify running clusters in the clusters list. If the operator cannot find a listed cluster, the license operation will fail for all clusters.

E.3. Updating a Deployment License

To update an existing deployment license, update the licenseString in your alluxio-license.yaml and re-apply it:
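```shell
kubectl apply -f alluxio-license.yaml
```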

E.4. Checking License Status

You can check the license details and utilization from within the Alluxio coordinator pod.

F. Troubleshooting

F.1. etcd pod stuck in pending status

If etcd pods are Pending, it is often due to storage issues. Use kubectl describe pod <etcd-pod-name> to check events.

Symptom: Event message shows pod has unbound immediate PersistentVolumeClaims.

Cause: No storageClass is set for the PVC, or no PV is available.

Solution: Specify a storageClass in alluxio-cluster.yaml:
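Since spec.etcd follows the Bitnami etcd chart (see D.6), the persistence block can be used; size is an example value:

```yaml
spec:
  etcd:
    persistence:
      storageClass: <STORAGE_CLASS>
      size: 8Gi
```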

Then, delete the old cluster and PVCs before recreating the cluster.

Symptom: Event message shows waiting for first consumer.

Cause: The storageClass does not support dynamic provisioning, and a volume must be manually created by an administrator.

Solution: Either use a dynamic provisioner or manually create a PersistentVolume that satisfies the claim.

F.2. etcd pods stuck in Pending due to anti-affinity (fewer than 3 nodes)

Symptom: etcd pods are Pending with event message 0/N nodes are available: N node(s) didn't match pod anti-affinity rules.

Cause: The operator deploys etcd with requiredDuringSchedulingIgnoredDuringExecution anti-affinity by hostname. With etcd.replicaCount: 3 (the default), Kubernetes requires 3 distinct nodes. If your cluster has fewer than 3 nodes, etcd pods cannot be scheduled.

Solution: For dev/test clusters with fewer than 3 nodes, reduce the replica count:
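Using the etcd.replicaCount field named above:

```yaml
spec:
  etcd:
    replicaCount: 1
```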

Do not use replicaCount: 1 in production — a single etcd instance has no quorum and is not fault-tolerant.

F.3. alluxio-cluster-fuse PVC in pending status

The alluxio-cluster-fuse PVC remaining in a Pending state is normal. It will automatically bind to a volume and become Bound once a client application pod starts using it.

F.4. Worker pod stuck in CrashLoopBackOff

Symptom: Worker pod repeatedly crashes and restarts.

Start by checking the worker logs:
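```shell
# --previous shows logs from the crashed container, not the restarted one
kubectl -n alx-ns logs <WORKER_POD> --previous
```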

Common causes include:

  • Pagestore quota exceeds disk space — log shows quota (NNN) exceeds the total disk space. This commonly occurs because cloud providers advertise disk size in GB (base-10), while Kubernetes interprets Gi as GiB (base-2). Fix: reduce pagestore.size to ~90% of actual available space (df -h /mnt/alluxio) and reservedSize to ~10% of size.

  • License expired or invalid — log shows a license error. Fix: apply a new license. See Appendix E: License Management.

  • OOM killed — log shows Exit Code 137 or OutOfMemoryError. Fix: increase container memory limits and adjust -Xmx / -XX:MaxDirectMemorySize. See D.2. Resource and JVM Tuning.

G. Platform-Specific Notes

The main installation steps are platform-agnostic. This section documents known differences for specific Kubernetes environments.

Amazon EKS

EBS CSI driver required for EKS 1.23+: The in-tree EBS volume driver was removed in Kubernetes 1.23. On EKS 1.23+, install the AWS EBS CSI driver add-on, or PVC provisioning will silently fail even if a StorageClass is listed.

Verify:
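```shell
# EBS CSI driver pods run in kube-system; no output means it is not installed
kubectl get pods -n kube-system | grep ebs-csi
```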

If no output, the driver is not installed.

Google GKE

Read-only /mnt path: GKE nodes have a read-only root filesystem. Multiple Alluxio components default to hostPaths under /mnt/alluxio/, causing worker pods to fail with:

Workaround: Redirect all hostPaths to a writable base directory (e.g., /home/alluxio/) in alluxio-cluster.yaml:

Additionally, configure worker identity persistence to prevent workers from registering as new instances after each restart (leaving stale OFFLINE workers behind):

kind (Local Development)

Image loading: kind load docker-image can fail for multi-platform images with digest errors. Use the following workaround:
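A commonly used workaround is to save the image and import it into containerd inside the kind node directly; the image name and kind container name are placeholders:

```shell
# Save the image, copy it into the kind node, and import via containerd
docker save <PRIVATE_REGISTRY>/alluxio-enterprise:AI-3.8-15.1.2 -o alluxio.tar
docker cp alluxio.tar <KIND_CONTAINER>:/alluxio.tar
docker exec <KIND_CONTAINER> ctr -n k8s.io images import /alluxio.tar
```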

To find the kind container name: docker ps | grep kindest.
