Managing Alluxio

This guide provides a comprehensive overview of administrative operations for managing a running Alluxio cluster on Kubernetes. It covers day-to-day tasks such as configuration updates, scaling, and upgrades, as well as advanced topics like namespace and multi-tenancy management.

1. Cluster Lifecycle and Configuration

This section covers fundamental operations related to the cluster's lifecycle, such as scaling, upgrades, and dynamic configuration updates.

Scaling the Cluster

You can dynamically scale the number of Alluxio workers up or down to adjust to workload changes.

To Scale Up Workers:

  1. Modify your alluxio-cluster.yaml file and increase the count under the worker section. The example below scales from 2 to 3 workers.

  2. Apply the change to your cluster.
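
The following excerpt sketches the change from step 1. The surrounding fields depend on your existing alluxio-cluster.yaml; the worker count sits under the worker section, assumed here to be spec.worker.count:

# alluxio-cluster.yaml (excerpt): only the worker count changes
apiVersion: k8s-operator.alluxio.com/v1
kind: AlluxioCluster
metadata:
  name: alluxio-cluster
  namespace: alx-ns
spec:
  worker:
    count: 3   # scaled up from 2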

# Apply the changes to Kubernetes
$ kubectl apply -f alluxio-cluster.yaml
alluxiocluster.k8s-operator.alluxio.com/alluxio-cluster configured

# Verify the new worker pods are being created
$ kubectl -n alx-ns get pod
NAME                                          READY   STATUS            RESTARTS   AGE
...
alluxio-cluster-worker-58999f8ddd-p6n59       0/1     PodInitializing   0          4s

# Wait for all workers to become ready
$ kubectl -n alx-ns get pod -l app.kubernetes.io/component=worker
NAME                                          READY   STATUS    RESTARTS   AGE
alluxio-cluster-worker-58999f8ddd-cd6r2       1/1     Running   0          5m21s
alluxio-cluster-worker-58999f8ddd-rtftk       1/1     Running   0          4m21s
alluxio-cluster-worker-58999f8ddd-p6n59       1/1     Running   0          34s

Upgrading Alluxio

The upgrade process involves two main steps: upgrading the Alluxio Operator and then upgrading the Alluxio cluster itself.

Step 1: Upgrade the Operator

The operator is stateless and can be safely re-installed without affecting the running Alluxio cluster.

  1. Obtain the new Docker images for the operator and the new Helm chart.

  2. Uninstall the old operator and install the new one.

# Uninstall the current operator
$ helm uninstall operator
release "operator" uninstalled

# Ensure the operator namespace is fully removed
$ kubectl get ns alluxio-operator
Error from server (NotFound): namespaces "alluxio-operator" not found

# Create any new CRDs from the new Helm chart directory and replace the existing ones
$ kubectl create -f alluxio-operator/crds 2>/dev/null
$ kubectl replace -f alluxio-operator/crds 2>/dev/null
customresourcedefinition.apiextensions.k8s.io/alluxioclusters.k8s-operator.alluxio.com replaced
customresourcedefinition.apiextensions.k8s.io/clustergroups.k8s-operator.alluxio.com replaced
customresourcedefinition.apiextensions.k8s.io/collectinfoes.k8s-operator.alluxio.com replaced
customresourcedefinition.apiextensions.k8s.io/licenses.k8s-operator.alluxio.com replaced
customresourcedefinition.apiextensions.k8s.io/underfilesystems.k8s-operator.alluxio.com replaced

# Install the new operator using your configuration file (update the image tag)
$ helm install operator -f operator-config.yaml alluxio-operator
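
For reference, operator-config.yaml usually only needs its image tag updated. The field names below are illustrative assumptions; keep whatever your existing file defines and change only the tag:

# operator-config.yaml (illustrative excerpt)
image: <PRIVATE_REGISTRY>/alluxio-operator   # assumed field name; keep your existing value
imageTag: <NEW_OPERATOR_VERSION>             # update to the new operator release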

Step 2: Upgrade the Alluxio Cluster

The operator will perform a rolling upgrade of the Alluxio components.

  1. Upload the new Alluxio Docker images to your registry.

  2. Update the imageTag in your alluxio-cluster.yaml to the new version (see the example after this list).

  3. Apply the configuration change.
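
The edit from step 2 is a one-line change. A minimal excerpt, assuming imageTag sits directly under spec as in the default cluster template:

# alluxio-cluster.yaml (excerpt): point the cluster at the new release
apiVersion: k8s-operator.alluxio.com/v1
kind: AlluxioCluster
metadata:
  name: alluxio-cluster
  namespace: alx-ns
spec:
  imageTag: AI-3.7-13.0.0   # new version; verified with `alluxio info version` below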

# Apply the updated cluster definition
$ kubectl apply -f alluxio-cluster.yaml
alluxiocluster.k8s-operator.alluxio.com/alluxio-cluster configured

# Monitor the rolling upgrade process
$ kubectl -n alx-ns get pod
NAME                                          READY   STATUS     RESTARTS   AGE
alluxio-cluster-coordinator-0                 0/1     Init:0/2   0          7s
...
alluxio-cluster-worker-58999f8ddd-cd6r2       0/1     Init:0/2   0          7s
alluxio-cluster-worker-5d6786f5bf-cxv5j       1/1     Running    0          10m

# Check the cluster status until it returns to 'Ready'
$ kubectl -n alx-ns get alluxiocluster
NAME              CLUSTERPHASE   AGE
alluxio-cluster   Updating       10m
...
NAME              CLUSTERPHASE   AGE
alluxio-cluster   Ready          12m

# Verify the new version is running
$ kubectl -n alx-ns exec -it alluxio-cluster-coordinator-0 -- alluxio info version 2>/dev/null
AI-3.7-13.0.0

Dynamically Updating Configuration

You can change Alluxio properties in a running cluster by editing its ConfigMap.

  1. Find the ConfigMap for your cluster.

    $ kubectl -n alx-ns get configmap
    NAME                              DATA   AGE
    alluxio-cluster-alluxio-conf      4      7m48s
    ...
  2. Edit the ConfigMap to modify alluxio-site.properties, alluxio-env.sh, etc.

    $ kubectl -n alx-ns edit configmap alluxio-cluster-alluxio-conf
  3. Restart components to apply the changes.

    • Coordinator: kubectl -n alx-ns rollout restart statefulset alluxio-cluster-coordinator

    • Workers: kubectl -n alx-ns rollout restart deployment alluxio-cluster-worker

    • DaemonSet FUSE: kubectl -n alx-ns rollout restart daemonset alluxio-fuse

    • CSI FUSE: These pods are restarted by restarting the application pod that mounts the volume, or by manually deleting the FUSE pod (kubectl -n alx-ns delete pod <fuse-pod-name>).
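
After the restart, you can confirm that an updated property has taken effect. A quick check, assuming the alluxio conf get subcommand shipped with recent releases:

# Verify the effective value of the property you edited (replace <property.name> with the actual key)
$ kubectl -n alx-ns exec -it alluxio-cluster-coordinator-0 -- alluxio conf get <property.name>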

2. Worker Management

Alluxio's decentralized architecture relies on workers that are managed via a consistent hash ring.

Checking Worker Status

To see a list of all registered workers and their current status (online or offline):

bin/alluxio info nodes

Adding a New Worker

To add a new worker to the cluster:

  1. Install the Alluxio software on the new node.

  2. Ensure the alluxio-site.properties file is configured to point to your etcd cluster, as shown in the example after this list.

  3. Start the worker process. It will automatically register itself in etcd and join the consistent hashing ring.
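
A minimal alluxio-site.properties excerpt for step 2. The property names below assume the etcd-based membership mechanism of recent Alluxio versions; check the names against your release:

# alluxio-site.properties (excerpt): register the worker with the etcd cluster
alluxio.worker.membership.manager.type=ETCD
alluxio.etcd.endpoints=http://<etcd-host-1>:2379,http://<etcd-host-2>:2379,http://<etcd-host-3>:2379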

Removing a Worker Permanently

If you need to decommission a worker permanently:

  1. Shut down the worker process on the target node.

  2. Get the Worker ID by running bin/alluxio info nodes.

  3. Remove the worker using its ID.

    bin/alluxio process remove-worker -n <worker_id>
  4. Verify removal by running bin/alluxio info nodes again.

Important: Removing a worker is a permanent action that will cause its portion of the hash ring to be redistributed, potentially causing a temporary increase in cache misses.

Restarting a Worker

If you restart a worker for maintenance, it will be temporarily marked as offline. As long as its identity is preserved (via alluxio.worker.identity.uuid.file.path), it will rejoin the cluster with its cached data intact and available.
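
To keep the identity stable across restarts, point the property mentioned above at a location that outlives the worker pod, such as a persistent volume. A one-line sketch, with the path being an assumption:

# alluxio-site.properties (excerpt): persist the worker identity outside the pod's ephemeral storage
alluxio.worker.identity.uuid.file.path=/mnt/alluxio-worker-identity/worker_identity   # example path on a persistent volume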

3. UFS Mount Management

Alluxio's unified namespace allows you to mount multiple under storage systems (UFS) into a single logical view. The mount table that manages these connections is stored in etcd for high availability. Alluxio components periodically poll etcd for the latest mount table, so any changes are automatically propagated throughout the cluster.

Managing Mount Points

UFS mounts are managed through UnderFileSystem resources. To add a new mount, see the Connect to Storage guide.

  • List Mounts:

    # You can also list all submitted UnderFileSystem resources with the Kubernetes CLI.
    $ kubectl -n alx-ns get ufs
    NAME         PHASE   AGE
    alluxio-s3   Ready   13d
  • Remove a Mount:

    $ kubectl -n alx-ns delete ufs alluxio-s3
    underfilesystem.k8s-operator.alluxio.com "alluxio-s3" deleted

Configuring UFS Credentials

You can provide credentials for a UFS globally or on a per-mount basis.

  • Global Configuration: Set properties for all mounts of a certain type (e.g., all S3 mounts) in alluxio-site.properties.

    # alluxio-site.properties
    s3a.accessKeyId=<S3_ACCESS_KEY>
    s3a.secretKey=<S3_SECRET_KEY>
  • Per-Mount Configuration (Recommended): Provide credentials as mountOptions in the UnderFileSystem resource. This is the most flexible and secure method, ideal for connecting to multiple systems with different credentials. Per-mount options override global settings.

    apiVersion: k8s-operator.alluxio.com/v1
    kind: UnderFileSystem
    metadata:
      name: alluxio-s3
      namespace: alx-ns
    spec:
      alluxioCluster: alluxio-cluster
      path: s3://bucket-a/images
      mountPath: /s3-images
      mountOptions:
        s3a.accessKeyId: <S3_ACCESS_KEY_ID>
        s3a.secretKey: <S3_SECRET_KEY>

Rules for Mounting

When defining your namespace, you must follow two important rules to ensure a valid and unambiguous mount table.

Rule 1: Mounts Must Be Direct Children of the Root (/)

You can only create mount points at the top level of the Alluxio namespace. You cannot mount to the root path (/) itself, nor can you create a mount point inside a non-existent directory.

Examples:

Action                 Alluxio Path    UFS Path                  Valid?    Reason
Mount a bucket         /s3-data        s3://my-bucket/           ✔️ Yes    Mount point is a direct child of root.
Mount to root          /               s3://my-bucket/           ❌ No     The root path cannot be a mount point.
Mount to a sub-path    /data/images    s3://my-bucket/images/    ❌ No     Mount points cannot be created in subdirectories.

Rule 2: Mounts Cannot Be Nested

One mount point cannot be created inside another, either in the Alluxio namespace or in the UFS namespace. For example, if /data is mounted to s3://my-bucket/data, you cannot create a new mount at /data/tables (nested Alluxio path) or mount another UFS to s3://my-bucket/data/tables (nested UFS path).

Example Scenario:

Suppose you have an existing mount point:

  • Alluxio Path: /data

  • UFS Path: s3://my-bucket/data

The following new mounts would be invalid:

New Alluxio Path    New UFS Path                  Valid?    Reason for Rejection
/data/tables        hdfs://namenode/tables        ❌ No     The Alluxio path /data/tables is nested inside the existing /data mount.
/tables             s3://my-bucket/data/tables    ❌ No     The UFS path s3://.../data/tables is nested inside the existing s3://.../data mount.

4. Multi-Tenancy and Federation

For large-scale enterprise deployments, Alluxio provides advanced features for multi-tenancy and cluster federation. This allows multiple teams and business units to share data infrastructure securely and efficiently while simplifying administrative overhead.

The reference architecture features an API Gateway that centrally handles authentication and authorization across multiple Alluxio clusters.

Core Concepts

Authentication

Alluxio integrates with external enterprise identity providers like Okta. When a user logs in, the provider authenticates them and generates a JSON Web Token (JWT). This JWT is then sent with every subsequent request to the Alluxio API Gateway to verify the user's identity.

Authorization

Once a user is authenticated, Alluxio uses an external policy engine, Open Policy Agent (OPA), to determine what actions the user is authorized to perform. Administrators can write fine-grained access control policies in OPA's declarative language, Rego, to control which users can access which resources. The API Gateway queries OPA for every request to ensure it is authorized.

Multi-Tenancy and Isolation

Alluxio enforces isolation between tenants to ensure security and prevent interference. This is achieved through:

  • User Roles: Defining different roles with specific access levels and permissions.

  • Cache Isolation: Assigning tenant-specific cache configurations, including quotas, TTLs, and eviction policies, ensuring one tenant's workload does not negatively impact another's.

Cluster Federation

For organizations with multiple Alluxio clusters (e.g., across different regions or for different business units), federation simplifies management. A central Management Console provides a single pane of glass for:

  • Cross-cluster monitoring and metrics.

  • Executing operations across multiple clusters simultaneously.

  • Centralized license management for all clusters.

Example Workflow: Updating a Cache Policy

This workflow demonstrates how the components work together:

  1. Authentication: A user logs into the Management Console, which redirects them to Okta for authentication. Upon success, Okta issues a JWT.

  2. Request Submission: The user uses the console to submit a request to change a cache TTL. The request, containing the JWT, is sent to the API Gateway.

  3. Authorization: The API Gateway validates the JWT and queries the OPA Policy Engine to check if the user has permission to modify cache settings for the target tenant.

  4. Execution: If the request is authorized, the API Gateway forwards the command to the coordinator of the relevant Alluxio cluster, which then applies the new TTL policy.
