Managing Alluxio

This guide provides a comprehensive overview of administrative operations for managing a running Alluxio cluster on Kubernetes. It covers day-to-day tasks such as configuration updates, scaling, and upgrades, as well as advanced topics like namespace and multi-tenancy management.

1. Cluster Lifecycle and Configuration

This section covers fundamental operations related to the cluster's lifecycle, such as scaling, upgrades, and dynamic configuration updates.

Scaling the Cluster

You can dynamically scale the number of Alluxio workers up or down to adjust to workload changes.

To Scale Up Workers:

1. Modify your alluxio-cluster.yaml file and increase the count under the worker section. The example below scales from 2 to 3 workers.
2. Apply the change to your cluster.

# Apply the changes to Kubernetes
$ kubectl apply -f alluxio-cluster.yaml
alluxiocluster.k8s-operator.alluxio.com/alluxio-cluster configured

# Verify the new worker pods are being created
$ kubectl -n alx-ns get pod
NAME                                          READY   STATUS            RESTARTS   AGE
...
alluxio-cluster-worker-58999f8ddd-p6n59       0/1     PodInitializing   0          4s

# Wait for all workers to become ready
$ kubectl -n alx-ns get pod -l app.kubernetes.io/component=worker
NAME                                          READY   STATUS    RESTARTS   AGE
alluxio-cluster-worker-58999f8ddd-cd6r2       1/1     Running   0          5m21s
alluxio-cluster-worker-58999f8ddd-rtftk       1/1     Running   0          4m21s
alluxio-cluster-worker-58999f8ddd-p6n59       1/1     Running   0          34s
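Instead of re-running `get pod` by hand, `kubectl wait` can block until the workers report Ready. The following is a minimal sketch, assuming the `alx-ns` namespace and the worker label shown above; the function name `wait_for_workers` and the 300s default timeout are illustrative choices, not part of Alluxio.

```shell
# Block until every existing worker pod reports the Ready condition, or time out.
# wait_for_workers and the 300s default are illustrative, not part of Alluxio.
wait_for_workers() {
  local namespace="${1:-alx-ns}"
  local timeout="${2:-300s}"
  kubectl -n "$namespace" wait pod \
    -l app.kubernetes.io/component=worker \
    --for=condition=Ready \
    --timeout="$timeout"
}
```

For example, `wait_for_workers alx-ns 600s` waits up to ten minutes. Note that `kubectl wait` only considers pods that already exist when it is called, so run it after the new pods have been created.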

Upgrading Alluxio

The upgrade process involves two main steps: upgrading the Alluxio Operator and then upgrading the Alluxio cluster itself.

Step 1: Upgrade the Operator

The operator is stateless and can be safely re-installed without affecting the running Alluxio cluster.

1. Obtain the new Docker images for the operator and the new Helm chart.
2. Uninstall the old operator and install the new one.

# Uninstall the current operator
$ helm uninstall operator
release "operator" uninstalled

# Ensure the operator namespace is fully removed
$ kubectl get ns alluxio-operator
Error from server (NotFound): namespaces "alluxio-operator" not found

# Create any CRDs that are new, then replace the existing ones, using the new Helm chart directory
$ kubectl create -f alluxio-operator/crds
$ kubectl replace -f alluxio-operator/crds
customresourcedefinition.apiextensions.k8s.io/alluxioclusters.k8s-operator.alluxio.com replaced
customresourcedefinition.apiextensions.k8s.io/clustergroups.k8s-operator.alluxio.com replaced
customresourcedefinition.apiextensions.k8s.io/collectinfoes.k8s-operator.alluxio.com replaced
customresourcedefinition.apiextensions.k8s.io/licenses.k8s-operator.alluxio.com replaced
customresourcedefinition.apiextensions.k8s.io/underfilesystems.k8s-operator.alluxio.com replaced

# Install the new operator using your configuration file (update the image tag)
$ helm install operator -f operator-config.yaml alluxio-operator
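After reinstalling, it can help to confirm that the release deployed and the operator pods came up before touching the cluster. This is a sketch that assumes the release name `operator` and the `alluxio-operator` namespace used above; the function name and the 180s default timeout are illustrative.

```shell
# Confirm the Helm release exists and the operator pods are Ready.
# Assumes the release name "operator" and namespace "alluxio-operator" used above.
verify_operator() {
  helm status operator || return 1
  kubectl -n alluxio-operator wait pod --all \
    --for=condition=Ready --timeout="${1:-180s}"
}
```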

Step 2: Upgrade the Alluxio Cluster

The operator will perform a rolling upgrade of the Alluxio components.

1. Upload the new Alluxio Docker images to your registry.
2. Update the imageTag in your alluxio-cluster.yaml to the new version.
3. Apply the configuration change.

# Apply the updated cluster definition
$ kubectl apply -f alluxio-cluster.yaml
alluxiocluster.k8s-operator.alluxio.com/alluxio-cluster configured

# Monitor the rolling upgrade process
$ kubectl -n alx-ns get pod
NAME                                          READY   STATUS     RESTARTS   AGE
alluxio-cluster-coordinator-0                 0/1     Init:0/2   0          7s
...
alluxio-cluster-worker-58999f8ddd-cd6r2       0/1     Init:0/2   0          7s
alluxio-cluster-worker-5d6786f5bf-cxv5j       1/1     Running    0          10m

# Check the cluster status until it returns to 'Ready'
$ kubectl -n alx-ns get alluxiocluster
NAME              CLUSTERPHASE   AGE
alluxio-cluster   Updating       10m
...
NAME              CLUSTERPHASE   AGE
alluxio-cluster   Ready          12m

# Verify the new version is running
$ kubectl -n alx-ns exec -it alluxio-cluster-coordinator-0 -- alluxio info version 2>/dev/null
AI-3.7-13.0.0
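The "check until Ready" step above can be scripted as a polling loop. The sketch below reads the CLUSTERPHASE column from the table output shown above rather than assuming a status field name; the function name, attempt count, and interval are illustrative.

```shell
# Poll the CLUSTERPHASE column until it reads "Ready" or attempts run out.
# wait_for_cluster_ready and its defaults are illustrative.
wait_for_cluster_ready() {
  local namespace="${1:-alx-ns}" name="${2:-alluxio-cluster}"
  local attempts="${3:-60}" interval="${4:-10}"
  local phase i
  for ((i = 0; i < attempts; i++)); do
    # With --no-headers the row is "NAME CLUSTERPHASE AGE"; the phase is field 2.
    phase=$(kubectl -n "$namespace" get alluxiocluster "$name" --no-headers 2>/dev/null |
      awk '{print $2}')
    if [ "$phase" = "Ready" ]; then
      echo "cluster $name is Ready"
      return 0
    fi
    echo "cluster $name phase: ${phase:-unknown}"
    sleep "$interval"
  done
  echo "timed out waiting for $name" >&2
  return 1
}
```

With the defaults this waits up to ten minutes; pass a larger attempt count for clusters with many worker batches.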

During the rolling upgrade, workers are restarted in batches; every worker in the current batch must be fully ready before the next batch starts. The default batch size is 10% of the workers.

A smaller batch reduces the interruption to running workloads at the cost of a longer upgrade. To set the batch size as a proportion or as an exact number of workers, add the following to alluxio-cluster.yaml:

apiVersion: k8s-operator.alluxio.com/v1
kind: AlluxioCluster
spec:
  worker:
    rollingUpdate:
      maxUnavailable: 1  # default is 10%; accepts either a percentage or an exact number of workers
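To get a feel for the trade-off, the arithmetic can be sketched as below. This guide does not state how the operator rounds a percentage, so the sketch ASSUMES it rounds down with a floor of one worker; treat the numbers as estimates. The function names are illustrative.

```shell
# Resolve a maxUnavailable value ("10%" or "3") to a number of workers.
# ASSUMPTION: percentages round down, with a floor of one worker per batch.
batch_size() {
  local workers="$1" value="$2" size
  if [[ $value == *% ]]; then
    size=$(( workers * ${value%\%} / 100 ))
  else
    size=$value
  fi
  if (( size < 1 )); then
    size=1
  fi
  echo "$size"
}

# Number of sequential batches needed to restart every worker.
batch_count() {
  local workers="$1" size
  size=$(batch_size "$1" "$2")
  echo $(( (workers + size - 1) / size ))
}
```

Under these assumptions, a 50-worker cluster restarts in 10 batches of 5 with the default of 10%, and in 50 batches of 1 with `maxUnavailable: 1`.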

Dynamically Updating Configuration

You can change Alluxio properties in a running cluster by editing its ConfigMap.

Find the ConfigMap for your cluster.

$ kubectl -n alx-ns get configmap
NAME                              DATA   AGE
alluxio-cluster-alluxio-conf      4      7m48s
...

Edit the ConfigMap to modify alluxio-site.properties, alluxio-env.sh, etc.
```
$ kubectl -n alx-ns edit configmap alluxio-cluster-alluxio-conf
```
Restart components to apply the changes.
- Coordinator: kubectl -n alx-ns rollout restart statefulset alluxio-cluster-coordinator
- Workers: kubectl -n alx-ns rollout restart deployment alluxio-cluster-worker
- DaemonSet FUSE: kubectl -n alx-ns rollout restart daemonset alluxio-fuse
- CSI FUSE: These pods must be restarted by exiting the application pod or by manually deleting the FUSE pod (kubectl -n alx-ns delete pod <fuse-pod-name>).
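The coordinator and worker restarts above can be combined with `kubectl rollout status` so that each restart finishes before the next begins. This is a sketch using the resource names from the commands above; the function name and the 10m timeout are illustrative, and FUSE pods are left out because their restart procedure differs.

```shell
# Restart the coordinator, then the workers, waiting for each rollout to finish.
# Resource names follow the commands above; the function name is illustrative.
restart_alluxio() {
  local namespace="${1:-alx-ns}" target
  for target in statefulset/alluxio-cluster-coordinator deployment/alluxio-cluster-worker; do
    kubectl -n "$namespace" rollout restart "$target" || return 1
    kubectl -n "$namespace" rollout status "$target" --timeout=10m || return 1
  done
}
```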

2. Hash Ring Management

Important: Hash ring settings should be defined during the initial cluster setup. Modifying these configurations on a running cluster is a destructive operation that will cause all cached data to be lost, as it changes how data is mapped to workers.

Alluxio uses a consistent hash ring to map data to workers in a decentralized manner. You can fine-tune its behavior to optimize for different cluster environments and workloads.

Configuring the Hash Ring Mode

The consistent hash ring can operate in two modes: dynamic (default) or static.

In dynamic mode, the hash ring includes only online workers. When a worker goes offline, it is removed from the ring, and its virtual nodes are redistributed. This is ideal for most use cases, providing high adaptability.

In static mode, the hash ring includes all registered workers, regardless of their online status. This is useful in scenarios where you need a consistent view of all workers to minimize data re-fetching from UFS, even if some workers are temporarily unavailable.

To configure the mode, set the alluxio.user.dynamic.consistent.hash.ring.enabled property. Set it to true for dynamic mode (the default) or false for static mode.
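Before changing the mode, it is worth checking what the cluster is currently configured with. The helper below is a sketch that reads a property out of the cluster ConfigMap; it ASSUMES the ConfigMap stores the properties file under the key `alluxio-site.properties`, and the function name is illustrative.

```shell
# Print the value of one property from alluxio-site.properties in the ConfigMap.
# ASSUMPTION: the ConfigMap stores the file under the key "alluxio-site.properties".
get_alluxio_property() {
  local property="$1"
  local namespace="${2:-alx-ns}"
  local configmap="${3:-alluxio-cluster-alluxio-conf}"
  kubectl -n "$namespace" get configmap "$configmap" \
    -o 'jsonpath={.data.alluxio-site\.properties}' |
    awk -F= -v key="$property" '$1 == key { print substr($0, length(key) + 2) }'
}
```

Running `get_alluxio_property alluxio.user.dynamic.consistent.hash.ring.enabled` prints `true` or `false` if the property is set explicitly. Empty output means the property is not set in the file, in which case the default (dynamic mode) applies.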

Adjusting Virtual Nodes for Load Balancing

To ensure an even distribution of data and I/O requests, Alluxio uses virtual nodes. Each worker is mapped to multiple virtual nodes on the hash ring, which helps to balance the load more effectively across the cluster.

You can adjust the number of virtual nodes per worker by configuring the alluxio.user.worker.selection.policy.consistent.hash.virtual.node.count.per.worker property (default: 2000). Adjusting this value can help fine-tune load distribution, especially in clusters with diverse workloads or a small number of workers.
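To make the idea concrete, here is a toy consistent hash ring with virtual nodes. It is a sketch only: `cksum` stands in for the hash function, the worker names are made up, and nothing here reflects how Alluxio implements its ring internally.

```shell
# Toy consistent hash ring with virtual nodes. cksum stands in for the hash
# function; Alluxio's real hashing is not described in this guide.
hash32() { printf '%s' "$1" | cksum | cut -d' ' -f1; }

# build_ring VNODES WORKER... prints "hash worker" lines sorted by hash.
build_ring() {
  local vnodes="$1" worker i
  shift
  for worker in "$@"; do
    for ((i = 0; i < vnodes; i++)); do
      printf '%s %s\n' "$(hash32 "$worker#$i")" "$worker"
    done
  done | sort -n
}

# lookup RING KEY prints the worker owning KEY: the first virtual node at or
# after the key's hash, wrapping around to the start of the ring.
lookup() {
  local key_hash
  key_hash=$(hash32 "$2")
  awk -v h="$key_hash" '
    NR == 1 { first = $2 }
    $1 >= h { print $2; found = 1; exit }
    END     { if (!found) print first }
  ' <<<"$1"
}
```

For example, `ring=$(build_ring 50 worker-a worker-b worker-c)` followed by `lookup "$ring" /s3-images/cat.png` prints the owning worker. Rebuilding the ring without `worker-c` changes the owner only for keys that `worker-c` held, which is the behavior dynamic mode relies on when a worker goes offline. More virtual nodes per worker spread each worker's slices more evenly around the ring.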

Optimizing for Heterogeneous Workers

By default, the consistent hashing algorithm assumes that all workers have equal capacity. In clusters with heterogeneous workers (e.g., different storage capacities or network speeds), you can enable capacity-based allocation for more balanced resource utilization. This ensures that workers with more storage handle a proportionally larger share of data.

To enable this, set the alluxio.user.worker.selection.policy.consistent.hash.provider.impl property to CAPACITY. The default value is DEFAULT, which allocates an equal number of virtual nodes to each worker.
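As a rough illustration of what "proportional" means here, the sketch below computes each worker's expected share of data from its capacity. It only does the arithmetic; it does not model how Alluxio assigns virtual nodes. The worker names and the `NAME=CAPACITY` argument format are hypothetical.

```shell
# capacity_shares NAME=CAPACITY... prints each worker's expected share of data
# (integer percent, rounded down) when allocation is proportional to capacity.
# ASSUMPTION: capacities are whole numbers in one common unit (for example GiB).
capacity_shares() {
  local entry total=0
  for entry in "$@"; do
    total=$(( total + ${entry#*=} ))
  done
  for entry in "$@"; do
    echo "${entry%%=*} $(( ${entry#*=} * 100 / total ))%"
  done
}
```

With `capacity_shares small=1000 large=3000`, the small worker is expected to hold about 25% of the data and the large one about 75%, whereas the DEFAULT setting would aim for an even 50/50 split.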

3. Worker Management

Alluxio's decentralized architecture relies on workers that are managed via a consistent hash ring.

Checking Worker Status

To see a list of all registered workers and their current status (online or offline):

bin/alluxio info nodes

Adding a New Worker

To add a new worker to the cluster:

1. Install the Alluxio software on the new node.
2. Ensure the alluxio-site.properties file is configured to point to your etcd cluster.
3. Start the worker process. It will automatically register itself in etcd and join the consistent hash ring.

Removing a Worker Permanently

If you need to decommission a worker permanently:

1. Shut down the worker process on the target node.
2. Get the worker ID by running bin/alluxio info nodes.
3. Remove the worker using its ID:

bin/alluxio process remove-worker -n <worker_id>

4. Verify removal by running bin/alluxio info nodes again.

Important: Removing a worker is a permanent action that will cause its portion of the hash ring to be redistributed, potentially causing a temporary increase in cache misses.
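The remove and verify steps can be wrapped in one small helper. This is a sketch built only from the two commands above; `ALLUXIO_BIN` is an illustrative override for the path to the CLI, and the worker process must already be stopped before calling it.

```shell
# Remove a stopped worker from the cluster, then list the remaining nodes.
# ALLUXIO_BIN is an illustrative override; it defaults to bin/alluxio as above.
remove_worker() {
  local worker_id="$1"
  local alluxio="${ALLUXIO_BIN:-bin/alluxio}"
  if [ -z "$worker_id" ]; then
    echo "usage: remove_worker <worker_id>" >&2
    return 2
  fi
  "$alluxio" process remove-worker -n "$worker_id" || return 1
  "$alluxio" info nodes
}
```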

Restarting a Worker

If you restart a worker for maintenance, it will be temporarily marked as offline. As long as its identity is preserved (via alluxio.worker.identity.uuid.file.path), it will rejoin the cluster with its cached data intact and available.

4. UFS Mount Management

Alluxio's unified namespace allows you to mount multiple under storage systems (UFS) into a single logical view. The mount table that manages these connections is stored in etcd for high availability. Alluxio components periodically poll etcd for the latest mount table, so any changes are automatically propagated throughout the cluster.

Managing Mount Points

Use UnderFileSystem resources to manage UFS mounts. To add a new mount, see the Connect to Storage guide.

List Mounts:

# List all submitted UnderFileSystem configurations with the Kubernetes CLI.
$ kubectl -n alx-ns get ufs
NAME         PHASE   AGE
alluxio-s3   Ready   13d

Remove a Mount:

$ kubectl -n alx-ns delete ufs alluxio-s3
underfilesystem.k8s-operator.alluxio.com "alluxio-s3" deleted

Configuring UFS Credentials

You can provide credentials for a UFS globally or on a per-mount basis.

Global Configuration: Set properties for all mounts of a certain type (e.g., all S3 mounts) in alluxio-site.properties.
```
# alluxio-site.properties
s3a.accessKeyId=<S3_ACCESS_KEY_ID>
s3a.secretKey=<S3_SECRET_KEY>
```

Per-Mount Configuration (Recommended): Provide credentials as mountOptions in the UnderFileSystem resource for that mount. This is the most flexible and secure method, ideal for connecting to multiple systems with different credentials. Per-mount options override global settings.

apiVersion: k8s-operator.alluxio.com/v1
kind: UnderFileSystem
metadata:
  name: alluxio-s3
  namespace: alx-ns
spec:
  alluxioCluster: alluxio-cluster
  path: s3://bucket-a/images
  mountPath: /s3-images
  mountOptions:
    s3a.accessKeyId: <S3_ACCESS_KEY_ID>
    s3a.secretKey: <S3_SECRET_KEY>

Rules for Mounting

When defining your namespace, you must follow two important rules to ensure a valid and unambiguous mount table.

Rule 1: Mounts Must Be Direct Children of the Root (`/`)

You can only create mount points at the top level of the Alluxio namespace. You cannot mount to the root path (/) itself, nor can you create a mount point at a deeper path such as /data/images.

Examples:

| Action | Alluxio Path | UFS Path | Valid? | Reason |
| --- | --- | --- | --- | --- |
| Mount a bucket | /s3-data | s3://my-bucket/ | ✔️ Yes | Mount point is a direct child of root. |
| Mount to root | / | s3://my-bucket/ | ❌ No | The root path cannot be a mount point. |
| Mount to a sub-path | /data/images | s3://my-bucket/images/ | ❌ No | Mount points cannot be created in subdirectories. |

Rule 2: Mounts Cannot Be Nested

One mount point cannot be created inside another, either in the Alluxio namespace or in the UFS namespace. For example, if /data is mounted to s3://my-bucket/data, you cannot create a new mount at /data/tables (nested Alluxio path) or mount another UFS to s3://my-bucket/data/tables (nested UFS path).

Example Scenario:

Suppose you have an existing mount point:

Alluxio Path: /data
UFS Path: s3://my-bucket/data

The following new mounts would be invalid:

| New Alluxio Path | New UFS Path | Valid? | Reason for Rejection |
| --- | --- | --- | --- |
| /data/tables | hdfs://namenode/tables | ❌ No | The Alluxio path /data/tables is nested inside the existing /data mount. |
| /tables | s3://my-bucket/data/tables | ❌ No | The UFS path s3://.../data/tables is nested inside the existing s3://.../data mount. |
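The two rules can be expressed as a small pre-flight check to run before submitting a new mount. This is a sketch of the rules as described in this section, not the validation Alluxio itself performs; the function names are illustrative, and the check treats overlap in either direction (new inside existing, or existing inside new) as nesting.

```shell
# Strip trailing slashes, but leave a bare "/" alone.
strip_slash() {
  local p="$1"
  while [[ ${#p} -gt 1 && $p == */ ]]; do p="${p%/}"; done
  printf '%s' "$p"
}

# True if path $1 is equal to, or lies inside, path $2.
is_nested() {
  local a b
  a=$(strip_slash "$1")
  b=$(strip_slash "$2")
  [[ $a == "$b" || $a == "$b"/* ]]
}

# check_mount NEW_ALLUXIO_PATH NEW_UFS_PATH [EXISTING_ALLUXIO_PATH EXISTING_UFS_PATH]...
check_mount() {
  local new_path new_ufs
  new_path=$(strip_slash "$1")
  new_ufs=$(strip_slash "$2")
  shift 2
  # Rule 1: exactly one path component below the root.
  if [[ ! $new_path =~ ^/[^/]+$ ]]; then
    echo "invalid: mount point must be a direct child of /"
    return 1
  fi
  # Rule 2: no overlap with any existing mount, on either side.
  while (( $# >= 2 )); do
    if is_nested "$new_path" "$1" || is_nested "$1" "$new_path"; then
      echo "invalid: Alluxio path overlaps existing mount $1"
      return 1
    fi
    if is_nested "$new_ufs" "$2" || is_nested "$2" "$new_ufs"; then
      echo "invalid: UFS path overlaps existing mount $2"
      return 1
    fi
    shift 2
  done
  echo "valid"
}
```

For instance, `check_mount /tables s3://my-bucket/data/tables /data s3://my-bucket/data` reports a UFS overlap, matching the second rejected example. A path such as /data/tables is already rejected by the Rule 1 check before the nesting check runs.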

5. Multi-Tenancy and Federation

For large-scale enterprise deployments, Alluxio provides advanced features for multi-tenancy and cluster federation. This allows multiple teams and business units to share data infrastructure securely and efficiently while simplifying administrative overhead.

The reference architecture features an API Gateway that centrally handles authentication and authorization across multiple Alluxio clusters.

Core Concepts

Authentication

Alluxio integrates with external enterprise identity providers like Okta. When a user logs in, the provider authenticates them and generates a JSON Web Token (JWT). This JWT is then sent with every subsequent request to the Alluxio API Gateway to verify the user's identity.
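When troubleshooting access problems, it can be useful to see which claims a token carries. A JWT is three base64url-encoded segments separated by dots, and the middle segment holds the claims. The sketch below decodes that segment; it does not verify the signature, so it is for inspection only. It assumes a `base64` command that accepts `-d`, and the function name is illustrative.

```shell
# Print the decoded payload (claims) of a JWT. This does NOT verify the signature.
jwt_payload() {
  local payload
  # Take the middle segment and map the base64url alphabet back to base64.
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # base64url omits padding; restore it so base64 accepts the input.
  while (( ${#payload} % 4 != 0 )); do payload+="="; done
  printf '%s' "$payload" | base64 -d
}
```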

Authorization

Once a user is authenticated, Alluxio uses an external policy engine, Open Policy Agent (OPA), to determine what actions the user is authorized to perform. Administrators can write fine-grained access control policies in OPA's declarative language, Rego, to control which users can access which resources. The API Gateway queries OPA for every request to ensure it is authorized.

Multi-Tenancy and Isolation

Alluxio enforces isolation between tenants to ensure security and prevent interference. This is achieved through:

- User Roles: Defining different roles with specific access levels and permissions.
- Cache Isolation: Assigning tenant-specific cache configurations, including quotas, TTLs, and eviction policies, ensuring one tenant's workload does not negatively impact another's.

Cluster Federation

For organizations with multiple Alluxio clusters (e.g., across different regions or for different business units), federation simplifies management. A central Management Console provides a single pane of glass for:

- Cross-cluster monitoring and metrics.
- Executing operations across multiple clusters simultaneously.
- Centralized license management for all clusters.

Example Workflow: Updating a Cache Policy

This workflow demonstrates how the components work together:

1. Authentication: A user logs into the Management Console, which redirects them to Okta for authentication. Upon success, Okta issues a JWT.
2. Request Submission: The user uses the console to submit a request to change a cache TTL. The request, containing the JWT, is sent to the API Gateway.
3. Authorization: The API Gateway validates the JWT and queries the OPA Policy Engine to check if the user has permission to modify cache settings for the target tenant.
4. Execution: If the request is authorized, the API Gateway forwards the command to the coordinator of the relevant Alluxio cluster, which then applies the new TTL policy.
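The authorization step can be illustrated with a direct query to OPA's Data API, which accepts a POST to /v1/data/<policy path> with an `input` document and returns a `result`. Everything specific in this sketch is HYPOTHETICAL: the OPA URL, the policy path `alluxio/authz/allow`, and the input field names are placeholders, since the real ones depend on your gateway deployment and Rego policies.

```shell
# Ask OPA whether a request is allowed, via OPA's Data API (POST /v1/data/<path>).
# HYPOTHETICAL: the OPA URL, the policy path "alluxio/authz/allow", and the input
# fields are placeholders; the real ones depend on your gateway and Rego policies.
opa_allows() {
  local user="$1" action="$2" tenant="$3"
  local opa_url="${OPA_URL:-http://localhost:8181}"
  curl -sS -X POST "$opa_url/v1/data/alluxio/authz/allow" \
    -H 'Content-Type: application/json' \
    -d "{\"input\": {\"user\": \"$user\", \"action\": \"$action\", \"tenant\": \"$tenant\"}}" |
    grep -Eq '"result"[[:space:]]*:[[:space:]]*true'
}
```

The function succeeds only when OPA answers with `"result": true`, so it can be used as `opa_allows alice update-cache-ttl team-a && echo allowed`. A missing policy yields a response without a `result` field, which this sketch treats as a denial.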

