Managing Alluxio
This guide provides a comprehensive overview of administrative operations for managing a running Alluxio cluster on Kubernetes. It covers day-to-day tasks such as configuration updates, scaling, and upgrades, as well as advanced topics like namespace and multi-tenancy management.
1. Cluster Lifecycle and Configuration
This section covers fundamental operations related to the cluster's lifecycle, such as scaling, upgrades, and dynamic configuration updates.
Scaling the Cluster
You can dynamically scale the number of Alluxio workers up or down to adjust to workload changes.
To Scale Up Workers:
Modify your alluxio-cluster.yaml file and increase the count under the worker section. The example below scales from 2 to 3 workers. Apply the change to your cluster.
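A minimal sketch of the relevant part of alluxio-cluster.yaml (only the fields being changed are shown; everything else stays as it was at install time):
# alluxio-cluster.yaml (only the relevant fields shown)
apiVersion: k8s-operator.alluxio.com/v1
kind: AlluxioCluster
metadata:
  name: alluxio-cluster
spec:
  worker:
    count: 3   # previously 2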
# Apply the changes to Kubernetes
$ kubectl apply -f alluxio-cluster.yaml
alluxiocluster.k8s-operator.alluxio.com/alluxio-cluster configured
# Verify the new worker pods are being created
$ kubectl -n alx-ns get pod
NAME READY STATUS RESTARTS AGE
...
alluxio-cluster-worker-58999f8ddd-p6n59 0/1 PodInitializing 0 4s
# Wait for all workers to become ready
$ kubectl -n alx-ns get pod -l app.kubernetes.io/component=worker
NAME READY STATUS RESTARTS AGE
alluxio-cluster-worker-58999f8ddd-cd6r2 1/1 Running 0 5m21s
alluxio-cluster-worker-58999f8ddd-rtftk 1/1 Running 0 4m21s
alluxio-cluster-worker-58999f8ddd-p6n59 1/1 Running 0 34s
Upgrading Alluxio
The upgrade process involves two main steps: upgrading the Alluxio Operator and then upgrading the Alluxio cluster itself.
Step 1: Upgrade the Operator
The operator is stateless and can be safely re-installed without affecting the running Alluxio cluster.
Obtain the new Docker images for the operator and the new Helm chart.
Uninstall the old operator and install the new one.
# Uninstall the current operator
$ helm uninstall operator
release "operator" uninstalled
# Ensure the operator namespace is fully removed
$ kubectl get ns alluxio-operator
Error from server (NotFound): namespaces "alluxio-operator" not found
# Create any CRDs that are new in the Helm chart, then replace the existing ones with the updated definitions
$ kubectl create -f alluxio-operator/crds 2>/dev/null
$ kubectl replace -f alluxio-operator/crds 2>/dev/null
customresourcedefinition.apiextensions.k8s.io/alluxioclusters.k8s-operator.alluxio.com replaced
customresourcedefinition.apiextensions.k8s.io/clustergroups.k8s-operator.alluxio.com replaced
customresourcedefinition.apiextensions.k8s.io/collectinfoes.k8s-operator.alluxio.com replaced
customresourcedefinition.apiextensions.k8s.io/licenses.k8s-operator.alluxio.com replaced
customresourcedefinition.apiextensions.k8s.io/underfilesystems.k8s-operator.alluxio.com replaced
# Install the new operator using your configuration file (update the image tag)
$ helm install operator -f operator-config.yaml alluxio-operator
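For reference, operator-config.yaml is where the operator image is pinned. The field names below follow a typical operator configuration and are assumptions; check them against your existing file:
# operator-config.yaml (illustrative; adjust registry and tag to the new release)
image: <PRIVATE_REGISTRY>/alluxio-operator
imageTag: <NEW_OPERATOR_VERSION>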
Step 2: Upgrade the Alluxio Cluster
The operator will perform a rolling upgrade of the Alluxio components.
Upload the new Alluxio Docker images to your registry.
Update the imageTag in your alluxio-cluster.yaml to the new version, then apply the configuration change.
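A minimal sketch of the change, assuming the image and imageTag fields from your original cluster definition (all other fields stay unchanged):
# alluxio-cluster.yaml (only the relevant fields shown)
apiVersion: k8s-operator.alluxio.com/v1
kind: AlluxioCluster
metadata:
  name: alluxio-cluster
spec:
  image: <PRIVATE_REGISTRY>/<ALLUXIO_IMAGE>   # unchanged
  imageTag: AI-3.7-13.0.0                     # new version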
# Apply the updated cluster definition
$ kubectl apply -f alluxio-cluster.yaml
alluxiocluster.k8s-operator.alluxio.com/alluxio-cluster configured
# Monitor the rolling upgrade process
$ kubectl -n alx-ns get pod
NAME READY STATUS RESTARTS AGE
alluxio-cluster-coordinator-0 0/1 Init:0/2 0 7s
...
alluxio-cluster-worker-58999f8ddd-cd6r2 0/1 Init:0/2 0 7s
alluxio-cluster-worker-5d6786f5bf-cxv5j 1/1 Running 0 10m
# Check the cluster status until it returns to 'Ready'
$ kubectl -n alx-ns get alluxiocluster
NAME CLUSTERPHASE AGE
alluxio-cluster Updating 10m
...
NAME CLUSTERPHASE AGE
alluxio-cluster Ready 12m
# Verify the new version is running
$ kubectl -n alx-ns exec -it alluxio-cluster-coordinator-0 -- alluxio info version 2>/dev/null
AI-3.7-13.0.0
During the rolling upgrade, workers are restarted in batches; all workers in the current batch must be fully ready before the next batch starts. The default batch size is 10% of the workers.
Reducing the batch size minimizes interruption to running workloads, at the cost of a longer upgrade. To control the proportion, or to set an exact number of workers to restart at a time, set the following in alluxio-cluster.yaml:
apiVersion: k8s-operator.alluxio.com/v1
kind: AlluxioCluster
spec:
  worker:
    rollingUpdate:
      maxUnavailable: 1  # by default this is 10%; it can be set to a percentage or an exact number of workers
Dynamically Updating Configuration
You can change Alluxio properties in a running cluster by editing its ConfigMap.
Find the ConfigMap for your cluster.
$ kubectl -n alx-ns get configmap
NAME                           DATA   AGE
alluxio-cluster-alluxio-conf   4      7m48s
...
Edit the ConfigMap to modify alluxio-site.properties, alluxio-env.sh, etc.
$ kubectl -n alx-ns edit configmap alluxio-cluster-alluxio-conf
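As an illustration, the edited ConfigMap might carry a property change such as the one below; the fragment is abbreviated and only shows the shape of the data, using a property discussed later in this guide:
# Illustrative fragment of the edited ConfigMap
data:
  alluxio-site.properties: |
    ...
    alluxio.user.worker.selection.policy.consistent.hash.virtual.node.count.per.worker=3000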
Restart components to apply the changes.
Coordinator:
kubectl -n alx-ns rollout restart statefulset alluxio-cluster-coordinator
Workers:
kubectl -n alx-ns rollout restart deployment alluxio-cluster-worker
DaemonSet FUSE:
kubectl -n alx-ns rollout restart daemonset alluxio-fuse
CSI FUSE: These pods must be restarted by restarting the application pod that mounts the FUSE volume, or by manually deleting the FUSE pod (kubectl -n alx-ns delete pod <fuse-pod-name>).
2. Hash Ring Management
Alluxio uses a consistent hash ring to map data to workers in a decentralized manner. You can fine-tune its behavior to optimize for different cluster environments and workloads.
Configuring the Hash Ring Mode
The consistent hash ring can operate in two modes: dynamic (default) or static.
In dynamic mode, the hash ring includes only online workers. When a worker goes offline, it is removed from the ring, and its virtual nodes are redistributed. This is ideal for most use cases, providing high adaptability.
In static mode, the hash ring includes all registered workers, regardless of their online status. This is useful in scenarios where you need a consistent view of all workers to minimize data re-fetching from UFS, even if some workers are temporarily unavailable.
To configure the mode, set the alluxio.user.dynamic.consistent.hash.ring.enabled property: true enables dynamic mode (the default) and false enables static mode.
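A minimal alluxio-site.properties sketch switching the ring to static mode:
# alluxio-site.properties
alluxio.user.dynamic.consistent.hash.ring.enabled=false   # true (default) enables dynamic mode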
Adjusting Virtual Nodes for Load Balancing
To ensure an even distribution of data and I/O requests, Alluxio uses virtual nodes. Each worker is mapped to multiple virtual nodes on the hash ring, which helps to balance the load more effectively across the cluster.
You can adjust the number of virtual nodes per worker by configuring the alluxio.user.worker.selection.policy.consistent.hash.virtual.node.count.per.worker property (default: 2000). Adjusting this value can help fine-tune load distribution, especially in clusters with diverse workloads or a small number of workers.
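For example, raising the count in alluxio-site.properties (the value here is purely illustrative):
# alluxio-site.properties
alluxio.user.worker.selection.policy.consistent.hash.virtual.node.count.per.worker=3000   # default: 2000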
Optimizing for Heterogeneous Workers
By default, the consistent hashing algorithm assumes that all workers have equal capacity. In clusters with heterogeneous workers (e.g., different storage capacities or network speeds), you can enable capacity-based allocation for more balanced resource utilization. This ensures that workers with more storage handle a proportionally larger share of data.
To enable this, set the alluxio.user.worker.selection.policy.consistent.hash.provider.impl property to CAPACITY. The default value is DEFAULT, which allocates an equal number of virtual nodes to each worker.
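A minimal sketch enabling capacity-based allocation in alluxio-site.properties:
# alluxio-site.properties
alluxio.user.worker.selection.policy.consistent.hash.provider.impl=CAPACITY   # default: DEFAULT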
3. Worker Management
Alluxio's decentralized architecture relies on workers that are managed via a consistent hash ring.
Checking Worker Status
To see a list of all registered workers and their current status (online or offline):
bin/alluxio info nodes
Adding a New Worker
To add a new worker to the cluster:
Install the Alluxio software on the new node.
Ensure the alluxio-site.properties file is configured to point to your etcd cluster (see the sketch after these steps).
Start the worker process. It will automatically register itself in etcd and join the consistent hash ring.
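A hedged sketch of the etcd-related settings on the new worker; the property names below are assumptions based on a typical etcd-backed membership setup and should be verified against your deployment's configuration reference:
# alluxio-site.properties on the new worker (property names are assumptions)
alluxio.worker.membership.manager.type=ETCD
alluxio.etcd.endpoints=http://<etcd-host-1>:2379,http://<etcd-host-2>:2379,http://<etcd-host-3>:2379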
Removing a Worker Permanently
If you need to decommission a worker permanently:
Shut down the worker process on the target node.
Get the worker ID by running bin/alluxio info nodes.
Remove the worker using its ID.
bin/alluxio process remove-worker -n <worker_id>
Verify removal by running bin/alluxio info nodes again.
Important: Removing a worker is a permanent action that will cause its portion of the hash ring to be redistributed, potentially causing a temporary increase in cache misses.
Restarting a Worker
If you restart a worker for maintenance, it will be temporarily marked as offline. As long as its identity is preserved (via alluxio.worker.identity.uuid.file.path), it will rejoin the cluster with its cached data intact and available.
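To keep the identity stable across restarts, point the UUID file at storage that survives a pod restart; the path below is only an example:
# alluxio-site.properties (illustrative path on persistent storage)
alluxio.worker.identity.uuid.file.path=/mnt/alluxio/worker_identity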
4. UFS Mount Management
Alluxio's unified namespace allows you to mount multiple under storage systems (UFS) into a single logical view. The mount table that manages these connections is stored in etcd for high availability. Alluxio components periodically poll etcd for the latest mount table, so any changes are automatically propagated throughout the cluster.
Managing Mount Points
Use the UnderFileSystem resource to manage UFS mounts. To add a new mount, see the Connect to Storage guide.
List Mounts:
# You can also list all submitted configurations with the Kubernetes CLI
$ kubectl -n alx-ns get ufs
NAME         PHASE   AGE
alluxio-s3   Ready   13d
Remove a Mount:
$ kubectl -n alx-ns delete ufs alluxio-s3
underfilesystem.k8s-operator.alluxio.com "alluxio-s3" deleted
Configuring UFS Credentials
You can provide credentials for a UFS globally or on a per-mount basis.
Global Configuration: Set properties for all mounts of a certain type (e.g., all S3 mounts) in alluxio-site.properties.
# alluxio-site.properties
s3a.accessKeyId=<S3_ACCESS_KEY>
s3a.secretKey=<S3_SECRET_KEY>
Per-Mount Configuration (Recommended): Provide credentials as options during the mount command. This is the most flexible and secure method, ideal for connecting to multiple systems with different credentials. Per-mount options override global settings.
apiVersion: k8s-operator.alluxio.com/v1
kind: UnderFileSystem
metadata:
  name: alluxio-s3
  namespace: alx-ns
spec:
  alluxioCluster: alluxio-cluster
  path: s3://bucket-a/images
  mountPath: /s3-images
  mountOptions:
    s3a.accessKeyId: <S3_ACCESS_KEY_ID>
    s3a.secretKey: <S3_SECRET_KEY>
Rules for Mounting
When defining your namespace, you must follow two important rules to ensure a valid and unambiguous mount table.
Rule 1: Mounts Must Be Direct Children of the Root (/)
You can only create mount points at the top level of the Alluxio namespace. You cannot mount to the root path (/) itself, nor can you create a mount point inside a non-existent directory.
Examples:

| Scenario | Alluxio Path | UFS Path | Allowed? | Reason |
|---|---|---|---|---|
| Mount a bucket | /s3-data | s3://my-bucket/ | ✔️ Yes | Mount point is a direct child of root. |
| Mount to root | / | s3://my-bucket/ | ❌ No | The root path cannot be a mount point. |
| Mount to a sub-path | /data/images | s3://my-bucket/images/ | ❌ No | Mount points cannot be created in subdirectories. |
Rule 2: Mounts Cannot Be Nested
One mount point cannot be created inside another, either in the Alluxio namespace or in the UFS namespace. For example, if /data is mounted to s3://my-bucket/data, you cannot create a new mount at /data/tables (nested Alluxio path) or mount another UFS to s3://my-bucket/data/tables (nested UFS path).
Example Scenario:
Suppose you have an existing mount point:
Alluxio Path: /data
UFS Path: s3://my-bucket/data
The following new mounts would be invalid:

| Alluxio Path | UFS Path | Allowed? | Reason |
|---|---|---|---|
| /data/tables | hdfs://namenode/tables | ❌ No | The Alluxio path /data/tables is nested inside the existing /data mount. |
| /tables | s3://my-bucket/data/tables | ❌ No | The UFS path s3://.../data/tables is nested inside the existing s3://.../data mount. |
5. Multi-Tenancy and Federation
For large-scale enterprise deployments, Alluxio provides advanced features for multi-tenancy and cluster federation. This allows multiple teams and business units to share data infrastructure securely and efficiently while simplifying administrative overhead.
The reference architecture features an API Gateway that centrally handles authentication and authorization across multiple Alluxio clusters.
Core Concepts
Authentication
Alluxio integrates with external enterprise identity providers like Okta. When a user logs in, the provider authenticates them and generates a JSON Web Token (JWT). This JWT is then sent with every subsequent request to the Alluxio API Gateway to verify the user's identity.
Authorization
Once a user is authenticated, Alluxio uses an external policy engine, Open Policy Agent (OPA), to determine what actions the user is authorized to perform. Administrators can write fine-grained access control policies in OPA's declarative language, Rego, to control which users can access which resources. The API Gateway queries OPA for every request to ensure it is authorized.
Multi-Tenancy and Isolation
Alluxio enforces isolation between tenants to ensure security and prevent interference. This is achieved through:
User Roles: Defining different roles with specific access levels and permissions.
Cache Isolation: Assigning tenant-specific cache configurations, including quotas, TTLs, and eviction policies, ensuring one tenant's workload does not negatively impact another's.
Cluster Federation
For organizations with multiple Alluxio clusters (e.g., across different regions or for different business units), federation simplifies management. A central Management Console provides a single pane of glass for:
Cross-cluster monitoring and metrics.
Executing operations across multiple clusters simultaneously.
Centralized license management for all clusters.
Example Workflow: Updating a Cache Policy
This workflow demonstrates how the components work together:
Authentication: A user logs into the Management Console, which redirects them to Okta for authentication. Upon success, Okta issues a JWT.
Request Submission: The user uses the console to submit a request to change a cache TTL. The request, containing the JWT, is sent to the API Gateway.
Authorization: The API Gateway validates the JWT and queries the OPA Policy Engine to check if the user has permission to modify cache settings for the target tenant.
Execution: If the request is authorized, the API Gateway forwards the command to the coordinator of the relevant Alluxio cluster, which then applies the new TTL policy.