Cluster Management

This guide covers cluster-wide administration: hardening a basic install for production, ongoing lifecycle operations (scaling, upgrades, dynamic config), and multi-tenancy. For job submission, scheduling, and Coordinator HA, see Job Service. For hash-ring-related operations (worker lifecycle, identity persistence, ring bloat), see Hash Ring and Worker Lifecycle. For per-worker configuration (storage, resources, JVM, network), see Worker Configuration.

1. Production Setup

The basic configuration shown in the Kubernetes Installation guide is suitable for evaluation. For production deployments, apply the additional settings below for HA, resource tuning, persistent metadata, and worker identity.

Label Nodes

A common practice is to assign dedicated nodes to each Alluxio component. This prevents resource contention between components (for example, etcd I/O interfering with worker cache I/O) and gives you predictable placement for capacity planning.

kubectl label nodes <coordinator-node> alluxio-role=coordinator
kubectl label nodes <worker-node-1> alluxio-role=worker
kubectl label nodes <worker-node-2> alluxio-role=worker
kubectl label nodes <worker-node-3> alluxio-role=worker
kubectl label nodes <etcd-node-1> alluxio-role=etcd
kubectl label nodes <etcd-node-2> alluxio-role=etcd
kubectl label nodes <etcd-node-3> alluxio-role=etcd

Worker pods have an anti-affinity rule by default — multiple worker pods will not be scheduled on the same node.

Production alluxio-cluster.yaml

apiVersion: k8s-operator.alluxio.com/v1
kind: AlluxioCluster
metadata:
  name: alluxio-cluster
  namespace: alx-ns
spec:
  image: <PRIVATE_REGISTRY>/alluxio-enterprise
  imageTag: AI-3.8-15.1.0
  properties:
    alluxio.license: <YOUR_CLUSTER_LICENSE>
  coordinator:
    nodeSelector:
      alluxio-role: coordinator
    metastore:
      type: persistentVolumeClaim
      storageClass: "gp2"
      size: 4Gi
    resources:
      # Set requests equal to limits for Guaranteed QoS, same as the worker.
      limits:
        cpu: "8"
        memory: "16Gi"
      requests:
        cpu: "8"
        memory: "16Gi"
    jvmOptions:
      - "-Xmx8g"
      - "-Xms8g"
  worker:
    nodeSelector:
      alluxio-role: worker
    count: 3
    pagestore:
      size: 1000Gi
      reservedSize: 100Gi
    resources:
      # Set requests equal to limits so the worker runs in the Guaranteed
      # QoS class and is the last to be evicted under node pressure.
      limits:
        cpu: "8"
        memory: "24Gi"
      requests:
        cpu: "8"
        memory: "24Gi"
    jvmOptions:
      - "-Xmx12g"
      - "-Xms12g"
      - "-XX:MaxDirectMemorySize=12g"
  etcd:
    replicaCount: 3
    nodeSelector:
      alluxio-role: etcd

Key differences from the basic configuration:

  • Node selectors: Pin each component to dedicated nodes to prevent resource contention and ensure predictable placement. See the label commands above.

  • Worker count: number of workers depending on the target cache volume and target throughput. For post-deployment scaling, see Scaling the Cluster.

  • ETCD replicas: 3 for quorum-based HA. Deploy on dedicated, stable nodes.

  • Resource limits and JVM options: Explicitly set to prevent OOM. The container memory limit must exceed the sum of -Xmx and -XX:MaxDirectMemorySize. For both workers and the coordinator, set requests equal to limits — this places the pods in the Guaranteed QoS class, so they are the last to be evicted when the node is under memory or CPU pressure. A Burstable pod (requests < limits) can be evicted while still inside its limit if another pod exceeds its request, causing abrupt cache loss (worker) or scheduler disruption (coordinator).

  • Persistent metastore: Coordinator metadata survives pod restarts.

Other important settings for production deployment:

  • License Management: A cluster license is the simplest way to get started. For production environments, a deployment license is recommended. See Appendix C: License Management for details on both options.

  • Hash Ring Configuration: It is critical to configure the hash ring before deployment, as changes can be destructive. For detailed guidance, see Hash Ring Pre-Deployment Configuration.

  • Worker Identity Persistence: Configure worker.systemInfo.hostPath so workers rejoin with the same UUID after restarts. Without this, each restart adds stale OFFLINE entries to the hash ring, degrading cache hit rates. See Restarting a Worker.

  • Heterogeneous Clusters: If your cluster includes workers with different capacities, you must define a specific data distribution strategy. See Heterogeneous Workers for configuration steps.

  • Worker Page Store: The page store is where each worker caches data. Key defaults and options:

    • Default: type: hostPath, hostPath: /mnt/alluxio/pagestore. The worker writes cache to the node's filesystem at that path. On multi-disk nodes, verify this lands on a data disk, not the system disk.

    • Multi-disk nodes: Set pagestore.hostPath explicitly to a data disk (e.g. /mnt/data1/alluxio/pagestore). See Multi-Disk Configuration.

    • Persistent cache across pod restarts: Use a PVC instead of hostPath. See Configuring Page Store Location.

    • Sizing: The size parameter sets the cache capacity; reservedSize allocates space for internal operations (temporary page writes, file metadata caching). Set reservedSize to ~10% of size (10–100 GiB) and ensure the total (size + reservedSize) fits within the worker's storage. For the cache size itself, target at least 110–120% of your working set (the data you expect to keep hot). Alluxio starts evicting at the high-watermark threshold (default 90% of size), so a 10–20% headroom above the working set avoids evicting in-use data during peak load jobs.

  • Advanced Configuration: For resource and JVM tuning, see Worker Configuration — Resource and JVM Tuning. For other settings like external etcd, refer to Appendix B: Advanced Configuration.

Running Multiple Clusters on Shared Nodes

If multiple Alluxio clusters are deployed across different namespaces on the same Kubernetes cluster, services from different clusters may be scheduled onto the same node, causing deployment failures. Label nodes to indicate which cluster they belong to:

Then specify the nodeSelector at the cluster level in each alluxio-cluster.yaml:

2. Cluster Lifecycle and Configuration

This section covers fundamental operations related to the cluster's lifecycle, such as scaling, upgrades, and dynamic configuration updates.

Scaling the Cluster

You can dynamically scale the number of Alluxio workers up or down. Because Alluxio uses a consistent hash ring to map files to workers, adding or removing workers reshards the ring — some files that were mapped to existing workers will now map to the new worker (or to different workers after scale-down). Those slots start empty, so scale operations must be coordinated with running load jobs to avoid coverage gaps.

Scale Up

Step 1: Stop any running load jobs.

Step 2: Increase the worker count and apply.

Modify alluxio-cluster.yaml and increase worker.count (example: 2 → 3):

Step 3: Wait for all workers to become ready.

Alternatively, wait with a timeout:

Step 4: Re-trigger data loading.

After resharding, the new worker holds no cached data — client requests for files that now map to it are served from UFS until those files are loaded onto it. Data already cached on existing workers is not deleted or migrated automatically; those files become orphaned replicas that are no longer routed to and will evict naturally via LRU.

Choose the strategy based on your capacity situation:

Situation
Command
Effect

Cache has spare space; OK to let orphaned replicas evict naturally

job load --skip-if-exists

Loads missing files onto the new worker from UFS; does not touch old workers

Cache is tight, or you want to free space on existing workers now

job rebalance

Copies files to the new worker from existing workers (not UFS) and prunes the orphaned copies

Option A — job load --skip-if-exists

Option B — job rebalance (active redistribution)

job rebalance runs a load phase (copy data to the new worker from existing workers) and a prune phase (delete orphaned copies from old workers). Use --load-bandwidth / --prune-bandwidth to rate-limit I/O if training workloads are running concurrently. Use --skip-prune to skip deletion and let LRU handle cleanup.

Scale Down

Step 1: Stop any running load jobs (same as scale-up Step 1 above).

Step 2: Decrease the worker count and apply.

Wait for the removed worker pods to terminate:

Scaling down removes workers from the hash ring. Their cached data becomes unreachable and will eventually be evicted. Files that were cached only on the removed workers will be served from UFS until re-loaded by passive caching or a new job load.

Step 3: Optionally deregister stale worker entries.

In dynamic mode (default), OFFLINE entries are automatically cleaned up after alluxio.worker.failure.detection.timeout. In static mode, run alluxio process remove-worker for each decommissioned worker — see Removing a Worker Permanently.

Upgrading Alluxio

The upgrade process involves two main steps: upgrading the Alluxio Operator and then upgrading the Alluxio cluster itself.

Step 1: Upgrade the Operator

The operator is stateless and can be safely re-installed without affecting the running Alluxio cluster.

  1. Obtain the new Docker images for the operator and the new Helm chart.

  2. Uninstall the old operator and install the new one.

Uninstall the current operator:

Replace the CRDs from the new Helm chart directory and create the new ones:

Install the new operator using your configuration file (update the image tag):

Step 2: Upgrade the Alluxio Cluster

The operator will perform a rolling upgrade of the Alluxio components.

  1. Upload the new Alluxio Docker images to your registry.

  2. Update the imageTag in your alluxio-cluster.yaml to the new version.

  3. Apply the configuration change.

Apply the updated cluster definition:

Monitor the rolling upgrade process:

Check the cluster status until it returns to 'Ready':

Verify the new version is running:

During the rolling upgrade, workers are restarted in batches, such that the workers in the current batch must be fully ready before the next batch starts. The default batch size is 10% of the workers.

The number of workers in a batch can be reduced to minimize the interruption to running workloads during the period, at the cost of extending the period. To control the proportion or set the exact number of workers to restart, set the following in alluxio-cluster.yaml:

Dynamically Updating Configuration

You can change Alluxio properties in a running cluster by editing its ConfigMap.

  1. Find the ConfigMap for your cluster.

  2. Edit the ConfigMap to modify alluxio-site.properties, alluxio-env.sh, etc.

  3. Restart components to apply the changes.

    • Coordinator: kubectl -n alx-ns rollout restart statefulset alluxio-cluster-coordinator

    • Workers: kubectl -n alx-ns rollout restart deployment alluxio-cluster-worker

    • DaemonSet FUSE: kubectl -n alx-ns rollout restart daemonset alluxio-fuse

    • CSI FUSE: These pods must be restarted by exiting the application pod or by manually deleting the FUSE pod (kubectl -n alx-ns delete pod <fuse-pod-name>).

3. Multi-Tenancy and Federation

For large-scale enterprise deployments, Alluxio provides advanced features for multi-tenancy and cluster federation. This allows multiple teams and business units to share data infrastructure securely and efficiently while simplifying administrative overhead.

The reference architecture below features an API Gateway that centrally handles authentication and authorization across multiple Alluxio clusters.

Core Concepts

Authentication

Alluxio integrates with external enterprise identity providers like Okta. When a user logs in, the provider authenticates them and generates a JSON Web Token (JWT). This JWT is then sent with every subsequent request to the Alluxio API Gateway to verify the user's identity.

Authorization

Once a user is authenticated, Alluxio uses an external policy engine, Open Policy Agent (OPA), to determine what actions the user is authorized to perform. Administrators can write fine-grained access control policies in OPA's declarative language, Rego, to control which users can access which resources. The API Gateway queries OPA for every request to ensure it is authorized.

Multi-Tenancy and Isolation

Alluxio enforces isolation between tenants to ensure security and prevent interference. This is achieved through:

  • User Roles: Defining different roles with specific access levels and permissions.

  • Cache Isolation: Assigning tenant-specific cache configurations, including quotas, TTLs, and eviction policies, ensuring one tenant's workload does not negatively impact another's.

Cluster Federation

For organizations with multiple Alluxio clusters (e.g., across different regions or for different business units), federation simplifies management. A central Management Console provides a single pane of glass for:

  • Cross-cluster monitoring and metrics.

  • Executing operations across multiple clusters simultaneously.

  • Centralized license management for all clusters.

Example Workflow: Updating a Cache Policy

This workflow demonstrates how the components work together:

  1. Authentication: A user logs into the Management Console, which redirects them to Okta for authentication. Upon success, Okta issues a JWT.

  2. Request Submission: The user uses the console to submit a request to change a cache TTL. The request, containing the JWT, is sent to the API Gateway.

  3. Authorization: The API Gateway validates the JWT and queries the OPA Policy Engine to check if the user has permission to modify cache settings for the target tenant.

  4. Execution: If the request is authorized, the API Gateway forwards the command to the coordinator of the relevant Alluxio cluster, which then applies the new TTL policy.

Last updated