# Cluster Management

This guide covers cluster-wide administration: hardening a basic install for production, ongoing lifecycle operations (scaling, upgrades, dynamic config), and multi-tenancy. For job submission, scheduling, and Coordinator HA, see [Job Service](/ee-ai-en/administration/managing-job-service.md). For hash-ring-related operations (worker lifecycle, identity persistence, ring bloat), see [Hash Ring and Worker Lifecycle](/ee-ai-en/administration/managing-ring.md). For per-worker configuration (storage, resources, JVM, network), see [Worker Configuration](/ee-ai-en/administration/managing-worker.md).

## 1. Production Setup

The basic configuration shown in the [Kubernetes Installation](/ee-ai-en/start/installing-on-kubernetes.md) guide is suitable for evaluation. For production deployments, apply the additional settings below for HA, resource tuning, persistent metadata, and worker identity.

### Label Nodes

A common practice is to assign dedicated nodes to each Alluxio component. This prevents resource contention between components (for example, etcd I/O interfering with worker cache I/O) and gives you predictable placement for capacity planning.

```shell
kubectl label nodes <coordinator-node> alluxio-role=coordinator
kubectl label nodes <worker-node-1> alluxio-role=worker
kubectl label nodes <worker-node-2> alluxio-role=worker
kubectl label nodes <worker-node-3> alluxio-role=worker
kubectl label nodes <etcd-node-1> alluxio-role=etcd
kubectl label nodes <etcd-node-2> alluxio-role=etcd
kubectl label nodes <etcd-node-3> alluxio-role=etcd
```

Worker pods have an anti-affinity rule by default — multiple worker pods will not be scheduled on the same node.
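Before deploying, it can help to confirm that each role has the expected number of labeled nodes. The sketch below assumes `kubectl` is pointed at the target cluster; `count_role_nodes` is a hypothetical helper name, not part of Alluxio or kubectl.

```shell
# Hypothetical helper: print how many nodes carry the label alluxio-role=<role>.
# Usage: count_role_nodes <role>
count_role_nodes() {
  kubectl get nodes -l "alluxio-role=$1" --no-headers 2>/dev/null \
    | awk 'END { print NR }'
}

# Example, matching the labels applied above:
#   count_role_nodes worker   # should print the number of labeled worker nodes
#   count_role_nodes etcd
```

Because of the default anti-affinity rule, the worker count reported here is also the maximum number of worker pods that can be scheduled.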

### Production `alluxio-cluster.yaml`

```yaml
apiVersion: k8s-operator.alluxio.com/v1
kind: AlluxioCluster
metadata:
  name: alluxio-cluster
  namespace: alx-ns
spec:
  image: <PRIVATE_REGISTRY>/alluxio-enterprise
  imageTag: AI-3.8-15.1.0
  properties:
    alluxio.license: <YOUR_CLUSTER_LICENSE>
  coordinator:
    nodeSelector:
      alluxio-role: coordinator
    metastore:
      type: persistentVolumeClaim
      storageClass: "gp2"
      size: 4Gi
    resources:
      # Set requests equal to limits for Guaranteed QoS, same as the worker.
      limits:
        cpu: "8"
        memory: "16Gi"
      requests:
        cpu: "8"
        memory: "16Gi"
    jvmOptions:
      - "-Xmx8g"
      - "-Xms8g"
  worker:
    nodeSelector:
      alluxio-role: worker
    count: 3
    pagestore:
      size: 1000Gi
      reservedSize: 100Gi
    resources:
      # Set requests equal to limits so the worker runs in the Guaranteed
      # QoS class and is the last to be evicted under node pressure.
      limits:
        cpu: "8"
        memory: "24Gi"
      requests:
        cpu: "8"
        memory: "24Gi"
    jvmOptions:
      - "-Xmx12g"
      - "-Xms12g"
      - "-XX:MaxDirectMemorySize=12g"
  etcd:
    replicaCount: 3
    nodeSelector:
      alluxio-role: etcd
```

Key differences from the basic configuration:

* **Node selectors**: Pin each component to dedicated nodes to prevent resource contention and ensure predictable placement. See the label commands above.
* **Worker count**: Set the number of workers based on the target cache capacity and throughput. For post-deployment scaling, see [Scaling the Cluster](#scaling-the-cluster).
* **etcd replicas**: 3 for quorum-based HA. Deploy on dedicated, stable nodes.
* **Resource limits and JVM options**: Explicitly set to prevent OOM. The container memory limit must exceed the sum of `-Xmx` and `-XX:MaxDirectMemorySize`. For both workers and the coordinator, set `requests` equal to `limits`. This places the pods in the [Guaranteed QoS class](https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/), so they are the last to be evicted when the node comes under resource pressure (for example, low memory). A Burstable pod (requests < limits) can be evicted while still inside its limit once its usage exceeds its request, causing abrupt cache loss (worker) or scheduler disruption (coordinator).
* **Persistent metastore**: Coordinator metadata survives pod restarts.

Other important settings for production deployment:

* **License Management**: A cluster license is the simplest way to get started. For production environments, a **deployment license** is recommended. See [Appendix C: License Management](https://documentation.alluxio.io/ee-ai-en/administration/pages/P92iY7CDvLx99wPxm2On#c.-license-management) for details on both options.
* **Hash Ring Configuration**: It is critical to configure the hash ring **before** deployment, as changes can be destructive. For detailed guidance, see [Hash Ring Pre-Deployment Configuration](https://documentation.alluxio.io/ee-ai-en/administration/pages/paqXm2SRU7Uo5W1x23MT#id-1.-pre-deployment-configuration).
* **Worker Identity Persistence**: Configure `worker.systemInfo.hostPath` so workers rejoin with the same UUID after restarts. Without this, each restart adds stale `OFFLINE` entries to the hash ring, degrading cache hit rates. See [Restarting a Worker](/ee-ai-en/administration/managing-ring.md#restarting-a-worker).
* **Heterogeneous Clusters**: If your cluster includes workers with different capacities, you must define a specific data distribution strategy. See [Heterogeneous Workers](/ee-ai-en/administration/managing-worker.md#heterogeneous-workers) for configuration steps.
* **Worker Page Store**: The [page store](https://documentation.alluxio.io/ee-ai-en/administration/pages/iRTxT4smG58AwFmCihOx#id-5.-worker-storage-the-page-store) is where each worker caches data. Key defaults and options:
  * **Default:** `type: hostPath`, `hostPath: /mnt/alluxio/pagestore`. The worker writes cache to the node's filesystem at that path. On multi-disk nodes, verify this lands on a data disk, not the system disk.
  * **Multi-disk nodes:** Set `pagestore.hostPath` explicitly to a data disk (e.g. `/mnt/data1/alluxio/pagestore`). See [Multi-Disk Configuration](/ee-ai-en/administration/managing-worker.md#multi-disk-configuration).
  * **Persistent cache across pod restarts:** Use a PVC instead of hostPath. See [Configuring Page Store Location](/ee-ai-en/administration/managing-worker.md#configuring-page-store-location).
  * **Sizing:** The `size` parameter sets the cache capacity; `reservedSize` allocates space for internal operations (temporary page writes, file metadata caching). Set `reservedSize` to \~10% of `size` (10–100 GiB) and ensure the total (size + reservedSize) fits within the worker's storage. For the cache size itself, target **at least 110–120% of your working set** (the data you expect to keep hot). Alluxio starts evicting at the high-watermark threshold (default 90% of `size`), so a 10–20% headroom above the working set avoids evicting in-use data during peak load jobs.
* **Advanced Configuration**: For resource and JVM tuning, see [Worker Configuration — Resource and JVM Tuning](https://documentation.alluxio.io/ee-ai-en/administration/pages/MF8PfNU0P7q8PD9eh6xW#id-2.-resource-and-jvm-tuning). For other settings like external etcd, refer to [Appendix B: Advanced Configuration](https://documentation.alluxio.io/ee-ai-en/administration/pages/P92iY7CDvLx99wPxm2On#b.-advanced-configuration).
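The page store sizing guidance above can be turned into a quick calculation. This is a rough sketch, not an official sizing tool: `pagestore_sizing` is a hypothetical helper, it assumes the upper end of the headroom guidance (120% of the working set), and it reads the "10–100 GiB" note as a lower and upper bound on `reservedSize`.

```shell
# Hypothetical helper: suggest pagestore values from a working-set size in GiB.
# Assumptions: size = 120% of the working set (rounded up), and
# reservedSize = 10% of size, kept within 10-100 GiB.
# Usage: pagestore_sizing <working-set-GiB>
pagestore_sizing() (
  size=$(( ($1 * 120 + 99) / 100 ))
  reserved=$(( size / 10 ))
  if [ "$reserved" -lt 10 ]; then reserved=10; fi
  if [ "$reserved" -gt 100 ]; then reserved=100; fi
  # total is the disk space the worker needs for size + reservedSize.
  echo "size=${size}Gi reservedSize=${reserved}Gi total=$(( size + reserved ))Gi"
)

# Example: an 800 GiB working set
#   pagestore_sizing 800
```

The `total` value is the figure to compare against the free space on the disk that backs the page store path.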

### Running Multiple Clusters on Shared Nodes

If multiple Alluxio clusters are deployed across different namespaces on the same Kubernetes cluster, services from different clusters may be scheduled onto the same node, causing deployment failures. Label nodes to indicate which cluster they belong to:

```shell
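# Use a different node for each cluster; a node can hold only one value for the "cluster" label.
kubectl label nodes <node-for-cluster-a> cluster=alluxio-a
kubectl label nodes <node-for-cluster-b> cluster=alluxio-b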
```

Then specify the `nodeSelector` at the cluster level in each `alluxio-cluster.yaml`:

```yaml
apiVersion: k8s-operator.alluxio.com/v1
kind: AlluxioCluster
spec:
  image: <PRIVATE_REGISTRY>/alluxio-enterprise
  imageTag: AI-3.8-15.1.0
  nodeSelector:
    cluster: alluxio-a
```

## 2. Cluster Lifecycle and Configuration

This section covers fundamental operations related to the cluster's lifecycle, such as scaling, upgrades, and dynamic configuration updates.

### Scaling the Cluster

You can dynamically scale the number of Alluxio workers up or down. Because Alluxio uses a consistent hash ring to map files to workers, adding or removing workers reshards the ring — some files that were mapped to existing workers will now map to the new worker (or to different workers after scale-down). Those slots start empty, so **scale operations must be coordinated with running load jobs** to avoid coverage gaps.

{% hint style="warning" %}
Always stop running load jobs before scaling. A load job running across a ring topology change may produce incomplete cache coverage — files assigned to the new worker after resharding will not have been loaded.
{% endhint %}

#### Scale Up

**Step 1: Stop any running load jobs.**

{% tabs %}
{% tab title="Kubernetes (Operator)" %}

```shell
# List running jobs
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job list --job-type LOAD --job-state RUNNING
# Expected: empty output if no load jobs are running

# Stop each running job by path
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job load --path <path> --stop
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}

```shell
bin/alluxio job list --job-type LOAD --job-state RUNNING
# Expected: empty output if no load jobs are running

bin/alluxio job load --path <path> --stop
```

{% endtab %}
{% endtabs %}
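When several load jobs are running, the stop command has to be repeated per path. A minimal sketch for the Kubernetes case, reusing only the commands shown above; `stop_load_jobs` is a hypothetical helper name, and the paths are whatever the `job list` output reported.

```shell
# Hypothetical helper: stop the load job for each given path.
# Usage: stop_load_jobs <namespace> <path> [<path> ...]
stop_load_jobs() (
  ns=$1
  shift
  for job_path in "$@"; do
    kubectl exec -n "$ns" alluxio-cluster-coordinator-0 -- \
      alluxio job load --path "$job_path" --stop
  done
)
```

After it returns, re-run the `job list` command to confirm that no load jobs remain in the `RUNNING` state.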

**Step 2: Increase the worker count and apply.**

Modify `alluxio-cluster.yaml` and increase `worker.count` (example: 2 → 3):

```shell
kubectl apply -f alluxio-cluster.yaml
```

```console
alluxiocluster.k8s-operator.alluxio.com/alluxio-cluster configured
```

**Step 3: Wait for all workers to become ready.**

```shell
kubectl -n alx-ns get pod -l app.kubernetes.io/component=worker
```

```console
NAME                                          READY   STATUS    RESTARTS   AGE
alluxio-cluster-worker-58999f8ddd-cd6r2       1/1     Running   0          5m21s
alluxio-cluster-worker-58999f8ddd-rtftk       1/1     Running   0          4m21s
alluxio-cluster-worker-58999f8ddd-p6n59       1/1     Running   0          34s
```

Alternatively, wait with a timeout:

```shell
kubectl -n alx-ns wait pod -l app.kubernetes.io/component=worker \
  --for=condition=Ready --timeout=300s
```
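`kubectl wait` only covers pods that already exist when it runs, so it can be useful to also check that the number of ready workers matches the new `worker.count`. The helper below is a sketch under that assumption; `workers_ready` is a hypothetical name, and it relies on the default table columns (`NAME READY STATUS ...`) shown in the output above.

```shell
# Hypothetical helper: succeed only when exactly <expected> worker pods are 1/1 Running.
# Usage: workers_ready <namespace> <expected-count>
workers_ready() (
  ready=$(kubectl -n "$1" get pod -l app.kubernetes.io/component=worker --no-headers \
    | awk '$2 == "1/1" && $3 == "Running" { n++ } END { print n + 0 }')
  echo "ready workers: $ready/$2"
  [ "$ready" -eq "$2" ]
)

# Example, after scaling from 2 to 3 workers:
#   workers_ready alx-ns 3 && echo "safe to re-trigger loading"
```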

**Step 4: Re-trigger data loading.**

After resharding, the new worker holds no cached data, so client requests for files that now map to it are served from UFS until those files are loaded onto it. Data already cached on existing workers is **not deleted or migrated automatically**. Those copies become orphaned replicas: requests are no longer routed to them, and they are evicted naturally via LRU.

Choose the strategy based on your capacity situation:

| Situation                                                          | Command                     | Effect                                                                                        |
| ------------------------------------------------------------------ | --------------------------- | --------------------------------------------------------------------------------------------- |
| Cache has spare space; OK to let orphaned replicas evict naturally | `job load --skip-if-exists` | Loads missing files onto the new worker from UFS; does not touch old workers                  |
| Cache is tight, or you want to free space on existing workers now  | `job rebalance`             | Copies files to the new worker from existing workers (not UFS) and prunes the orphaned copies |

**Option A — `job load --skip-if-exists`**

{% tabs %}
{% tab title="Kubernetes (Operator)" %}

```shell
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job load --path <ufs-or-alluxio-path> --submit --skip-if-exists

# Monitor until SUCCEEDED
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job load --path <ufs-or-alluxio-path> --progress
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}

```shell
bin/alluxio job load --path <ufs-or-alluxio-path> --submit --skip-if-exists

# Monitor until SUCCEEDED
bin/alluxio job load --path <ufs-or-alluxio-path> --progress
```

{% endtab %}
{% endtabs %}

**Option B — `job rebalance` (active redistribution)**

`job rebalance` runs a **load phase** (copy data to the new worker from existing workers) and a **prune phase** (delete orphaned copies from old workers). Use `--load-bandwidth` / `--prune-bandwidth` to rate-limit I/O if training workloads are running concurrently. Use `--skip-prune` to skip deletion and let LRU handle cleanup.

{% tabs %}
{% tab title="Kubernetes (Operator)" %}

```shell
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job rebalance --submit --target ALL

# Monitor until SUCCEEDED
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job rebalance --progress --target ALL
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}

```shell
bin/alluxio job rebalance --submit --target ALL

# Monitor until SUCCEEDED
bin/alluxio job rebalance --progress --target ALL
```

{% endtab %}
{% endtabs %}

#### Scale Down

**Step 1: Stop any running load jobs** (same as scale-up Step 1 above).

**Step 2: Decrease the worker count and apply.**

```shell
kubectl apply -f alluxio-cluster.yaml
```

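Wait for the removed worker pods to terminate. Wait on the specific pods being removed (they show as `Terminating` in `kubectl get pod`); a label selector would also match the workers that remain, so the wait would never complete:

```shell
kubectl -n alx-ns wait pod <removed-worker-pod-name> \
  --for=delete --timeout=120s
```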

Scaling down removes workers from the hash ring. Their cached data becomes unreachable and will eventually be evicted. Files that were cached only on the removed workers will be served from UFS until re-loaded by passive caching or a new `job load`.

**Step 3: Optionally deregister stale worker entries.**

In dynamic mode (default), `OFFLINE` entries are automatically cleaned up after `alluxio.worker.failure.detection.timeout`. In static mode, run `alluxio process remove-worker` for each decommissioned worker — see [Removing a Worker Permanently](/ee-ai-en/administration/managing-ring.md#removing-a-worker-permanently).

### Upgrading Alluxio

The upgrade process involves two main steps: upgrading the Alluxio Operator and then upgrading the Alluxio cluster itself.

#### Step 1: Upgrade the Operator

The operator is stateless and can be safely re-installed without affecting the running Alluxio cluster.

1. Obtain the new Docker images for the operator and the new Helm chart.
2. Uninstall the old operator and install the new one.

Uninstall the current operator:

```shell
helm -n alluxio-operator uninstall operator --wait
```

```console
release "operator" uninstalled
```

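Create any CRDs that are new in this release, then replace the existing ones with the definitions from the new Helm chart directory. `kubectl create` reports an `AlreadyExists` error for CRDs that are already installed; this is expected, and the `replace` command updates them: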

```shell
kubectl create -f alluxio-operator/crds
kubectl replace -f alluxio-operator/crds
```

```console
customresourcedefinition.apiextensions.k8s.io/alluxioclusters.k8s-operator.alluxio.com replaced
customresourcedefinition.apiextensions.k8s.io/clustergroups.k8s-operator.alluxio.com replaced
customresourcedefinition.apiextensions.k8s.io/collectinfoes.k8s-operator.alluxio.com replaced
customresourcedefinition.apiextensions.k8s.io/licenses.k8s-operator.alluxio.com replaced
customresourcedefinition.apiextensions.k8s.io/underfilesystems.k8s-operator.alluxio.com replaced
```

Install the new operator using your configuration file (update the image tag in it first):

```shell
helm -n alluxio-operator install operator -f alluxio-operator.yaml --create-namespace .
```

#### Step 2: Upgrade the Alluxio Cluster

The operator will perform a rolling upgrade of the Alluxio components.

1. Upload the new Alluxio Docker images to your registry.
2. Update the `imageTag` in your `alluxio-cluster.yaml` to the new version.
3. Apply the configuration change.

Apply the updated cluster definition:

```shell
kubectl apply -f alluxio-cluster.yaml
```

```console
alluxiocluster.k8s-operator.alluxio.com/alluxio-cluster configured
```

Monitor the rolling upgrade process:

```shell
kubectl -n alx-ns get pod
```

```console
NAME                                          READY   STATUS     RESTARTS   AGE
alluxio-cluster-coordinator-0                 0/1     Init:0/2   0          7s
...
alluxio-cluster-worker-58999f8ddd-cd6r2       0/1     Init:0/2   0          7s
alluxio-cluster-worker-5d6786f5bf-cxv5j       1/1     Running    0          10m
```

Check the cluster status until it returns to `Ready`:

```shell
kubectl -n alx-ns get alluxiocluster
```

```console
NAME              CLUSTERPHASE   AGE
alluxio-cluster   Updating       10m
...
NAME              CLUSTERPHASE   AGE
alluxio-cluster   Ready          12m
```
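Instead of re-running the status command by hand, the check can be polled. This is a sketch only: `wait_cluster_ready` is a hypothetical helper, the 10-second interval and default of 60 attempts are arbitrary choices, and it assumes `CLUSTERPHASE` is the second column of the default table output, as in the sample above.

```shell
# Hypothetical helper: poll until the cluster reports the Ready phase.
# Usage: wait_cluster_ready <namespace> <cluster-name> [<max-attempts>]
wait_cluster_ready() (
  attempts=${3:-60}
  i=0
  phase=unknown
  while [ "$i" -lt "$attempts" ]; do
    # Column 2 is CLUSTERPHASE in the default table output.
    phase=$(kubectl -n "$1" get alluxiocluster "$2" --no-headers | awk '{ print $2 }')
    if [ "$phase" = "Ready" ]; then
      echo "cluster $2 is Ready"
      exit 0
    fi
    i=$(( i + 1 ))
    sleep 10
  done
  echo "cluster $2 not Ready after $attempts checks (last phase: $phase)" >&2
  exit 1
)

# Example:
#   wait_cluster_ready alx-ns alluxio-cluster
```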

Verify the new version is running:

```shell
kubectl -n alx-ns exec -it alluxio-cluster-coordinator-0 -- alluxio info version 2>/dev/null
```

```console
AI-3.8-15.1.0
```

During the rolling upgrade, workers are restarted in batches. Every worker in the current batch must be fully ready before the next batch starts. The default batch size is 10% of the workers.

A smaller batch reduces the interruption to running workloads but lengthens the upgrade. To set the batch size as a percentage or as an exact number of workers, set the following in `alluxio-cluster.yaml`:

```yaml
apiVersion: k8s-operator.alluxio.com/v1
kind: AlluxioCluster
spec:
  worker:
    rollingUpdate:
      maxUnavailable: 1  # default is 10%; accepts a percentage or an exact number of workers
```
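To get a feel for the trade-off, the batch size and the number of batches can be estimated from the worker count. The sketch below is illustrative arithmetic only: `rolling_batches` is a hypothetical helper, and it assumes a percentage is rounded down to a whole number of workers with a minimum of 1, which may not match the operator's exact rounding.

```shell
# Hypothetical helper: estimate batch size and batch count for a rolling upgrade.
# Assumption: a percentage maxUnavailable is rounded down, with a minimum of 1.
# Usage: rolling_batches <worker-count> <maxUnavailable-percent>
rolling_batches() (
  batch=$(( $1 * $2 / 100 ))
  if [ "$batch" -lt 1 ]; then batch=1; fi
  batches=$(( ($1 + batch - 1) / batch ))   # ceiling division
  echo "batch_size=$batch batches=$batches"
)

# Example: 25 workers at the default 10%
#   rolling_batches 25 10
```

Under these assumptions, a 25-worker cluster at 10% restarts 2 workers at a time over 13 batches, while `maxUnavailable: 1` would take 25 batches.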

### Dynamically Updating Configuration

You can change Alluxio properties in a running cluster by editing its ConfigMap.

1. **Find the ConfigMap** for your cluster.

   ```shell
   kubectl -n alx-ns get configmap
   ```

   ```console
   NAME                              DATA   AGE
   alluxio-cluster-alluxio-conf      4      7m48s
   ...
   ```
2. **Edit the ConfigMap** to modify `alluxio-site.properties`, `alluxio-env.sh`, etc.

   ```shell
   kubectl -n alx-ns edit configmap alluxio-cluster-alluxio-conf
   ```
3. **Restart components** to apply the changes.
   * **Coordinator:** `kubectl -n alx-ns rollout restart statefulset alluxio-cluster-coordinator`
   * **Workers:** `kubectl -n alx-ns rollout restart deployment alluxio-cluster-worker`
   * **DaemonSet FUSE:** `kubectl -n alx-ns rollout restart daemonset alluxio-fuse`
   * **CSI FUSE:** Restart these pods by stopping the application pod that uses them, or by manually deleting the FUSE pod (`kubectl -n alx-ns delete pod <fuse-pod-name>`).
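The coordinator and worker restarts can be chained so that each rollout finishes before the next begins. A minimal sketch, assuming the resource names used in the commands above; `restart_alluxio` is a hypothetical helper and the 600-second timeout is an arbitrary choice.

```shell
# Hypothetical helper: restart the coordinator, then the workers,
# waiting for each rollout to complete before moving on.
# Usage: restart_alluxio <namespace>
restart_alluxio() (
  for target in statefulset/alluxio-cluster-coordinator deployment/alluxio-cluster-worker; do
    kubectl -n "$1" rollout restart "$target" || exit 1
    kubectl -n "$1" rollout status "$target" --timeout=600s || exit 1
  done
)

# Example:
#   restart_alluxio alx-ns
```

`kubectl rollout status` exits non-zero if the rollout does not finish within the timeout, so the helper stops before restarting the workers if the coordinator fails to come back.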

## 3. Multi-Tenancy and Federation

For large-scale enterprise deployments, Alluxio provides features for multi-tenancy and cluster federation. These let multiple teams and business units share data infrastructure securely while reducing administrative overhead.

The reference architecture uses an API Gateway that centrally handles authentication and authorization across multiple Alluxio clusters.

### Core Concepts

#### Authentication

Alluxio integrates with external enterprise identity providers like **Okta**. When a user logs in, the provider authenticates them and generates a [**JSON Web Token (JWT)**](https://auth0.com/docs/secure/tokens/access-tokens/access-token-profiles). This JWT is then sent with every subsequent request to the Alluxio API Gateway to verify the user's identity.

#### Authorization

Once a user is authenticated, Alluxio uses an external policy engine, [**Open Policy Agent (OPA)**](https://www.openpolicyagent.org/docs/latest/#running-opa), to determine what actions the user is authorized to perform. Administrators can write fine-grained access control policies in OPA's declarative language, **Rego**, to control which users can access which resources. The API Gateway queries OPA for every request to ensure it is authorized.

#### Multi-Tenancy and Isolation

Alluxio enforces isolation between tenants to ensure security and prevent interference. This is achieved through:

* **User Roles:** Defining different roles with specific access levels and permissions.
* **Cache Isolation:** Assigning tenant-specific cache configurations, including quotas, TTLs, and eviction policies, ensuring one tenant's workload does not negatively impact another's.

### Cluster Federation

For organizations with multiple Alluxio clusters (e.g., across different regions or for different business units), federation simplifies management. A central **Management Console** provides a single pane of glass for:

* Cross-cluster monitoring and metrics.
* Executing operations across multiple clusters simultaneously.
* Centralized license management for all clusters.

### Example Workflow: Updating a Cache Policy

This workflow demonstrates how the components work together:

1. **Authentication:** A user logs into the **Management Console**, which redirects them to **Okta** for authentication. Upon success, Okta issues a JWT.
2. **Request Submission:** The user uses the console to submit a request to change a cache TTL. The request, containing the JWT, is sent to the **API Gateway**.
3. **Authorization:** The API Gateway validates the JWT and queries the **OPA Policy Engine** to check if the user has permission to modify cache settings for the target tenant.
4. **Execution:** If the request is authorized, the API Gateway forwards the command to the coordinator of the relevant Alluxio cluster, which then applies the new TTL policy.


