# Worker Configuration

This guide covers configuration for individual Alluxio workers: storage backends, capacity sizing, resource limits, JVM tuning, and network binding. For hash-ring-related worker operations (adding, removing, restarting, identity persistence), see [Hash Ring and Worker Lifecycle](/ee-ai-en/administration/managing-ring.md). For cluster-wide operations (scaling, upgrades, coordinator, UFS, multi-tenancy), see [Cluster Management](/ee-ai-en/administration/managing-alluxio.md).

## 1. Worker Storage

Each worker caches data in a local [page store](https://documentation.alluxio.io/ee-ai-en/administration/pages/iRTxT4smG58AwFmCihOx#id-5.-worker-storage-the-page-store). This section covers the storage backend choice, capacity sizing, and disk layout.

### Configuring Page Store Location

{% tabs %}
{% tab title="Kubernetes (Operator)" %}
The Operator supports two page store backends.

**Default (hostPath):** The worker writes cached data to the node's filesystem at `/mnt/alluxio/pagestore`.

```yaml
spec:
  worker:
    pagestore:
      # Defaults to hostPath: /mnt/alluxio/pagestore on the node's filesystem.
      size: 100Gi
      reservedSize: 10Gi
```

**PVC-backed:** To persist worker cache data across pod restarts or rescheduling, specify a PersistentVolumeClaim (PVC) for the page store.

```yaml
spec:
  worker:
    pagestore:
      type: persistentVolumeClaim
      storageClass: ""    # defaults to "standard"; empty string = static binding
      size: 100Gi
      reservedSize: 10Gi
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}
Set the page store directory and size via Alluxio properties. On Docker, pass them through `ALLUXIO_JAVA_OPTS` and mount the host directory as a volume so the cache survives container recreation:

```shell
docker run ... \
  -v /data/alluxio-cache:/data/alluxio-cache \
  -e ALLUXIO_JAVA_OPTS=" \
    -Dalluxio.worker.page.store.dirs=/data/alluxio-cache \
    -Dalluxio.worker.page.store.sizes=<CACHE_SIZE>" \
  alluxio/alluxio-enterprise worker
```

On bare-metal, set the same properties in `conf/alluxio-site.properties`:

```properties
alluxio.worker.page.store.dirs=/data/alluxio-cache
alluxio.worker.page.store.sizes=<CACHE_SIZE>
```

{% endtab %}
{% endtabs %}

### Sizing the Page Store

* `size`: Per-worker cache capacity. Must not exceed available disk on the worker node.
* `reservedSize`: Space reserved for internal operations (temporary page writes, metadata caching). Set to \~10% of `size`, typically 10–100 GiB.
* Ensure `size + reservedSize ≤ available disk space`.

Cloud providers advertise disk size in GB (base-10), while Kubernetes `Gi` is base-2: a "1000 GB" EBS volume provides \~931 GiB. Set `size` to \~90% of the space actually available (check with `df -h <page-store-path>`) to leave room for filesystem overhead and `reservedSize`. If `size` is too large, the worker crashes with `quota (NNN) exceeds the total disk space`.
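
The unit conversion and the 90%/10% guidance above can be sketched as a small shell calculation. This is a minimal sketch, not an Alluxio tool: `gb_to_gib` and `suggest_pagestore` are hypothetical helper names, and rounding down is an assumption chosen to stay on the safe side. In practice, pass `suggest_pagestore` the available space reported by `df` rather than the raw volume size.

```shell
#!/usr/bin/env bash

# Hypothetical helper: convert an advertised size in GB (base-10)
# to whole GiB (base-2), rounding down.
gb_to_gib() {
  local gb=$1
  echo $(( gb * 1000000000 / 1073741824 ))
}

# Hypothetical helper: suggest pagestore values from the available GiB,
# following the guidance above: size ~= 90% of available space,
# reservedSize ~= 10% of size.
suggest_pagestore() {
  local avail_gib=$1
  local size=$(( avail_gib * 90 / 100 ))
  local reserved=$(( size / 10 ))
  echo "size: ${size}Gi"
  echo "reservedSize: ${reserved}Gi"
}

gb_to_gib 1000          # prints 931
suggest_pagestore 931   # prints "size: 837Gi" then "reservedSize: 83Gi"
```

For 931 GiB of available space this suggests 837 Gi plus 83 Gi reserved, 920 Gi in total, which satisfies `size + reservedSize ≤ available disk space`.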

### Multi-Disk Configuration

For nodes with multiple data disks, configure the page store to span all of them — this distributes page I/O across disks and increases aggregate throughput. Use comma-separated paths and sizes:

```yaml
spec:
  worker:
    pagestore:
      hostPath: /mnt/disk1/alluxio/pagestore,/mnt/disk2/alluxio/pagestore
      size: 800Gi,800Gi
      reservedSize: 100Gi
```

Each directory must map to a separate physical disk — colocating multiple page store directories on the same disk provides no benefit.

An alternative is to use RAID 0 at the OS level to present multiple disks as a single logical volume, then configure a single `hostPath`. This simplifies the Alluxio configuration but couples the lifetime of all disks — a single disk failure loses the entire array.

### Heterogeneous Workers

When a cluster has workers with different disk specifications (e.g., one group with a 1 TB disk, another with two 800 GB disks), use `workerGroups` to define a distinct configuration for each group.

The `workerGroups` mechanism is specific to the Kubernetes Operator. On Docker/bare-metal deployments, run each worker with its own `alluxio-site.properties` file that specifies that node's page store paths and sizes.

**Step 1: Group and label nodes.**

```shell
# Label nodes with one disk
kubectl label nodes <node-name> apps.alluxio.com/disks=1
# Label nodes with two disks
kubectl label nodes <node-name> apps.alluxio.com/disks=2
```

**Step 2: Define worker groups and enable the capacity-aware hash ring.**

For heterogeneous clusters, set `alluxio.user.worker.selection.policy.consistent.hash.provider.impl` to `CAPACITY` so workers with more storage receive a proportionally larger share of data. For details on this property, see [Optimizing for Heterogeneous Workers](/ee-ai-en/administration/managing-ring.md#optimizing-for-heterogeneous-workers).

```yaml
apiVersion: k8s-operator.alluxio.com/v1
kind: AlluxioCluster
spec:
  properties:
    alluxio.user.worker.selection.policy.consistent.hash.provider.impl: CAPACITY

  # Common configurations for all workers
  worker:
    resources:
      limits:
        memory: 40Gi
      requests:
        memory: 36Gi
    jvmOptions: ["-Xmx20g", "-Xms20g", "-XX:MaxDirectMemorySize=16g"]

  # Define specific configurations for each worker group
  workerGroups:
  - worker:
      count: 10
      nodeSelector:
        apps.alluxio.com/disks: "1"
      pagestore:
        hostPath: /mnt/disk1/alluxio/pagestore
        size: 1Ti
        reservedSize: 100Gi
  - worker:
      count: 12
      nodeSelector:
        apps.alluxio.com/disks: "2"
      pagestore:
        hostPath: /mnt/disk1/alluxio/pagestore,/mnt/disk2/alluxio/pagestore
        size: 800Gi,800Gi
        reservedSize: 100Gi
```

{% hint style="info" %}
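Worker groups add flexibility, but every node matched by a group's `nodeSelector` must have the disk layout that the group's `pagestore` settings assume: the same mount paths and enough capacity on each disk. A mislabeled node can lead to errors such as the `quota (NNN) exceeds the total disk space` crash described in Sizing the Page Store.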
{% endhint %}
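
To see what a proportional share means for the two groups in the example, the following sketch computes each worker's share of the total cache capacity. It assumes that `CAPACITY` weights a worker by its total page store size, which is how this page describes the policy; the actual distribution on a real hash ring will vary. `capacity_share_percent` is a hypothetical helper name.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print one worker's share of total capacity
# as a percentage with two decimals.
capacity_share_percent() {
  local worker_gib=$1 total_gib=$2
  awk -v w="$worker_gib" -v t="$total_gib" 'BEGIN { printf "%.2f\n", 100 * w / t }'
}

# Group 1: 10 workers with 1Ti (1024 Gi) each.
# Group 2: 12 workers with 800Gi,800Gi (1600 Gi) each.
total_gib=$(( 10 * 1024 + 12 * 1600 ))
echo "$total_gib"                            # prints 29440

capacity_share_percent 1024 "$total_gib"     # prints 3.48
capacity_share_percent 1600 "$total_gib"     # prints 5.43
```

Under this assumption, each two-disk worker is expected to hold about 5.43% of the data and each one-disk worker about 3.48%, rather than every one of the 22 workers holding an equal share of roughly 4.55%.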

## 2. Resource and JVM Tuning

{% tabs %}
{% tab title="Kubernetes (Operator)" %}
Configure per-component resource limits and JVM options in `alluxio-cluster.yaml`:

```yaml
apiVersion: k8s-operator.alluxio.com/v1
kind: AlluxioCluster
spec:
  worker:
    count: 2
    resources:
      # For production workers, set requests equal to limits so the pod
      # runs in the Guaranteed QoS class and is the last to be evicted
      # under node pressure.
      limits:
        cpu: "12"
        memory: "36Gi"
      requests:
        cpu: "12"
        memory: "36Gi"
    jvmOptions:
      - "-Xmx22g"
      - "-Xms22g"
      - "-XX:MaxDirectMemorySize=10g"
  coordinator:
    resources:
      # Coordinator should also run as Guaranteed QoS — set requests == limits.
      limits:
        cpu: "12"
        memory: "36Gi"
      requests:
        cpu: "12"
        memory: "36Gi"
    jvmOptions:
      - "-Xmx4g"
      - "-Xms1g"
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}
On Docker, set JVM options through `ALLUXIO_JAVA_OPTS` and the container resource limits via `--memory` and `--cpus`:

```shell
docker run ... \
  --memory=36g \
  --cpus=12 \
  -e ALLUXIO_JAVA_OPTS=" \
    -Xmx22g -Xms22g -XX:MaxDirectMemorySize=10g" \
  alluxio/alluxio-enterprise worker
```

On bare-metal, set JVM options in `conf/alluxio-env.sh`:

```shell
ALLUXIO_WORKER_JAVA_OPTS+=" -Xmx22g -Xms22g -XX:MaxDirectMemorySize=10g"
ALLUXIO_COORDINATOR_JAVA_OPTS+=" -Xmx4g -Xms1g"
```

{% endtab %}
{% endtabs %}

### Memory Limit Formula

```
memory limit ≥ -Xmx + -XX:MaxDirectMemorySize + 2–4 GiB (JVM overhead)
```

For the worker config above (`-Xmx22g`, `-XX:MaxDirectMemorySize=10g`), the minimum limit is 22 + 10 + 2 = 34 GiB. The example sets 36 GiB, which covers the full 4 GiB overhead allowance.
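
The formula can be checked with a few lines of shell arithmetic. `min_memory_limit_gib` is a hypothetical helper, not part of Alluxio; it takes whole GiB values and defaults the overhead to the low end of the 2–4 GiB range.

```shell
#!/usr/bin/env bash

# Hypothetical helper: minimum container memory limit in GiB.
# Usage: min_memory_limit_gib XMX_GIB DIRECT_GIB [OVERHEAD_GIB]
min_memory_limit_gib() {
  local xmx=$1 direct=$2 overhead=${3:-2}
  echo $(( xmx + direct + overhead ))
}

min_memory_limit_gib 22 10     # prints 34
min_memory_limit_gib 22 10 4   # prints 36
min_memory_limit_gib 20 16     # prints 38
```

The last call uses the JVM options from the heterogeneous workers example (`-Xmx20g`, `-XX:MaxDirectMemorySize=16g`); the 40Gi limit set there is above the 38 GiB minimum.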

{% hint style="info" %}
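If `-XX:MaxDirectMemorySize` is omitted, the JVM defaults it to roughly the same value as `-Xmx`. By the formula above, the container limit must then be at least 2 × `-Xmx` plus 2–4 GiB of overhead, which for heaps of a few GiB works out to 2.5× `-Xmx` or more.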
{% endhint %}

### Diagnosing OOM

If a worker is killed due to OOM (exit code 137), confirm the cause using the commands below.

{% tabs %}
{% tab title="Kubernetes (Operator)" %}

```shell
# Check pod events
kubectl describe pod <worker-pod-name> -n alx-ns
# Look for "OOMKilled" or "Exit Code: 137"

# Check the previous container's logs (before crash)
kubectl logs <worker-pod-name> -n alx-ns --previous | tail -n 50
# Look for "java.lang.OutOfMemoryError"
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}

```shell
# Docker: inspect the exited container
docker inspect <container-id> --format='{{.State.OOMKilled}} {{.State.ExitCode}}'
# "true 137" indicates OOM

# Tail worker logs for OutOfMemoryError
tail -n 100 /opt/alluxio/logs/worker.log | grep -i OutOfMemoryError
```

{% endtab %}
{% endtabs %}

| Symptom                                            | Root Cause                                            | Fix                                                                      |
| -------------------------------------------------- | ----------------------------------------------------- | ------------------------------------------------------------------------ |
| `Exit Code 137`, no Java error                     | Container limit exceeded; killed by Linux OOM killer  | Increase `resources.limits.memory` (Docker: `--memory`)                  |
| `java.lang.OutOfMemoryError: Java heap space`      | `-Xmx` too small                                      | Increase `-Xmx` and raise container limit accordingly                    |
| `java.lang.OutOfMemoryError: Direct buffer memory` | `-XX:MaxDirectMemorySize` too small                   | Increase `-XX:MaxDirectMemorySize` and raise container limit accordingly |
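
The table can be applied mechanically to a saved log. The sketch below is illustrative only: `classify_oom` is a hypothetical helper, and it assumes the worker log (for example, the output of `kubectl logs ... --previous`) has already been written to a file. It matches only the three signatures listed in the table.

```shell
#!/usr/bin/env bash

# Hypothetical helper: map an exit code and a saved worker log
# to the matching row of the table above.
# Usage: classify_oom EXIT_CODE LOG_FILE
classify_oom() {
  local exit_code=$1 log_file=$2
  if grep -qF 'java.lang.OutOfMemoryError: Java heap space' "$log_file"; then
    echo "heap: increase -Xmx and raise the container limit accordingly"
  elif grep -qF 'java.lang.OutOfMemoryError: Direct buffer memory' "$log_file"; then
    echo "direct: increase -XX:MaxDirectMemorySize and raise the container limit accordingly"
  elif [ "$exit_code" = "137" ]; then
    echo "container-limit: increase the container memory limit"
  else
    echo "unknown: no OOM signature found"
  fi
}

# Exit code 137 with no Java error in the log (an empty log here):
classify_oom 137 /dev/null
```

The Java errors are checked first because a heap or direct-memory error points to a specific JVM flag, while exit code 137 with no Java error indicates the container limit itself was exceeded.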

## 3. Worker Network Configuration

### Binding the Worker to a Specific NIC

To bind the worker to a specific local NIC (and its associated IP), set the bind device for each worker service. The example uses `NIC1` as a placeholder; replace it with the interface name on the host (for example, `eth0`):

```properties
alluxio.worker.rpc.bind.device=NIC1
alluxio.worker.web.bind.device=NIC1
alluxio.worker.data.bind.device=NIC1
```