# Worker Configuration

This guide covers configuration for individual Alluxio workers: storage backends, capacity sizing, resource limits, JVM tuning, and network binding. For hash-ring-related worker operations (adding, removing, restarting, identity persistence), see [Hash Ring and Worker Lifecycle](/ee-ai-en/ai-3.8-15.1.x/administration/managing-ring.md). For cluster-wide operations (scaling, upgrades, coordinator, UFS, multi-tenancy), see [Cluster Management](/ee-ai-en/ai-3.8-15.1.x/administration/managing-alluxio.md).

## 1. Worker Storage

Each worker caches data in a local [page store](https://documentation.alluxio.io/ee-ai-en/ai-3.8-15.1.x/administration/pages/F0FPs9XCU262AhQbAgXx#id-5.-worker-storage-the-page-store). This section covers the storage backend choice, capacity sizing, and disk layout.

### Configuring Page Store Location

{% tabs %}
{% tab title="Kubernetes (Operator)" %}
The Operator supports two page store backends.

**Default (hostPath):** The worker writes cache to the node's filesystem at `/mnt/alluxio/pagestore`.

```yaml
spec:
  worker:
    pagestore:
      # Defaults to hostPath: /mnt/alluxio/pagestore on the node's filesystem.
      size: 100Gi
      reservedSize: 10Gi
```

**PVC-backed:** To persist worker cache data across pod restarts or rescheduling, specify a PersistentVolumeClaim (PVC) for the page store.

```yaml
spec:
  worker:
    pagestore:
      type: persistentVolumeClaim
      storageClass: ""    # if omitted, defaults to "standard"; an empty string selects static binding
      size: 100Gi
      reservedSize: 10Gi
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}
Set the page store directory and size via Alluxio properties. On Docker, pass them through `ALLUXIO_JAVA_OPTS` and mount the host directory as a volume so the cache survives container recreation:

```shell
docker run ... \
  -v /data/alluxio-cache:/data/alluxio-cache \
  -e ALLUXIO_JAVA_OPTS=" \
    -Dalluxio.worker.page.store.dirs=/data/alluxio-cache \
    -Dalluxio.worker.page.store.sizes=<CACHE_SIZE>" \
  alluxio/alluxio-enterprise worker
```

On bare-metal, set the same properties in `conf/alluxio-site.properties`:

```properties
alluxio.worker.page.store.dirs=/data/alluxio-cache
alluxio.worker.page.store.sizes=<CACHE_SIZE>
```

{% endtab %}
{% endtabs %}

### Sizing the Page Store

* `size`: Per-worker cache capacity. Must not exceed available disk on the worker node.
* `reservedSize`: Space reserved for internal operations (temporary page writes, metadata caching). Set to \~10% of `size`, typically 10–100 GiB.
* Ensure `size + reservedSize ≤ available disk space`.

Cloud providers advertise disk size in GB (base-10), while Kubernetes `Gi` is base-2, so a "1000 GB" EBS volume provides only \~931 GiB. Set `size` to \~90% of the space actually available (check with `df -h <page-store-path>`) to leave room for filesystem overhead and `reservedSize`. If `size` exceeds the disk's capacity, the worker crashes with `quota (NNN) exceeds the total disk space`.
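The base-10 versus base-2 arithmetic above can be sketched in Python. This is an illustrative helper, not part of Alluxio: the function names `gb_to_gib` and `suggest_page_store` are hypothetical, and the default fractions (90% of available space for `size`, 10% of `size` for `reservedSize`) are simply the rules of thumb from this section.

```python
GIB = 2**30  # bytes in one GiB (base-2)
GB = 10**9   # bytes in one GB (base-10)


def gb_to_gib(gb: float) -> float:
    """Convert an advertised base-10 GB figure to base-2 GiB."""
    return gb * GB / GIB


def suggest_page_store(available_gib: float,
                       size_fraction: float = 0.9,
                       reserved_fraction: float = 0.1) -> tuple[int, int]:
    """Return (size, reservedSize) in whole GiB for the given available space.

    Hypothetical helper: the fractions are the rules of thumb from this
    section, not values enforced by Alluxio.
    """
    size = int(available_gib * size_fraction)
    reserved = int(size * reserved_fraction)
    # Sizing rule from this section: size + reservedSize must fit on the disk.
    assert size + reserved <= available_gib
    return size, reserved


available = gb_to_gib(1000)  # a "1000 GB" volume
size, reserved = suggest_page_store(available)
print(f"available ~{available:.0f} GiB -> size: {size}Gi, reservedSize: {reserved}Gi")
```

In practice, start from the figure reported by `df` rather than the advertised size, since filesystem overhead reduces the usable space further.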

### Multi-Disk Configuration

For nodes with multiple data disks, configure the page store to span all of them — this distributes page I/O across disks and increases aggregate throughput. Use comma-separated paths and sizes:

```yaml
spec:
  worker:
    pagestore:
      hostPath: /mnt/disk1/alluxio/pagestore,/mnt/disk2/alluxio/pagestore
      size: 800Gi,800Gi
      reservedSize: 100Gi
```

Each directory must map to a separate physical disk — colocating multiple page store directories on the same disk provides no benefit.

An alternative is to use RAID 0 at the OS level to present multiple disks as a single logical volume, then configure a single `hostPath`. This simplifies the Alluxio configuration but couples the lifetime of all disks — a single disk failure loses the entire array.
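The separate-disk requirement above can be checked approximately before deploying. The sketch below is a hypothetical helper, not an Alluxio tool. It compares the filesystem device ID (`st_dev`) of each page store directory, which catches directories placed on the same mounted filesystem but cannot detect two partitions that sit on one physical disk.

```python
import os
from collections import defaultdict


def dirs_sharing_a_device(paths):
    """Group page store directories that live on the same filesystem device.

    Returns a list of groups (each a list of paths) with more than one member.
    An empty result means every directory is on a distinct filesystem.
    """
    by_device = defaultdict(list)
    for path in paths:
        # st_dev identifies the device of the filesystem that holds the path.
        by_device[os.stat(path).st_dev].append(path)
    return [group for group in by_device.values() if len(group) > 1]


if __name__ == "__main__":
    # Paths taken from the multi-disk example in this section.
    candidates = ["/mnt/disk1/alluxio/pagestore", "/mnt/disk2/alluxio/pagestore"]
    existing = [p for p in candidates if os.path.isdir(p)]
    for group in dirs_sharing_a_device(existing):
        print("These directories share one filesystem:", ", ".join(group))
```

Run it on the worker node itself, since the device IDs are only meaningful on the host where the disks are mounted.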

### Heterogeneous Workers

The `workerGroups` mechanism described in this section is specific to the Kubernetes Operator. On Docker/bare-metal deployments, heterogeneous workers are achieved by running each worker with its own `alluxio-site.properties` file that specifies the per-node page store paths and sizes.

When a cluster has workers with different disk specifications (e.g., one group with a 1 TB disk, another with two 800 GB disks), use `workerGroups` to define distinct configurations per group.

**Step 1: Group and label nodes.**

```shell
# Label nodes with one disk
kubectl label nodes <node-name> apps.alluxio.com/disks=1
# Label nodes with two disks
kubectl label nodes <node-name> apps.alluxio.com/disks=2
```

**Step 2: Define worker groups and enable capacity-aware hash ring.**

For heterogeneous clusters, set `alluxio.user.worker.selection.policy.consistent.hash.provider.impl` to `CAPACITY` so workers with more storage receive a proportionally larger share of data. For details on this property, see [Optimizing for Heterogeneous Workers](/ee-ai-en/ai-3.8-15.1.x/administration/managing-ring.md#optimizing-for-heterogeneous-workers).

```yaml
apiVersion: k8s-operator.alluxio.com/v1
kind: AlluxioCluster
spec:
  properties:
    alluxio.user.worker.selection.policy.consistent.hash.provider.impl: CAPACITY

  # Common configurations for all workers
  worker:
    resources:
      limits:
        memory: 40Gi
      requests:
        memory: 36Gi
    jvmOptions: ["-Xmx20g", "-Xms20g", "-XX:MaxDirectMemorySize=16g"]

  # Define specific configurations for each worker group
  workerGroups:
  - worker:
      count: 10
      nodeSelector:
        apps.alluxio.com/disks: "1"    # label values are strings, so quote the number
      pagestore:
        hostPath: /mnt/disk1/alluxio/pagestore
        size: 1Ti
        reservedSize: 100Gi
  - worker:
      count: 12
      nodeSelector:
        apps.alluxio.com/disks: "2"    # label values are strings, so quote the number
      pagestore:
        hostPath: /mnt/disk1/alluxio/pagestore,/mnt/disk2/alluxio/pagestore
        size: 800Gi,800Gi
        reservedSize: 100Gi
```

{% hint style="info" %}
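Worker groups add flexibility, but every node in a group must match that group's configuration: each node selected by the group's `nodeSelector` needs the disks and mount paths listed under its `pagestore`. A mismatch can lead to unexpected errors.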
{% endhint %}
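To make "a proportionally larger share" concrete, the sketch below computes the idealized share each worker would receive if data were split exactly in proportion to capacity. It is plain arithmetic over the capacities from the example above, not a model of how Alluxio builds its hash ring. The function name `expected_shares` and the worker names are hypothetical.

```python
def expected_shares(worker_capacities_gib: dict[str, int]) -> dict[str, float]:
    """Idealized fraction of data per worker, proportional to its capacity."""
    total = sum(worker_capacities_gib.values())
    return {name: cap / total for name, cap in worker_capacities_gib.items()}


# Capacities from the example: 10 single-disk workers at 1 TiB (1024 GiB) each,
# and 12 two-disk workers at 800 + 800 = 1600 GiB each.
capacities = {f"one-disk-{i}": 1024 for i in range(10)}
capacities.update({f"two-disk-{i}": 1600 for i in range(12)})

shares = expected_shares(capacities)
print(f"one-disk worker: {shares['one-disk-0']:.2%} of data")
print(f"two-disk worker: {shares['two-disk-0']:.2%} of data")
```

Under this idealization a two-disk worker holds 1600/1024, or about 1.56 times, the data of a single-disk worker. Actual placement depends on the hash ring implementation.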

## 2. Resource and JVM Tuning

{% tabs %}
{% tab title="Kubernetes (Operator)" %}
Configure per-component resource limits and JVM options in `alluxio-cluster.yaml`:

```yaml
apiVersion: k8s-operator.alluxio.com/v1
kind: AlluxioCluster
spec:
  worker:
    count: 2
    resources:
      # For production workers, set requests equal to limits so the pod
      # runs in the Guaranteed QoS class and is the last to be evicted
      # under node pressure.
      limits:
        cpu: "12"
        memory: "36Gi"
      requests:
        cpu: "12"
        memory: "36Gi"
    jvmOptions:
      - "-Xmx22g"
      - "-Xms22g"
      - "-XX:MaxDirectMemorySize=10g"
  coordinator:
    resources:
      # Coordinator should also run as Guaranteed QoS — set requests == limits.
      limits:
        cpu: "12"
        memory: "36Gi"
      requests:
        cpu: "12"
        memory: "36Gi"
    jvmOptions:
      - "-Xmx4g"
      - "-Xms1g"
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}
On Docker, set JVM options through `ALLUXIO_JAVA_OPTS` and the container resource limits via `--memory` and `--cpus`:

```shell
docker run ... \
  --memory=36g \
  --cpus=12 \
  -e ALLUXIO_JAVA_OPTS=" \
    -Xmx22g -Xms22g -XX:MaxDirectMemorySize=10g" \
  alluxio/alluxio-enterprise worker
```

On bare-metal, set JVM options in `conf/alluxio-env.sh`:

```shell
ALLUXIO_WORKER_JAVA_OPTS+=" -Xmx22g -Xms22g -XX:MaxDirectMemorySize=10g"
ALLUXIO_COORDINATOR_JAVA_OPTS+=" -Xmx4g -Xms1g"
```

{% endtab %}
{% endtabs %}

### Memory Limit Formula

```
memory limit ≥ -Xmx + -XX:MaxDirectMemorySize + 2–4 GiB (JVM overhead)
```

For the worker config above (`-Xmx22g`, `-XX:MaxDirectMemorySize=10g`), the minimum limit is 22 + 10 + 2 = 34 GiB. The example sets 36 GiB, which also covers the upper end of the overhead range (22 + 10 + 4 = 36 GiB).

{% hint style="info" %}
If `-XX:MaxDirectMemorySize` is omitted, the JVM defaults it to the same value as `-Xmx`, so the container limit typically needs to be 2.5× `-Xmx` or more.
{% endhint %}
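The formula can be applied mechanically to a list of JVM options. The following is a minimal sketch, assuming sizes are written with a `g` or `m` suffix as in the examples on this page. The function name `min_memory_limit_gib` is hypothetical, and the fallback for a missing `-XX:MaxDirectMemorySize` follows the hint above.

```python
import re

# Only the suffixes used on this page are handled (assumption of this sketch).
_UNIT_TO_GIB = {"g": 1.0, "m": 1 / 1024}


def _parse_gib(options, pattern):
    """Return the size in GiB of the first option matching pattern, else None."""
    for opt in options:
        match = re.fullmatch(pattern, opt)
        if match:
            return int(match.group(1)) * _UNIT_TO_GIB[match.group(2).lower()]
    return None


def min_memory_limit_gib(jvm_options, overhead_gib=2):
    """Apply: limit >= -Xmx + -XX:MaxDirectMemorySize + JVM overhead."""
    heap = _parse_gib(jvm_options, r"-Xmx(\d+)([gGmM])")
    if heap is None:
        raise ValueError("-Xmx is required to apply the formula")
    direct = _parse_gib(jvm_options, r"-XX:MaxDirectMemorySize=(\d+)([gGmM])")
    if direct is None:
        direct = heap  # per the hint above: defaults to the same value as -Xmx
    return heap + direct + overhead_gib


worker_opts = ["-Xmx22g", "-Xms22g", "-XX:MaxDirectMemorySize=10g"]
print(min_memory_limit_gib(worker_opts))                  # lower bound, 2 GiB overhead
print(min_memory_limit_gib(worker_opts, overhead_gib=4))  # upper bound, 4 GiB overhead
```

Passing `overhead_gib=4` gives the conservative end of the 2–4 GiB range, which is the safer choice when sizing production limits.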

### Diagnosing OOM

If a worker is killed due to OOM (exit code 137), confirm the cause using the commands below.

{% tabs %}
{% tab title="Kubernetes (Operator)" %}

```shell
# Check pod events
kubectl describe pod <worker-pod-name> -n alx-ns
# Look for "OOMKilled" or "Exit Code: 137"

# Check the previous container's logs (before crash)
kubectl logs <worker-pod-name> -n alx-ns --previous | tail -n 50
# Look for "OutOfMemoryError" or "java.lang.OutOfMemoryError"
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}

```shell
# Docker: inspect the exited container
docker inspect <container-id> --format='{{.State.OOMKilled}} {{.State.ExitCode}}'
# "true 137" indicates OOM

# Tail worker logs for OutOfMemoryError
tail -n 100 /opt/alluxio/logs/worker.log | grep -i OutOfMemoryError
```

{% endtab %}
{% endtabs %}

| Symptom                                            | Root Cause                                                | Fix                                                                      |
| -------------------------------------------------- | --------------------------------------------------------- | ------------------------------------------------------------------------ |
| `Exit Code 137`, no Java error                     | Container limit exceeded; killed by the Linux OOM killer  | Increase `resources.limits.memory` (Kubernetes) or `--memory` (Docker)   |
| `java.lang.OutOfMemoryError: Java heap space`      | `-Xmx` too small                                          | Increase `-Xmx` and raise container limit accordingly                    |
| `java.lang.OutOfMemoryError: Direct buffer memory` | `-XX:MaxDirectMemorySize` too small                       | Increase `-XX:MaxDirectMemorySize` and raise container limit accordingly |

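The table can be read as a small decision procedure: check the logs for a Java error first, then fall back to the exit code. The sketch below encodes that order. The function name `diagnose_oom` is hypothetical, and the returned strings paraphrase the "Fix" column rather than quoting any Alluxio output.

```python
def diagnose_oom(exit_code: int, log_text: str) -> str:
    """Map an exit code and worker log excerpt to the fix suggested in the table."""
    # A Java-level error identifies which JVM memory region was exhausted.
    if "java.lang.OutOfMemoryError: Java heap space" in log_text:
        return "Increase -Xmx and raise the container limit accordingly"
    if "java.lang.OutOfMemoryError: Direct buffer memory" in log_text:
        return "Increase -XX:MaxDirectMemorySize and raise the container limit accordingly"
    # Exit code 137 with no Java error: the kernel OOM killer ended the process.
    if exit_code == 137:
        return "Container limit exceeded: increase the container memory limit"
    return "Not an OOM pattern covered by the table"


print(diagnose_oom(137, ""))
print(diagnose_oom(1, "java.lang.OutOfMemoryError: Direct buffer memory"))
```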
## 3. Worker Network Configuration

### Binding the Worker to a Specific NIC

To bind the worker to a specific local NIC (and its associated IP), set the bind device for each service. The example uses `NIC1` as a placeholder for the interface name:

```properties
alluxio.worker.rpc.bind.device=NIC1
alluxio.worker.web.bind.device=NIC1
alluxio.worker.data.bind.device=NIC1
```


