Worker Configuration

This guide covers configuration for individual Alluxio workers: storage backends, capacity sizing, resource limits, JVM tuning, and network binding. For hash-ring-related worker operations (adding, removing, restarting, identity persistence), see Hash Ring and Worker Lifecycle. For cluster-wide operations (scaling, upgrades, coordinator, UFS, multi-tenancy), see Cluster Management.

1. Worker Storage

Each worker caches data in a local page store. This section covers the storage backend choice, capacity sizing, and disk layout.

Configuring Page Store Location

The Operator supports two page store backends.

Default (hostPath): The worker writes cache to the node's filesystem at /mnt/alluxio/pagestore.

spec:
  worker:
    pagestore:
      # Defaults to hostPath: /mnt/alluxio/pagestore on the node's filesystem.
      size: 100Gi
      reservedSize: 10Gi

PVC-backed: To persist worker cache data across pod restarts or rescheduling, specify a PersistentVolumeClaim (PVC) for the page store.

spec:
  worker:
    pagestore:
      type: persistentVolumeClaim
      storageClass: ""    # defaults to "standard"; empty string = static binding
      size: 100Gi
      reservedSize: 10Gi

Sizing the Page Store

  • size: Per-worker cache capacity. Must not exceed available disk on the worker node.

  • reservedSize: Space reserved for internal operations (temporary page writes, metadata caching). Set to ~10% of size, typically 10–100 GiB.

  • Ensure size + reservedSize ≤ available disk space.

Cloud providers advertise disk size in GB (base-10), while Kubernetes Gi is base-2, so a "1000 GB" EBS volume provides only ~931 GiB. Set size to ~90% of the space actually available (check with df -h <page-store-path>) to leave room for filesystem overhead and reservedSize. If size is too large, the worker crashes with an error of the form quota (NNN) exceeds the total disk space.
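The unit conversion and the sizing rules above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of Alluxio: the function names are hypothetical, and the 90% and 10% fractions are simply the rules of thumb stated in this section.

```python
GB = 10**9    # bytes in one gigabyte (base-10, as advertised by cloud providers)
GIB = 2**30   # bytes in one gibibyte (base-2, the Kubernetes "Gi" unit)


def gb_to_gib(gb: float) -> float:
    """Convert an advertised base-10 size in GB to GiB."""
    return gb * GB / GIB


def suggest_pagestore(available_gib: float,
                      size_fraction: float = 0.9,
                      reserved_fraction: float = 0.1):
    """Return (size, reservedSize) in whole GiB for the given available space.

    size is ~90% of the available space and reservedSize is ~10% of size,
    both rounded down so that size + reservedSize never exceeds the disk.
    """
    size = int(available_gib * size_fraction)
    reserved = int(size * reserved_fraction)
    assert size + reserved <= available_gib
    return size, reserved


if __name__ == "__main__":
    available = gb_to_gib(1000)          # a "1000 GB" volume
    print(round(available, 1))           # 931.3
    print(suggest_pagestore(available))  # (838, 83)
```

In practice, pass the figure reported by df rather than the raw volume size, since the filesystem itself consumes part of the disk.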

Multi-Disk Configuration

For nodes with multiple data disks, configure the page store to span all of them — this distributes page I/O across disks and increases aggregate throughput. Use comma-separated paths and sizes:

Each directory must map to a separate physical disk — colocating multiple page store directories on the same disk provides no benefit.

An alternative is to use RAID 0 at the OS level to present multiple disks as a single logical volume, then configure a single hostPath. This simplifies the Alluxio configuration but couples the lifetime of all disks — a single disk failure loses the entire array.

Heterogeneous Workers

The workerGroups mechanism described in this section is specific to the Kubernetes Operator. On Docker/bare-metal deployments, heterogeneous workers are achieved by running each worker with its own alluxio-site.properties file that specifies the per-node page store paths and sizes.

When a cluster has workers with different disk specifications (e.g., one group with a 1 TB disk, another with two 800 GB disks), use workerGroups to define distinct configurations per group.

Step 1: Group and label nodes:

Step 2: Define worker groups and enable the capacity-aware hash ring.

For heterogeneous clusters, set alluxio.user.worker.selection.policy.consistent.hash.provider.impl to CAPACITY so workers with more storage receive a proportionally larger share of data. For details on this property, see Optimizing for Heterogeneous Workers.

Worker groups provide flexibility, but the configuration must stay consistent within each group. A misconfigured group can lead to unexpected errors.
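To get a feel for what "a proportionally larger share" means under the CAPACITY setting, the sketch below computes the share of data each worker would be expected to receive if shares are exactly proportional to capacity. It is an illustration of the proportionality only, not Alluxio's hash ring implementation; the function and worker names are hypothetical, and the capacities are the ones from the example above (one 1 TB disk versus two 800 GB disks).

```python
from typing import Dict


def expected_shares(capacities_gb: Dict[str, float]) -> Dict[str, float]:
    """Return each worker's expected fraction of data, proportional to capacity."""
    total = sum(capacities_gb.values())
    return {worker: cap / total for worker, cap in capacities_gb.items()}


if __name__ == "__main__":
    # Hypothetical workers: one with a single 1 TB disk, one with two 800 GB disks.
    shares = expected_shares({"worker-a": 1000, "worker-b": 2 * 800})
    for worker, share in shares.items():
        print(f"{worker}: {share:.1%}")
    # worker-a: 38.5%
    # worker-b: 61.5%
```

Without capacity awareness, each of the two workers would be expected to receive about half of the data, which would fill the smaller worker first.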

2. Resource and JVM Tuning

Configure per-component resource limits and JVM options in alluxio-cluster.yaml:

Memory Limit Formula

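The container memory limit must cover the JVM heap, direct memory, and some additional headroom. For the worker config above (-Xmx22g, -XX:MaxDirectMemorySize=10g), the minimum limit is 22 + 10 + 2 = 34 GiB (heap + direct memory + 2 GiB of headroom); the example sets it to 36 GiB.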

If -XX:MaxDirectMemorySize is omitted, the JVM defaults it to the same value as -Xmx, so the container limit typically needs to be 2.5× -Xmx or more.
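The arithmetic above can be captured in a small helper. This is a minimal sketch with a hypothetical function name; the fixed 2 GiB term is taken from the worked example (22 + 10 + 2), and treating it as a constant overhead is an assumption.

```python
from typing import Optional


def min_container_limit_gib(xmx_gib: int,
                            max_direct_gib: Optional[int] = None,
                            overhead_gib: int = 2) -> int:
    """Lower bound for resources.limits.memory, in GiB.

    If max_direct_gib is None, direct memory is assumed to default to the
    same value as -Xmx, as described in the text.
    """
    if max_direct_gib is None:
        max_direct_gib = xmx_gib
    return xmx_gib + max_direct_gib + overhead_gib


if __name__ == "__main__":
    print(min_container_limit_gib(22, 10))  # 34
    print(min_container_limit_gib(22))      # 46
```

The result is a floor, not a recommendation: the example rounds 34 GiB up to 36 GiB, and the 2.5× guidance for the case without -XX:MaxDirectMemorySize is more conservative than the 46 GiB computed here.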

Diagnosing OOM

If a worker is killed due to OOM (exit code 137), confirm the cause using the commands below.

| Symptom | Root Cause | Fix |
| --- | --- | --- |
| Exit Code 137, no Java error | Container limit exceeded; killed by Linux OOM killer | Increase resources.limits.memory |
| java.lang.OutOfMemoryError: Java heap space | -Xmx too small | Increase -Xmx and raise container limit accordingly |
| java.lang.OutOfMemoryError: Direct buffer memory | -XX:MaxDirectMemorySize too small | Increase -XX:MaxDirectMemorySize and raise container limit accordingly |
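The symptom table can also be expressed as a small triage helper for scripted log checks. This is a hypothetical sketch, not an Alluxio tool: it assumes you have already collected the container's exit code and the worker log text yourself, and it only recognizes the three symptoms listed above.

```python
def diagnose_oom(exit_code: int, log_text: str) -> str:
    """Map an exit code and worker log text to the fix from the symptom table."""
    # Java-level errors are checked first: they identify the exhausted region.
    if "java.lang.OutOfMemoryError: Java heap space" in log_text:
        return "Increase -Xmx and raise container limit accordingly"
    if "java.lang.OutOfMemoryError: Direct buffer memory" in log_text:
        return ("Increase -XX:MaxDirectMemorySize and raise container "
                "limit accordingly")
    # Exit code 137 with no Java error: the container limit was exceeded.
    if exit_code == 137:
        return "Increase resources.limits.memory"
    return "No matching symptom in the table"


if __name__ == "__main__":
    print(diagnose_oom(137, ""))
    print(diagnose_oom(1, "java.lang.OutOfMemoryError: Java heap space"))
```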

3. Worker Network Configuration

Binding the Worker to a Specific NIC

To bind the worker to a specific local NIC (and its associated IP), set the bind device for each service (example uses NIC1):
