Worker Configuration

This guide covers configuration for individual Alluxio workers: storage backends, capacity sizing, resource limits, JVM tuning, and network binding. For hash-ring-related worker operations (adding, removing, restarting, identity persistence), see Hash Ring and Worker Lifecycle. For cluster-wide operations (scaling, upgrades, coordinator, UFS, multi-tenancy), see Cluster Management.

1. Worker Storage

Each worker caches data in a local page store. This section covers the storage backend choice, capacity sizing, and disk layout.

Configuring Page Store Location

The Operator supports two page store backends.

Default (hostPath): The worker writes cache to the node's filesystem at /mnt/alluxio/pagestore.

spec:
  worker:
    pagestore:
      # Defaults to hostPath: /mnt/alluxio/pagestore on the node's filesystem.
      size: 100Gi
      reservedSize: 10Gi

PVC-backed: To persist worker cache data across pod restarts or rescheduling, specify a PersistentVolumeClaim (PVC) for the page store.

spec:
  worker:
    pagestore:
      type: persistentVolumeClaim
      storageClass: ""    # defaults to "standard"; empty string = static binding
      size: 100Gi
      reservedSize: 10Gi

Sizing the Page Store

  • size: Per-worker cache capacity. Must not exceed available disk on the worker node.

  • reservedSize: Space reserved for internal operations (temporary page writes, metadata caching). Set to ~10% of size, typically 10–100 GiB.

  • Ensure size + reservedSize ≤ available disk space.

Cloud providers advertise disk size in GB (base-10), while Kubernetes Gi is base-2. A "1000 GB" EBS volume provides ~931 GiB. Set size to ~90% of actual available space (check with df -h <page-store-path>) to leave room for filesystem overhead and reservedSize. A too-large size causes workers to crash with quota (NNN) exceeds the total disk space.

Multi-Disk Configuration

For nodes with multiple data disks, configure the page store to span all of them — this distributes page I/O across disks and increases aggregate throughput. Use comma-separated paths and sizes:

Each directory must map to a separate physical disk — colocating multiple page store directories on the same disk provides no benefit.

An alternative is to use RAID 0 at the OS level to present multiple disks as a single logical volume, then configure a single hostPath. This simplifies the Alluxio configuration but couples the lifetime of all disks — a single disk failure loses the entire array.

Heterogeneous Workers

The workerGroups mechanism described in this section is specific to the Kubernetes Operator. On Docker/bare-metal deployments, heterogeneous workers are achieved by running each worker with its own alluxio-site.properties file that specifies the per-node page store paths and sizes.

When a cluster has workers with different disk specifications (e.g., one group with a 1 TB disk, another with two 800 GB disks), use workerGroups to define distinct configurations per group.

Step 1: Group and label nodes:

Step 2: Define worker groups and enable capacity-aware hash ring.

For heterogeneous clusters, set alluxio.user.worker.selection.policy.consistent.hash.provider.impl to CAPACITY so workers with more storage receive a proportionally larger share of data. For details on this property, see Optimizing for Heterogeneous Workers.

circle-info

While this provides flexibility, it is crucial to ensure consistency within each worker group. Misconfigurations can lead to unexpected errors.

2. Resource and JVM Tuning

Configure per-component resource limits and JVM options in alluxio-cluster.yaml:

Memory Limit Formula

For the worker config above (-Xmx22g, -XX:MaxDirectMemorySize=10g): minimum limit is 22 + 10 + 2 = 34 GiB, set to 36 GiB in the example.

circle-info

If -XX:MaxDirectMemorySize is omitted, the JVM defaults it to the same value as -Xmx, so the container limit typically needs to be 2.5× -Xmx or more.

Diagnosing OOM

If a worker is killed due to OOM (exit code 137), confirm the cause using the commands below.

Symptom
Root Cause
Fix

Exit Code 137, no Java error

Container limit exceeded — killed by Linux OOM killer

Increase resources.limits.memory

java.lang.OutOfMemoryError: Java heap space

-Xmx too small

Increase -Xmx and raise container limit accordingly

java.lang.OutOfMemoryError: Direct buffer memory

-XX:MaxDirectMemorySize too small

Increase -XX:MaxDirectMemorySize and raise container limit accordingly

3. Worker Network Configuration

Binding the Worker to a Specific NIC

To bind the worker to a specific local NIC (and its associated IP), set the bind device for each service (example uses NIC1):

Last updated