Hash Ring and Worker Lifecycle
This guide covers the consistent hash ring — its configuration, worker lifecycle operations, and diagnostic procedures. For per-worker configuration (storage, resources, JVM, network), see Worker Configuration. For cluster-wide operations (scaling, upgrades, coordinator), see Cluster Management.
Hash ring settings should be defined during the initial cluster setup. Modifying these configurations on a running cluster is a destructive operation that will cause all cached data to be lost, as it changes how data is mapped to workers.
Alluxio uses a consistent hash ring to map data to workers in a decentralized manner. You can fine-tune its behavior to optimize for different cluster environments and workloads.
1. Pre-Deployment Configuration
Set the following properties before first deployment:
Property: alluxio.user.dynamic.consistent.hash.ring.enabled
Default: true (dynamic)
When to change: Set to false (static) only if you need a stable ring view despite temporary worker unavailability. See Configuring the Hash Ring Mode.

Property: alluxio.user.worker.selection.policy.consistent.hash.virtual.node.count.per.worker
Default: 2000
When to change: Rarely; only for very small or heavily imbalanced clusters. See Adjusting Virtual Nodes for Load Balancing.

Property: alluxio.user.worker.selection.policy.consistent.hash.provider.impl
Default: DEFAULT
When to change: Set to CAPACITY for heterogeneous worker clusters. See Optimizing for Heterogeneous Workers.
Set under .spec.properties in alluxio-cluster.yaml:
spec:
  properties:
    alluxio.user.dynamic.consistent.hash.ring.enabled: "true"
    alluxio.user.worker.selection.policy.consistent.hash.provider.impl: DEFAULT
Set in conf/alluxio-site.properties (applies to client, worker, and coordinator processes):
alluxio.user.dynamic.consistent.hash.ring.enabled=true
alluxio.user.worker.selection.policy.consistent.hash.provider.impl=DEFAULT
2. Hash Ring Configuration
Configuring the Hash Ring Mode
The consistent hash ring can operate in two modes: dynamic (default) or static.
In dynamic mode (default), the hash ring includes only online workers. When a worker goes offline, it is removed from the ring after the liveness timeout, and its virtual nodes are redistributed to live workers. This is the best choice for most deployments — the ring adapts automatically to permanent topology changes such as scaling or node replacement.
In static mode, the hash ring retains all registered workers regardless of their online status. This mode is designed for planned short-term maintenance — rolling software upgrades, hardware servicing, or brief restarts where workers are expected to rejoin quickly. By keeping offline workers in the ring, Alluxio avoids resharding cached data: when the worker comes back, it reclaims its original ring slots and all its cached data is immediately routable again. The trade-off is that during the downtime window, requests that hash to the offline worker fall through to UFS.
To configure the mode, set the alluxio.user.dynamic.consistent.hash.ring.enabled property. Set it to true for dynamic mode (the default) or false for static mode.
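The difference between the two modes can be illustrated with a small simulation. This is a minimal Python sketch, not Alluxio's implementation: the MD5-based hash, the worker names, the 100 virtual nodes per worker, and the helper names (ring_hash, build_ring, lookup, route) are all assumptions chosen for illustration.

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    """Map a string to a 64-bit ring position (MD5 is an illustrative choice)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

def build_ring(workers, vnodes=100):
    """Place `vnodes` virtual nodes per worker; return sorted positions and their owners."""
    points = sorted((ring_hash(f"{w}#{i}"), w) for w in workers for i in range(vnodes))
    return [p for p, _ in points], [w for _, w in points]

def lookup(ring, key):
    """Return the owner of the first virtual node clockwise from the key."""
    positions, owners = ring
    return owners[bisect.bisect_right(positions, ring_hash(key)) % len(positions)]

registered = ["worker-a", "worker-b", "worker-c"]  # hypothetical worker names
online = {"worker-a", "worker-c"}                  # worker-b is temporarily offline

dynamic_ring = build_ring([w for w in registered if w in online])  # online workers only
static_ring = build_ring(registered)                               # all registered workers

def route(ring, key):
    """Return the worker that serves the key, or "UFS" if the owner is offline."""
    owner = lookup(ring, key)
    return owner if owner in online else "UFS"

keys = [f"/data/file-{i}" for i in range(1000)]
dynamic_targets = [route(dynamic_ring, k) for k in keys]
static_targets = [route(static_ring, k) for k in keys]

print("dynamic -> UFS:", dynamic_targets.count("UFS"))  # always 0: every owner is online
print("static  -> UFS:", static_targets.count("UFS"))   # keys owned by worker-b
```

In this sketch, keys owned by worker-a or worker-c resolve to the same worker in both rings; only the keys owned by worker-b differ. Dynamic mode reassigns them to a live worker, while static mode sends them to UFS until worker-b returns.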
Adjusting Virtual Nodes for Load Balancing
To ensure an even distribution of data and I/O requests, Alluxio uses virtual nodes. Each worker is mapped to multiple virtual nodes on the hash ring, which helps to balance the load more effectively across the cluster.
You can adjust the number of virtual nodes per worker by configuring the alluxio.user.worker.selection.policy.consistent.hash.virtual.node.count.per.worker property (default: 2000). Adjusting this value can help fine-tune load distribution, especially in clusters with diverse workloads or a small number of workers.
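To see why the virtual node count affects balance, the following Python sketch measures how evenly keys spread across four workers for several counts. It is an illustration under assumptions, not Alluxio's code: the MD5-based hash, the worker names, the synthetic key paths, and the helper names (ring_hash, key_shares) are hypothetical.

```python
import bisect
import hashlib
from collections import Counter

def ring_hash(value: str) -> int:
    """Map a string to a 64-bit ring position (MD5 is an illustrative choice)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

def key_shares(workers, vnodes, n_keys=20000):
    """Return the fraction of synthetic keys that each worker owns."""
    points = sorted((ring_hash(f"{w}#{i}"), w) for w in workers for i in range(vnodes))
    positions = [p for p, _ in points]
    owners = [w for _, w in points]
    counts = Counter()
    for k in range(n_keys):
        # The first virtual node clockwise from the key owns it.
        idx = bisect.bisect_right(positions, ring_hash(f"/data/file-{k}")) % len(positions)
        counts[owners[idx]] += 1
    return {w: counts[w] / n_keys for w in workers}

workers = [f"worker-{i}" for i in range(4)]  # hypothetical worker names
for vnodes in (1, 10, 2000):
    shares = key_shares(workers, vnodes)
    print(f"{vnodes:>5} vnodes/worker:", {w: round(s, 3) for w, s in shares.items()})
```

An ideal split gives each of the four workers 0.25 of the keys. The printed shares typically move closer to that value as the virtual node count grows, which is the motivation for a large default.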
Optimizing for Heterogeneous Workers
By default, the consistent hashing algorithm assumes that all workers have equal capacity. In clusters with heterogeneous workers (e.g., different storage capacities or network speeds), you can enable capacity-based allocation for more balanced resource utilization. This ensures that workers with more storage handle a proportionally larger share of data.
To enable this, set the alluxio.user.worker.selection.policy.consistent.hash.provider.impl property to CAPACITY. The default value is DEFAULT, which allocates an equal number of virtual nodes to each worker.
For the worker-side YAML configuration (labeling nodes, workerGroups), see Heterogeneous Workers.
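One way to picture capacity-based allocation is to give each worker virtual nodes in proportion to its storage. The Python sketch below assumes that proportional rule; the actual behavior of the CAPACITY provider may differ in detail. The worker names, capacities, vnodes_per_tb value, and helper names are hypothetical.

```python
import bisect
import hashlib
from collections import Counter

def ring_hash(value: str) -> int:
    """Map a string to a 64-bit ring position (MD5 is an illustrative choice)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

def weighted_shares(capacity_tb, vnodes_per_tb=500, n_keys=20000):
    """Allocate virtual nodes proportionally to capacity, then measure key shares."""
    points = sorted(
        (ring_hash(f"{w}#{i}"), w)
        for w, tb in capacity_tb.items()
        for i in range(tb * vnodes_per_tb)  # assumption: vnodes scale linearly with TB
    )
    positions = [p for p, _ in points]
    owners = [w for _, w in points]
    counts = Counter()
    for k in range(n_keys):
        idx = bisect.bisect_right(positions, ring_hash(f"/data/file-{k}")) % len(positions)
        counts[owners[idx]] += 1
    return {w: counts[w] / n_keys for w in capacity_tb}

capacity_tb = {"worker-small": 1, "worker-large": 3}  # hypothetical capacities in TB
shares = weighted_shares(capacity_tb)
print(shares)  # worker-large is expected to receive roughly three times as many keys
```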
Worker Liveness Detection
Each worker maintains active communication with etcd. If a worker fails to communicate with etcd within the timeout period set by alluxio.worker.failure.detection.timeout, it is considered offline.
In dynamic mode, an OFFLINE worker's virtual nodes are removed from the hash ring after this timeout and redistributed to live workers. In static mode, the OFFLINE entry remains in the ring — requests hashing to it fall through to UFS until the worker rejoins or is explicitly removed.
Client Worker List Refresh
Clients maintain a local snapshot of the worker list and refresh it from etcd at a configurable interval.
After adding or removing workers, clients will reflect the change within one refresh interval. To force immediate propagation during incident recovery, restart the client-side process or reduce this interval temporarily.
3. Worker Lifecycle on the Ring
Alluxio's decentralized architecture relies on workers that are managed via a consistent hash ring. This section covers operational procedures for workers joining, leaving, and restarting on the ring.
Checking Worker Status
To see a list of all registered workers and their current status, run alluxio info nodes.
Diagnosing Hash Ring Bloat from OFFLINE Entries
A healthy cluster should show one ONLINE entry per configured worker. If alluxio info nodes shows more total entries than expected, or the ONLINE count is lower than the worker count, stale OFFLINE entries are accumulating in etcd.
Impact (static mode): In static mode (alluxio.user.dynamic.consistent.hash.ring.enabled=false), OFFLINE entries remain in the ring indefinitely. With stale entries present, a proportional fraction of hash lookups land on OFFLINE nodes and fall through to UFS — turning cache hits into direct S3/GCS reads at native object-store speeds.
In dynamic mode (default), OFFLINE entries are automatically removed from the ring after alluxio.worker.failure.detection.timeout (see Worker Liveness Detection), so ring bloat does not persist. However, each restart with a new UUID still remaps virtual nodes, making previously cached data temporarily unreachable.
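The static-mode impact can be estimated with simple arithmetic: if every entry holds the same number of virtual nodes, the share of lookups that fall through to UFS is roughly the number of OFFLINE entries divided by the total number of entries. The Python sketch below checks that estimate with a simulated ring. The UUID labels, the scenario (four workers, each restarted once with a new UUID), the MD5-based hash, and the reduced virtual node count are assumptions for illustration.

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    """Map a string to a 64-bit ring position (MD5 is an illustrative choice)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

VNODES = 200  # illustrative; smaller than the documented default to keep the sketch fast

# Hypothetical etcd view: 4 configured workers, each restarted once with a new UUID.
entries = {f"uuid-live-{i}": "ONLINE" for i in range(4)}
entries.update({f"uuid-stale-{i}": "OFFLINE" for i in range(4)})

# Static mode: every registered entry keeps its virtual nodes on the ring.
points = sorted((ring_hash(f"{uuid}#{i}"), uuid) for uuid in entries for i in range(VNODES))
positions = [p for p, _ in points]
owners = [u for _, u in points]

def owner_of(key: str) -> str:
    """Return the entry that owns the first virtual node clockwise from the key."""
    return owners[bisect.bisect_right(positions, ring_hash(key)) % len(positions)]

n_keys = 20000
to_ufs = sum(entries[owner_of(f"/data/file-{k}")] == "OFFLINE" for k in range(n_keys))
ufs_fraction = to_ufs / n_keys
expected_fraction = sum(s == "OFFLINE" for s in entries.values()) / len(entries)

print(f"expected ~{expected_fraction:.2f}, simulated {ufs_fraction:.2f}")
```

In this hypothetical scenario about half of all lookups would bypass the cache, even though all four configured workers are online.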
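To verify that the ring is healthy, run alluxio info nodes and confirm that the ONLINE count equals your configured worker count.
To remove a stale OFFLINE entry from etcd, get its UUID from the alluxio info nodes output and pass it to alluxio process remove-worker (see Removing a Worker Permanently).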
Adding or Removing Workers
New workers register in etcd and join the hash ring automatically at startup. Removing a worker redistributes its hash ring portion to remaining workers, causing a temporary increase in cache misses while data is re-fetched from UFS.
On Kubernetes, adjust worker.count in alluxio-cluster.yaml and apply the change; see Scaling the Cluster for the full procedure.
On Docker or bare metal, start a new worker process on the target host (Docker: docker run ... alluxio/alluxio-enterprise worker; bare metal: bin/alluxio process start worker). To remove a worker, stop the worker process; see Removing a Worker Permanently for the deregistration step.
Removing a Worker Permanently
When decommissioning a node, stop the worker first and then explicitly deregister it from etcd so its entry does not remain as a stale OFFLINE node.
Step 1: Stop the worker.
On Kubernetes, scale down the worker count in alluxio-cluster.yaml and apply the change.
On Docker or bare metal, stop the worker process on the decommissioned host (Docker: docker stop alluxio-worker; bare metal: bin/alluxio process stop worker).
Step 2: Deregister the worker from etcd. Get the worker's UUID from alluxio info nodes, then run alluxio process remove-worker:
In dynamic mode, skipping Step 2 is usually safe — the OFFLINE entry is automatically purged after alluxio.worker.failure.detection.timeout. In static mode, Step 2 is required to prevent the stale entry from staying in the hash ring indefinitely.
Restarting a Worker
When a worker restarts, it is temporarily marked as offline. With identity persistence configured, the worker rejoins with the same UUID — preserving its ring position, cached data, and load distribution.
Without identity persistence, each restart generates a new UUID with different ring slots, causing cache misses and potential ring bloat. See Configuring the Hash Ring Mode and Diagnosing Hash Ring Bloat from OFFLINE Entries for details.
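The effect of identity persistence on routing can be sketched in Python. The example assumes that ring positions are derived from the worker's UUID; the UUID labels, the MD5-based hash, the virtual node count, and the helper names (ring_hash, assign) are hypothetical and are not taken from Alluxio's implementation.

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    """Map a string to a 64-bit ring position (MD5 is an illustrative choice)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

def assign(worker_ids, keys, vnodes=200):
    """Return a {key: worker_id} mapping for a ring built from the given identities."""
    points = sorted((ring_hash(f"{w}#{i}"), w) for w in worker_ids for i in range(vnodes))
    positions = [p for p, _ in points]
    owners = [w for _, w in points]
    return {
        k: owners[bisect.bisect_right(positions, ring_hash(k)) % len(positions)]
        for k in keys
    }

keys = [f"/data/file-{i}" for i in range(1000)]

m_before = assign(["uuid-1", "uuid-2", "uuid-3"], keys)
m_same = assign(["uuid-1", "uuid-2", "uuid-3"], keys)  # uuid-3 restarts with its persisted UUID
m_new = assign(["uuid-1", "uuid-2", "uuid-4"], keys)   # the same worker re-registers as uuid-4

cached_on_restarted = [k for k in keys if m_before[k] == "uuid-3"]
still_routed = sum(1 for k in cached_on_restarted if m_new[k] == "uuid-4")

print("mapping unchanged with persisted UUID:", m_same == m_before)
print(f"keys still routed to the restarted worker under a new UUID: "
      f"{still_routed} of {len(cached_on_restarted)}")
```

With the persisted UUID the mapping is identical before and after the restart. With a new UUID, only the keys that happen to hash onto the new identity's slots still reach the worker; the rest are routed elsewhere and must be re-fetched.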
Persisting Worker Identity
On Kubernetes, set worker.systemInfo.hostPath in alluxio-cluster.yaml before first deployment:
With Docker, pre-create an empty file on the host before first start, then mount it as a volume. Alluxio writes the UUID into it on startup, and it survives container recreation:
Add to the worker docker run command:
Do not use -v without pre-creating the file on the host first. If the host path does not exist, Docker creates a directory at that path instead of a file, causing Alluxio to fail with IOException: Is a directory.
For the full Docker setup guide, see Appendix B: Worker Identity.
On bare metal, set alluxio.worker.identity.uuid.file.path to a path that survives reboots:
The file is created automatically on first start. To manually pin an existing worker's UUID (e.g., when migrating to a persistent path), first get the current UUID from alluxio info nodes, then write it to the configured path.
If a worker is migrated to a different host, copy the identity file to the same path on the new host before starting the worker. Without it, the worker registers with a new UUID and all cached data from the previous identity becomes unreachable.