# Hash Ring and Worker Lifecycle

This guide covers the consistent hash ring — its configuration, worker lifecycle operations, and diagnostic procedures. For per-worker configuration (storage, resources, JVM, network), see [Worker Configuration](/ee-ai-en/ai-3.8-15.1.x/administration/managing-worker.md). For cluster-wide operations (scaling, upgrades, coordinator), see [Cluster Management](/ee-ai-en/ai-3.8-15.1.x/administration/managing-alluxio.md).

{% hint style="warning" %}
Define hash ring settings during the initial cluster setup. Changing them on a running cluster is destructive: it changes how data is mapped to workers, so all cached data is lost.
{% endhint %}

Alluxio uses a consistent hash ring to map data to workers in a decentralized manner. You can fine-tune its behavior to optimize for different cluster environments and workloads.

## 1. Pre-Deployment Configuration

Set the following properties before first deployment:

| Property                                                                             | Default          | When to change                                                                                                                                                              |
| ------------------------------------------------------------------------------------ | ---------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `alluxio.user.dynamic.consistent.hash.ring.enabled`                                  | `true` (dynamic) | Set to `false` (static) only if you need a stable ring view despite temporary worker unavailability — see [Configuring the Hash Ring Mode](#configuring-the-hash-ring-mode) |
| `alluxio.user.worker.selection.policy.consistent.hash.virtual.node.count.per.worker` | `2000`           | Rarely — only for very small or heavily imbalanced clusters. See [Adjusting Virtual Nodes](#adjusting-virtual-nodes-for-load-balancing)                                     |
| `alluxio.user.worker.selection.policy.consistent.hash.provider.impl`                 | `DEFAULT`        | Set to `CAPACITY` for heterogeneous worker clusters. See [Optimizing for Heterogeneous Workers](#optimizing-for-heterogeneous-workers)                                      |

{% tabs %}
{% tab title="Kubernetes (Operator)" %}
Set under `.spec.properties` in `alluxio-cluster.yaml`:

```yaml
spec:
  properties:
    alluxio.user.dynamic.consistent.hash.ring.enabled: "true"
    alluxio.user.worker.selection.policy.consistent.hash.provider.impl: DEFAULT
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}
Set in `conf/alluxio-site.properties` (applies to client, worker, and coordinator processes):

```properties
alluxio.user.dynamic.consistent.hash.ring.enabled=true
alluxio.user.worker.selection.policy.consistent.hash.provider.impl=DEFAULT
```

{% endtab %}
{% endtabs %}

## 2. Hash Ring Configuration

### Configuring the Hash Ring Mode

The consistent hash ring can operate in two modes: dynamic (default) or static.

In **dynamic mode** (default), the hash ring includes only online workers. When a worker goes offline, it is removed from the ring after the liveness timeout, and its virtual nodes are redistributed to live workers. This is the best choice for most deployments — the ring adapts automatically to permanent topology changes such as scaling or node replacement.

In **static mode**, the hash ring retains all registered workers regardless of their online status. This mode is designed for **planned short-term maintenance** — rolling software upgrades, hardware servicing, or brief restarts where workers are expected to rejoin quickly. By keeping offline workers in the ring, Alluxio avoids resharding cached data: when the worker comes back, it reclaims its original ring slots and all its cached data is immediately routable again. The trade-off is that during the downtime window, requests that hash to the offline worker fall through to UFS.

To configure the mode, set the `alluxio.user.dynamic.consistent.hash.ring.enabled` property. Set it to `true` for dynamic mode (the default) or `false` for static mode.

### Adjusting Virtual Nodes for Load Balancing

To ensure an even distribution of data and I/O requests, Alluxio uses virtual nodes. Each worker is mapped to multiple virtual nodes on the hash ring, which helps to balance the load more effectively across the cluster.

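You can adjust the number of virtual nodes per worker with the `alluxio.user.worker.selection.policy.consistent.hash.virtual.node.count.per.worker` property (default: `2000`). As a general property of consistent hashing, more virtual nodes per worker smooth out the distribution at the cost of a larger ring, while fewer virtual nodes make the ring smaller but less even. The default rarely needs changing; consider tuning it only for clusters with a small number of workers or a persistent load imbalance.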

### Optimizing for Heterogeneous Workers

By default, the consistent hashing algorithm assumes that all workers have equal capacity. In clusters with heterogeneous workers (e.g., different storage capacities or network speeds), you can enable capacity-based allocation for more balanced resource utilization. This ensures that workers with more storage handle a proportionally larger share of data.

To enable this, set the `alluxio.user.worker.selection.policy.consistent.hash.provider.impl` property to `CAPACITY`. The default value is `DEFAULT`, which allocates an equal number of virtual nodes to each worker.
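
As a rough illustration of what "proportionally larger share" means, the sketch below computes the expected share of data per worker from a list of capacities. It is plain arithmetic under the assumption of strictly proportional allocation; the function name `capacity_shares` is hypothetical, and how Alluxio derives virtual node counts internally is not described here.

```shell
# Hypothetical helper: given worker capacities (all in the same unit) as
# arguments, print each worker's expected share under proportional allocation.
capacity_shares() {
  total=0
  for c in "$@"; do total=$(( total + c )); done
  [ "$total" -gt 0 ] || return 1   # nothing to divide by
  for c in "$@"; do
    # Integer percentage, rounded down
    echo "capacity $c -> $(( c * 100 / total ))% of data"
  done
}

# Three workers with 1, 2, and 1 units of cache capacity
capacity_shares 1 2 1
```

With the `DEFAULT` provider, the same three workers would each be expected to hold roughly a third of the data regardless of capacity.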

For the worker-side YAML configuration (labeling nodes, `workerGroups`), see [Heterogeneous Workers](/ee-ai-en/ai-3.8-15.1.x/administration/managing-worker.md#heterogeneous-workers).

### Worker Liveness Detection

Each worker maintains active communication with etcd. If a worker fails to communicate with etcd within the timeout period, it is considered offline:

```properties
# increase if workers are slow to reconnect after transient network issues (default: 15s)
alluxio.worker.failure.detection.timeout=30s
```

In dynamic mode, an `OFFLINE` worker's virtual nodes are removed from the hash ring after this timeout and redistributed to live workers. In static mode, the `OFFLINE` entry remains in the ring — requests hashing to it fall through to UFS until the worker rejoins or is explicitly removed.

### Client Worker List Refresh

Clients maintain a local snapshot of the worker list and refresh it periodically from etcd:

```properties
# reduce for faster failover; increase to lower etcd read load (default: 45s)
alluxio.user.worker.list.refresh.interval=2m
```

After adding or removing workers, clients will reflect the change within one refresh interval. To force immediate propagation during incident recovery, restart the client-side process or reduce this interval temporarily.
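
As a back-of-the-envelope estimate (an inference from the two settings above, not a documented guarantee), the worst-case delay before a client stops routing to a failed worker in dynamic mode is roughly the failure detection timeout plus one refresh interval. The function name below is hypothetical.

```shell
# Hypothetical helper: approximate worst-case seconds before a client notices a
# failed worker, assuming detection and refresh happen back to back.
worst_case_failover_secs() {
  detection_timeout_secs=$1   # alluxio.worker.failure.detection.timeout
  refresh_interval_secs=$2    # alluxio.user.worker.list.refresh.interval
  echo $(( detection_timeout_secs + refresh_interval_secs ))
}

worst_case_failover_secs 15 45    # defaults quoted above: prints 60
worst_case_failover_secs 30 120   # example values from the snippets above: prints 150
```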

## 3. Worker Lifecycle on the Ring

Alluxio's decentralized architecture relies on workers that are managed via a consistent hash ring. This section covers operational procedures for workers joining, leaving, and restarting on the ring.

### Checking Worker Status

To see a list of all registered workers and their current status:

{% tabs %}
{% tab title="Kubernetes (Operator)" %}

```shell
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- alluxio info nodes
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}

```shell
bin/alluxio info nodes
```

{% endtab %}
{% endtabs %}

Example output:

```console
WorkerID                              Host               Status
e87b4097-b7f8-48a4-a388-3197124ebbe7  ip-172-16-10-12    ONLINE
a1f23c45-d678-90ab-cdef-123456789012  ip-172-16-10-13    ONLINE
```

#### Diagnosing Hash Ring Bloat from OFFLINE Entries

A healthy cluster should show one `ONLINE` entry per configured worker. If `alluxio info nodes` shows more total entries than configured workers, stale `OFFLINE` entries are accumulating in etcd. If the `ONLINE` count is lower than the configured worker count, at least one worker is not currently registered as online.

**Impact (static mode):** With `alluxio.user.dynamic.consistent.hash.ring.enabled=false`, `OFFLINE` entries remain in the ring indefinitely. A proportional fraction of hash lookups then land on `OFFLINE` nodes and fall through to UFS, turning what would be cache hits into direct S3/GCS reads at native object-store speeds.

In dynamic mode (default), `OFFLINE` entries are automatically removed from the ring after `alluxio.worker.failure.detection.timeout` (see [Worker Liveness Detection](#worker-liveness-detection)), so ring bloat does not persist. However, each restart with a new UUID still remaps virtual nodes, making previously cached data temporarily unreachable.
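
To get a rough sense of that proportion in static mode, the sketch below estimates the share of ring entries that are `OFFLINE` from `alluxio info nodes` output. It assumes the column layout of the example output above (status in the last column) and the `DEFAULT` provider, where every worker holds the same number of virtual nodes. The function name `offline_ring_share` is hypothetical, and the real share of affected requests also depends on which paths are accessed.

```shell
# Hypothetical helper: reads `alluxio info nodes` output on stdin and prints the
# percentage of ring entries that are OFFLINE (integer, rounded down).
offline_ring_share() {
  awk '$NF == "ONLINE" || $NF == "OFFLINE" {
         total++
         if ($NF == "OFFLINE") offline++
       }
       END {
         if (total == 0) { print "no entries"; exit 1 }
         printf "%d of %d entries OFFLINE (~%d%%)\n", offline, total, int(offline * 100 / total)
       }'
}

# Usage (bare-metal): bin/alluxio info nodes | offline_ring_share
```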

Verify that the ring is healthy. The `ONLINE` count should equal your configured worker count:

{% tabs %}
{% tab title="Kubernetes (Operator)" %}

```shell
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- alluxio info nodes | grep -c ONLINE
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}

```shell
bin/alluxio info nodes | grep -c ONLINE
```

{% endtab %}
{% endtabs %}
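
For scripted health checks, the comparison can be wrapped in a small function. This is a minimal sketch that assumes the status is the last column of each line, as in the example output earlier; `check_online_count` is a hypothetical name, not an Alluxio command.

```shell
# Hypothetical helper: compare the ONLINE count in `alluxio info nodes` output
# (read from stdin) with the expected worker count passed as $1.
check_online_count() {
  expected=$1
  # Count lines whose last field is exactly ONLINE; "n + 0" prints 0 when none match
  online=$(awk '$NF == "ONLINE" { n++ } END { print n + 0 }')
  if [ "$online" -eq "$expected" ]; then
    echo "OK: $online/$expected workers ONLINE"
  else
    echo "MISMATCH: $online/$expected workers ONLINE" >&2
    return 1
  fi
}

# Usage (bare-metal): bin/alluxio info nodes | check_online_count 3
```

The non-zero return code on a mismatch makes the function usable in monitoring scripts or CI gates.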

To remove a stale `OFFLINE` entry from etcd, get its UUID from the `alluxio info nodes` output and run:

{% tabs %}
{% tab title="Kubernetes (Operator)" %}

```shell
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio process remove-worker -n <STALE_WORKER_UUID>
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}

```shell
bin/alluxio process remove-worker -n <STALE_WORKER_UUID>
```

{% endtab %}
{% endtabs %}
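
When several stale entries have accumulated, it can help to generate the removal commands for review instead of typing each UUID by hand. The sketch below only prints commands and executes nothing. It assumes the WorkerID is the first column and the status the last, as in the example output earlier, and `print_remove_commands` is a hypothetical name.

```shell
# Hypothetical helper: reads `alluxio info nodes` output on stdin and prints one
# remove-worker command per OFFLINE entry. Nothing is executed.
print_remove_commands() {
  awk '$NF == "OFFLINE" { print "bin/alluxio process remove-worker -n " $1 }'
}

# Usage (bare-metal): bin/alluxio info nodes | print_remove_commands
```

Review the printed list before running anything: in static mode, a worker that is only down for planned maintenance also appears as `OFFLINE` and should not be removed.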

### Adding or Removing Workers

New workers register in etcd and join the hash ring automatically at startup. Removing a worker redistributes its hash ring portion to remaining workers, causing a temporary increase in cache misses while data is re-fetched from UFS.

{% tabs %}
{% tab title="Kubernetes (Operator)" %}
Adjust `worker.count` in `alluxio-cluster.yaml` and apply the change. See [Scaling the Cluster](/ee-ai-en/ai-3.8-15.1.x/administration/managing-alluxio.md#scaling-the-cluster) for the full procedure.
{% endtab %}

{% tab title="Docker / Bare-Metal" %}
Start a new worker process on the target host (Docker: `docker run ... alluxio/alluxio-enterprise worker`; bare-metal: `bin/alluxio process start worker`). To remove a worker, stop the worker process; see [Removing a Worker Permanently](#removing-a-worker-permanently) for the deregistration step.
{% endtab %}
{% endtabs %}

### Removing a Worker Permanently

When decommissioning a node, stop the worker first and then explicitly deregister it from etcd so its entry does not remain as a stale `OFFLINE` node.

**Step 1: Stop the worker.**

{% tabs %}
{% tab title="Kubernetes (Operator)" %}
Scale down the worker count in `alluxio-cluster.yaml` and apply:

```shell
kubectl apply -f alluxio-cluster.yaml
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}
Stop the worker process on the decommissioned host (Docker: `docker stop alluxio-worker`; bare-metal: `bin/alluxio process stop worker`).
{% endtab %}
{% endtabs %}

**Step 2: Deregister the worker from etcd.** Get the worker's UUID from `alluxio info nodes`, then run `alluxio process remove-worker`:

{% tabs %}
{% tab title="Kubernetes (Operator)" %}

```shell
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- alluxio info nodes
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio process remove-worker -n <WORKER_UUID>
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}

```shell
bin/alluxio info nodes
bin/alluxio process remove-worker -n <WORKER_UUID>
```

{% endtab %}
{% endtabs %}

In dynamic mode, skipping Step 2 is usually safe, because the `OFFLINE` entry is automatically removed from the hash ring after `alluxio.worker.failure.detection.timeout`. In static mode, Step 2 is required; otherwise the stale entry stays in the hash ring indefinitely.

### Restarting a Worker

When a worker restarts, it is temporarily marked as offline. With identity persistence configured, the worker rejoins with the same UUID — preserving its ring position, cached data, and load distribution.

Without identity persistence, each restart generates a new UUID with different ring slots, causing cache misses and potential ring bloat. See [Configuring the Hash Ring Mode](#configuring-the-hash-ring-mode) and [Diagnosing Hash Ring Bloat from OFFLINE Entries](#diagnosing-hash-ring-bloat-from-offline-entries) for details.

#### Persisting Worker Identity

{% tabs %}
{% tab title="Kubernetes (Operator)" %}
Set `worker.systemInfo.hostPath` in `alluxio-cluster.yaml` before first deployment:

```yaml
spec:
  worker:
    useExternalId: false
    systemInfo:
      hostPath: /mnt/alluxio/system-info
```

{% endtab %}

{% tab title="Docker" %}
Pre-create an empty file on the host before first start, then mount it as a volume. Alluxio writes the UUID into it on startup, and it survives container recreation:

```shell
sudo mkdir -p /etc/alluxio
sudo touch /etc/alluxio/worker_identity
```

Add to the worker `docker run` command:

```shell
-v /etc/alluxio/worker_identity:/opt/alluxio/conf/worker_identity
```

{% hint style="warning" %}
Do **not** use `-v` without pre-creating the file on the host first. If the host path does not exist, Docker creates a directory at that path instead of a file, causing Alluxio to fail with `IOException: Is a directory`.
{% endhint %}

For the full Docker setup guide, see [Appendix B: Worker Identity](https://documentation.alluxio.io/ee-ai-en/ai-3.8-15.1.x/administration/pages/0buMugvlVqxhUEBHLQwi#b.-worker-identity).
{% endtab %}

{% tab title="Bare-Metal" %}
Set `alluxio.worker.identity.uuid.file.path` to a path that survives reboots:

```properties
alluxio.worker.identity.uuid.file.path=/etc/alluxio/worker_identity
```

The file is created automatically on first start. To manually pin an existing worker's UUID (e.g., when migrating to a persistent path), first get the current UUID from `alluxio info nodes`, then write it to the configured path.
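
The manual pinning step could look roughly like the sketch below. It assumes the column layout of the `alluxio info nodes` example output (WorkerID, Host, Status) and that the identity file holds just the UUID. The exact on-disk format is an assumption, so compare against a file that Alluxio generated itself before relying on this. `uuid_for_host` is a hypothetical helper name.

```shell
# Hypothetical helper: print the WorkerID of the ONLINE entry for a given host
# from `alluxio info nodes` output read on stdin.
uuid_for_host() {
  awk -v host="$1" '$2 == host && $NF == "ONLINE" { print $1 }'
}

# Usage sketch (host name and path are placeholders):
#   uuid=$(bin/alluxio info nodes | uuid_for_host ip-172-16-10-12)
#   printf '%s\n' "$uuid" | sudo tee /etc/alluxio/worker_identity >/dev/null
```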
{% endtab %}
{% endtabs %}

{% hint style="warning" %}
If a worker is migrated to a different host, copy the identity file to the same path on the new host before starting the worker. Without it, the worker registers with a new UUID and all cached data from the previous identity becomes unreachable.
{% endhint %}

### Cache Recovery After Worker Restart

**With identity persistence configured**, a restarted worker rejoins the hash ring under the same UUID and its previously cached pages remain accessible. However, the restarted worker may lack data that was **loaded onto other workers while it was offline** — during the outage those files were routed to workers that temporarily held this worker's ring slots, so the recovered worker has never cached them. Once it rejoins, those files hash back to it but are absent from its local cache.

To fill those gaps without redundant re-loading:

{% tabs %}
{% tab title="Kubernetes (Operator)" %}

```shell
# Confirm the worker has rejoined the ring
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio info nodes | grep ONLINE
# Expected: N entries all showing ONLINE (N = configured worker count)

# Re-trigger loading — skips files already cached on any worker
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job load --path <ufs-or-alluxio-path> --submit --skip-if-exists

# Monitor until SUCCEEDED
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job load --path <ufs-or-alluxio-path> --progress
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}

```shell
# Confirm the worker has rejoined the ring
bin/alluxio info nodes | grep ONLINE
# Expected: N entries all showing ONLINE (N = configured worker count)

# Re-trigger loading (skips files already cached on any worker)
bin/alluxio job load --path <ufs-or-alluxio-path> --submit --skip-if-exists

# Monitor until SUCCEEDED
bin/alluxio job load --path <ufs-or-alluxio-path> --progress
```

{% endtab %}
{% endtabs %}

`--skip-if-exists` ensures that files already cached on healthy workers are not re-fetched from UFS. Only the files that now map to the recovered worker but are not yet cached there will be loaded.
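
In automation, the "confirm, then reload" sequence can be made explicit by polling until the expected number of workers is `ONLINE` before submitting the load job. The following is a minimal sketch: `wait_for_online` is a hypothetical helper, and it assumes the status is the last column of `alluxio info nodes` output.

```shell
# Hypothetical helper: poll a command that prints `alluxio info nodes` output
# until at least <expected> entries are ONLINE, or give up after <attempts> tries.
# Usage: wait_for_online <expected> <attempts> <sleep_secs> <command...>
wait_for_online() {
  expected=$1; attempts=$2; sleep_secs=$3
  shift 3
  i=1
  while [ "$i" -le "$attempts" ]; do
    online=$("$@" | awk '$NF == "ONLINE" { n++ } END { print n + 0 }')
    if [ "$online" -ge "$expected" ]; then
      echo "ring ready: $online ONLINE"
      return 0
    fi
    # Sleep between attempts, but not after the final one
    if [ "$i" -lt "$attempts" ]; then
      sleep "$sleep_secs"
    fi
    i=$(( i + 1 ))
  done
  echo "timed out: $online of $expected ONLINE" >&2
  return 1
}

# Usage (bare-metal), waiting up to 30 x 10s for 3 workers:
#   wait_for_online 3 30 10 bin/alluxio info nodes && \
#     bin/alluxio job load --path <ufs-or-alluxio-path> --submit --skip-if-exists
```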

**Without identity persistence**, the worker registers with a new UUID, occupying different ring slots. All data cached under the old UUID becomes unreachable, and a full reload is required to restore cache coverage. This is why identity persistence is a production requirement — see [Persisting Worker Identity](#persisting-worker-identity).

