I/O Resiliency

Overview of I/O Resiliency

Alluxio is designed to be highly resilient to common failures in a distributed environment. It has multiple, layered mechanisms to ensure that application I/O requests can continue gracefully even when workers become unavailable. This ensures that unexpected downtime or routine maintenance does not disrupt your data workloads.

The primary resiliency mechanisms are:

UFS Fallback: Automatically read from the underlying storage if a worker is down.
Retry Across Replicas: Try reading from other workers that have a copy of the data.
Multi-AZ Failover: In a multi-cluster deployment, failover to a cluster in another Availability Zone.

Failover Priority: The Order of Operations

When an Alluxio client fails to read data from its preferred worker, it will automatically attempt to recover by trying different sources in the following order:

Retry with Local Replicas: If data replication is enabled, the client will first try to read from other workers in the same cluster that have a replica of the requested data.
Failover to Remote Clusters (Multi-AZ): If all local replicas are unavailable and multi-AZ support is enabled, the client will then try to read from workers in other federated clusters, typically located in different Availability Zones.
Fallback to UFS: If all Alluxio workers (both local and remote) are unavailable, the client will fall back to reading the data directly from the Under File System (UFS) as a last resort.

This layered approach maximizes the chance of a successful read while prioritizing the fastest available data source.

How Worker Lifecycle Events Affect I/O

The decentralized architecture is designed to handle dynamic changes in cluster membership.

When a Worker is Added: The new worker joins the consistent hash ring and takes over a portion of the hash range. Read operations for data in its range may initially be slow as it loads data from the UFS, but the operations will succeed.
When a Worker is Removed or Fails: Ongoing requests to the removed worker will fail. The client will then trigger the failover process described above. For FUSE clients, this process is seamless. For S3 or FSSpec clients, the application might see a transient error but should recover on retry once the client's view of the worker list is updated.
When a Worker is Restarted: This is treated as a worker removal followed by an addition. As long as the worker's identity is persisted, it will rejoin the ring and reclaim its original hash range, making its cached data available again.

Configuration

The UFS fallback feature is enabled by default. In some cases, such as to prevent a "thundering herd" of requests to your UFS, you may want to disable it.

Property

Description

Default

alluxio.dora.client.ufs.fallback.enabled

If true, the client will fall back to reading from the UFS if it cannot read from an Alluxio worker.

true

Last updated 2 days ago