I/O Resiliency
This document describes how Alluxio handles both expected and unexpected downtime, such as when Alluxio workers become unavailable. Multiple mechanisms are in place to handle application requests gracefully and ensure the overall I/O resiliency of the Alluxio system.
UFS fallback mechanism
Alluxio clients read both metadata and data from Alluxio workers. If the requested worker becomes unavailable or returns an error, the client can fall back to requesting the same information directly from the UFS instead.
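Conceptually, the fallback behaves like a try/catch around the worker read path. The sketch below is a simplified Java illustration of this pattern; `WorkerClient` and `UfsClient` are hypothetical stand-ins, not the actual Alluxio client API.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical stand-ins for the worker and UFS read paths.
interface WorkerClient { InputStream open(String path) throws IOException; }
interface UfsClient { InputStream open(String path) throws IOException; }

// Illustrative sketch of the UFS fallback pattern, not the real client API.
public class FallbackReader {
  private final WorkerClient workerClient; // reads via an Alluxio worker
  private final UfsClient ufsClient;       // reads directly from the UFS
  private final boolean ufsFallbackEnabled;

  public FallbackReader(WorkerClient workerClient, UfsClient ufsClient,
      boolean ufsFallbackEnabled) {
    this.workerClient = workerClient;
    this.ufsClient = ufsClient;
    this.ufsFallbackEnabled = ufsFallbackEnabled;
  }

  public InputStream open(String path) throws IOException {
    try {
      // Preferred path: read metadata and data from the Alluxio worker.
      return workerClient.open(path);
    } catch (IOException e) {
      // Worker unavailable or returned an error: fall back to the UFS
      // if the feature is enabled, otherwise surface the failure.
      if (ufsFallbackEnabled) {
        return ufsClient.open(path);
      }
      throw e;
    }
  }
}
```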
The UFS fallback feature is enabled by default. In certain situations it may cause undesirable behavior, such as overloading the UFS with a large number of simultaneous requests, commonly referred to as the thundering herd problem.
To disable the feature, set the following property:
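The exact property name can vary across Alluxio releases; the snippet below assumes the DORA-era property name, so verify it against the property reference for your version.

```properties
# Disable client fallback to the UFS when a worker is unavailable.
# Assumed property name; check your release's property list.
alluxio.dora.client.ufs.fallback.enabled=false
```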
Membership changes
Adding workers
When a worker is added to the cluster, it takes over part of the hash range from existing workers according to the consistent hashing algorithm. Clients continue to read files from the existing set of workers until they detect that the membership has changed. For requests directed to the new worker, read operations may slow down because the worker needs to load the data from the UFS again, but the operations should not encounter any errors.
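The Java sketch below models this behavior with a minimal hash ring (one position per worker, no virtual nodes). It is a simplified model for intuition, not Alluxio's actual implementation: adding a worker only remaps the paths whose hash positions fall in the range the new worker takes over.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal consistent-hashing sketch illustrating why adding a worker only
// takes over part of the hash range. Simplified model, not Alluxio's code.
public class HashRing {
  private final SortedMap<Integer, String> ring = new TreeMap<>();

  public void addWorker(String workerId) {
    ring.put(hash(workerId), workerId);
  }

  public void removeWorker(String workerId) {
    ring.remove(hash(workerId));
  }

  // A path maps to the first worker clockwise from its hash position.
  // Assumes at least one worker is present.
  public String getWorker(String path) {
    int h = hash(path);
    SortedMap<Integer, String> tail = ring.tailMap(h);
    return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
  }

  private static int hash(String key) {
    // Stable hash with extra mixing; real systems use a well-distributed
    // hash function such as MurmurHash.
    return key.hashCode() * 0x9E3779B9;
  }

  public static void main(String[] args) {
    HashRing ring = new HashRing();
    ring.addWorker("worker-1");
    ring.addWorker("worker-2");
    String before = ring.getWorker("/data/file-42");
    ring.addWorker("worker-3"); // only paths in worker-3's range move
    String after = ring.getWorker("/data/file-42");
    System.out.println(before + " -> " + after);
  }
}
```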
Removing workers
When a worker is removed, the paths previously directed to it are redistributed to other workers, again according to the consistent hashing algorithm. Ongoing requests served by the removed worker will fail. If the UFS fallback mechanism is enabled, the client will retry the request directly against the UFS. Once the client detects the membership change, subsequent requests will be directed to the remaining live workers.
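As a rough sketch of the client-side detection step, the loop below periodically refreshes the live worker set and swaps it in atomically, so requests issued after a refresh are routed to the remaining live workers. `MembershipService`, the refresh interval, and the discovery mechanism are all hypothetical placeholders, not Alluxio's actual internals.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of a client-side membership refresh loop. MembershipService is a
// hypothetical stand-in for whatever discovery mechanism is configured.
public class MembershipWatcher {
  interface MembershipService { List<String> liveWorkers(); }

  private final MembershipService membership;
  private final AtomicReference<List<String>> workers =
      new AtomicReference<>(List.of());
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  public MembershipWatcher(MembershipService membership) {
    this.membership = membership;
  }

  public void start() {
    // Periodically fetch the live worker set and swap it in atomically;
    // the 30-second interval is an arbitrary placeholder.
    scheduler.scheduleAtFixedRate(
        () -> workers.set(membership.liveWorkers()), 0, 30, TimeUnit.SECONDS);
  }

  // Used to rebuild the hash ring and route requests to live workers.
  public List<String> currentWorkers() {
    return workers.get();
  }
}
```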
For the S3 API interface, the application may see transient read errors for ongoing reads, but it should recover once the worker membership is updated.
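Applications can smooth over these transient errors with client-side retries. The example below configures the AWS SDK for Java (v1) with a custom retry count; the endpoint shown is a placeholder, so substitute your deployment's Alluxio S3 API address.

```java
import com.amazonaws.ClientConfiguration;
import com.amazonaws.client.builder.AwsClientBuilder;
import com.amazonaws.retry.PredefinedRetryPolicies;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

// Configure client-side retries so transient read errors during a
// membership change are retried rather than surfaced to the application.
public class S3RetryExample {
  public static void main(String[] args) {
    ClientConfiguration config = new ClientConfiguration()
        .withRetryPolicy(
            PredefinedRetryPolicies.getDefaultRetryPolicyWithCustomMaxRetries(5));

    AmazonS3 s3 = AmazonS3ClientBuilder.standard()
        .withClientConfiguration(config)
        .withEndpointConfiguration(new AwsClientBuilder.EndpointConfiguration(
            "http://alluxio-host:39999/api/v1/s3", "us-east-1")) // placeholder
        .withPathStyleAccessEnabled(true) // use path-style bucket addressing
        .build();

    // Transient failures while the worker set changes are retried by the SDK.
    System.out.println(s3.getObjectAsString("my-bucket", "my-key"));
  }
}
```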
Restarting workers
Restarting a worker behaves the same way as removing a worker and subsequently adding it back. Because each worker has a unique identity, which is used to form the consistent hash ring, the restarted worker is responsible for the same paths it previously managed, so the data cached on the worker remains valid.
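One way to picture such a stable identity is a UUID persisted across restarts, as in the hypothetical sketch below: because the identity file survives the restart, the worker rejoins the hash ring at the same positions. This illustrates the idea only; it is not Alluxio's actual bookkeeping, and the file path is a placeholder.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

// Sketch of a stable worker identity that survives restarts: generate a
// UUID once and persist it, so the worker keeps the same ring positions
// and its cached data remains valid. Hypothetical illustration.
public class WorkerIdentity {
  public static String loadOrCreate(Path idFile) throws IOException {
    if (Files.exists(idFile)) {
      // Reuse the persisted identity: same ring positions after restart.
      return Files.readString(idFile).trim();
    }
    String id = UUID.randomUUID().toString();
    Files.writeString(idFile, id); // first start: persist a new identity
    return id;
  }

  public static void main(String[] args) throws IOException {
    String id = loadOrCreate(Path.of("/tmp/alluxio-worker-id")); // placeholder
    System.out.println("worker identity: " + id);
  }
}
```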