File Replication

Overview

A copy of a file's data and metadata on an Alluxio worker node is called a replica. File replication allows multiple workers to serve the same file to clients by creating replicas on different workers. It is therefore used to increase I/O performance when a cached dataset is accessed concurrently by a large number of clients.

If file replication is not enabled, a file has only one replica in the cluster at a time, so all client access to the file's metadata and data is served by the single worker storing that replica. When the file is accessed by a large number of clients concurrently, that worker's service capacity may become a bottleneck.

Creating Replicas

Replicas can be created either passively as a result of cache misses or manually via a distributed load job.

Creating replicas manually

Alluxio offers a distributed load command to load files into Alluxio's cache. The number of replicas can be specified when submitting the load job:

bin/alluxio job load --path s3://bucket/file --replicas 3 --submit

In the above example, the file located at s3://bucket/file is loaded into Alluxio with 3 replicas, which will reside on 3 different workers.
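
The submitted load job runs asynchronously. Depending on your Alluxio version, its progress can be checked by re-running the command with the --progress flag in place of --submit:

bin/alluxio job load --path s3://bucket/file --progress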

Note that the number of replicas set for the --replicas option should not exceed the configuration value of alluxio.user.file.replication.min, since clients only select among that many candidate workers when reading the file (see Reading from Multiple Replicas Concurrently below).

Creating replicas passively

When a client reads a file from Alluxio, the file is loaded into Alluxio if it is not cached by any worker. If file replication is enabled, clients will try to load the file from different workers, resulting in multiple replicas being created passively. The number of workers that clients will use to load the file is specified by the configuration property alluxio.user.file.replication.min. Set the property in conf/alluxio-site.properties to a value greater than 1, but less than the number of workers in the cluster.
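
For example, to have clients passively create up to 3 replicas (assuming a cluster with more than 3 workers), add the following to conf/alluxio-site.properties:

alluxio.user.file.replication.min=3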

Reading from Multiple Replicas Concurrently

A client can read from multiple replicas when alluxio.user.file.replication.min is set to a value greater than 1 and multiple replicas reside on different workers. If the replicas were created proactively via distributed load, alluxio.user.file.replication.min should be set to the same value as the --replicas option used for the load job. How the client chooses a replica is controlled by the replica selection policy.

A replica selection policy determines which worker the client should contact to read the file, given a list of all candidate workers. Alluxio offers two replica selection policies:

  • Global fixed: all clients choose workers in the same order as listed in the candidate worker list.

  • Client fixed: each client sees the same set of candidate workers, but the order of the workers may differ between clients. Therefore, different clients will choose different workers for reading the file, as illustrated below.
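
For example, suppose a file's candidate worker list contains three (hypothetical) workers: worker1, worker2, and worker3. Under the global fixed policy, every client tries worker1 first, so worker1 serves nearly all reads. Under the client fixed policy, one client may see the order worker2, worker3, worker1 while another sees worker3, worker1, worker2, spreading the read load across all replicas.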

If the worker chosen by the replica selection policy is unavailable, the client will try the next worker in the candidate list, failing with an error only after all options are exhausted.

To set the replica selection policy, set the configuration property alluxio.user.replica.selection.policy to either GLOBAL_FIXED or CLIENT_FIXED.
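
For example, to spread reads across replicas with the client fixed policy, add the following to conf/alluxio-site.properties:

alluxio.user.replica.selection.policy=CLIENT_FIXED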

Limitations

There are a few limitations with the file replication feature:

Consistency issue for mutable data

Multiple replicas work best with immutable data, since there is then no risk of inconsistency between the replicas. If the data in the UFS is mutable and can change over time, replicas loaded from the UFS at different times may correspond to different versions of the file. Thus, inconsistency may arise if two clients choose two different replicas of the same file.

To avoid potential inconsistency when the files are mutable, you can:

  • Manually refresh the file metadata of the replicas and invalidate the existing cached replicas.

This can be done with the distributed load tool using the --metadataOnly option:

bin/alluxio job load --path s3://bucket/file --replicas 3 --metadataOnly --submit

This command will launch a distributed job to load the metadata of the file s3://bucket/file. If there are any existing replicas of the file and Alluxio detects that the file has changed in the UFS, the existing replicas will be invalidated. When the file is subsequently accessed, Alluxio will load the latest data from the UFS.

  • Set a cache filter with an appropriate cache control policy for the files.

For example, if a file is expected to be updated once every 10 minutes on average, then a 10-minute maxAge policy can be used:

{
  "apiVersion": 1,
  "data": {
    "defaultType": "maxAge",
    "defaultMaxAge": "10min"
  },
  "metadata": {
    "defaultType": "maxAge",
    "defaultMaxAge": "10min"
  }
}

When a replica has a max age of 10 minutes, it will be invalidated and reloaded from the UFS once it is 10 minutes old. After each update to the file in the UFS, wait at least 10 minutes so that any existing replicas are invalidated.

Level of replication is not actively maintained

The target replication level specified by alluxio.user.file.replication.min serves only as an input to the replica selection algorithm. The actual number of replicas present in the cluster at any given moment can be less than, equal to, or greater than this value. Alluxio does not actively monitor the number of replicas, nor does it try to maintain the target replication level.

The actual number of replicas can be less than the target replication level for a number of reasons. For example, if a file was cached on several workers but later evicted from some of them because of insufficient cache capacity, its actual replication level drops below the target. In another case, where replicas are created passively as a result of cache misses, a client that finds the file already cached by a worker is served from that cache directly and no new replica is created; the file's replication level remains unchanged until a client reaches a worker that has not cached the file.

The actual number of replicas can be greater than the target when worker membership changes. Workers joining or leaving the cluster may alter the result of the replica selection algorithm, such that a worker holding a replica is never selected again, leaving that replica inaccessible. The replica will linger in the worker's cache until it is evicted due to insufficient cache capacity.
