Multiple Replicas
Overview
A copy of a file's data and metadata on an Alluxio worker node is called a replica. File replication allows multiple workers to serve the same file to clients by creating replicas at different workers. File replication is therefore used to increase I/O performance when a dataset is cached and accessed concurrently by a large number of clients.
If file replication is not enabled, a file will only have one replica in the cluster at a time. This means all access from clients to the file's metadata and data will be served by the single worker where the replica is stored. In cases where this file needs to be accessed by a large number of clients concurrently, the service capacity of that worker may become a bottleneck.
Note that for a given file, a worker will hold at most one replica; a worker will never store multiple replicas of the same file.
Configuring the replication level
Before replicas are created, you first need to configure the replication level of files, i.e. how many copies of a file exist in a cluster. The replication level can be managed via a RESTful API endpoint on the coordinator:
Listing the current replica rules
To list all the current replica rules in effect, send a GET request to the endpoint:
The response is a JSON object containing all the replica rules:
This example shows a replica rule set on the root path, with all files having 1 replica.
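The exact response schema is not reproduced here; a hypothetical response matching this example might look like:

```json
{
  "rules": [
    {
      "path": "/",
      "numReplicasPerCluster": 1
    }
  ]
}
```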
Creating a new replica rule
To add a new replica rule, send a POST request with the following command:
The POST data of the request is a JSON object:
The path field specifies the path prefix for which the replica rule takes effect. The numReplicasPerCluster field specifies the level of replication for files under this path prefix.
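Using the two fields described above, a request body that sets 3 replicas for files under a hypothetical path /s3/vip_data might look like:

```json
{
  "path": "/s3/vip_data",
  "numReplicasPerCluster": 3
}
```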
Multiple replica rules can be specified for different path prefixes. For example, the following sets files under /s3/vip_data to have 3 replicas and /s3/temp_data to have 1 replica:
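Assuming one POST request per rule, the two request bodies might look like:

```json
{
  "path": "/s3/vip_data",
  "numReplicasPerCluster": 3
}
```

```json
{
  "path": "/s3/temp_data",
  "numReplicasPerCluster": 1
}
```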
Note that currently, if multiple replica rules are created, the paths of the rules must not be nested. For example, a rule set on /parent and a rule set on /parent/child cannot exist at the same time. Consequently, a rule set on the root path / cannot co-exist with rules set on any other paths. This limitation will be lifted in a future release.
Updating an existing replica rule
Updating an existing rule can be done in the same way as creating a new rule, but on the path of an existing rule:
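Assuming the same request shape as rule creation, an update body lowering the replication level of /s3/vip_data might look like:

```json
{
  "path": "/s3/vip_data",
  "numReplicasPerCluster": 1
}
```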
The above example updates the number of replicas on /s3/vip_data from 3 to 1.
Removing a replica rule
To remove a replica rule, send a DELETE request to the endpoint, with the path prefix of the rule to delete as the request body:
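The exact body format for the DELETE request is not shown here; assuming the same JSON shape as rule creation, deleting the rule on /s3/temp_data might look like:

```json
{
  "path": "/s3/temp_data"
}
```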
Default replication level
If no replica rules have been created, the default behavior is as if a rule of 1 replica is set on the root path:
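Expressed in the same rule format used above, the implied default is equivalent to:

```json
{
  "path": "/",
  "numReplicasPerCluster": 1
}
```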
As a result, all files will have only 1 replica across the Alluxio cluster.
The deprecated configuration property alluxio.user.file.replication.min can be used to change the default replication level, e.g. to 2:
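For example, in alluxio-site.properties (property name and value taken from the text above):

```properties
alluxio.user.file.replication.min=2
```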
We recommend setting an explicit rule on the root path rather than relying on this deprecated property.
Creating Replicas
Replicas can be created either passively as a result of cache misses, or manually via a distributed load.
Creating replicas manually
A distributed load job actively creates as many replicas as the effective replica rule specifies, without waiting for client reads to populate the cache.
Creating replicas passively
When a client reads a file from Alluxio, the file is loaded into Alluxio if it is not already cached by any worker. If file replication is enabled, different clients will load the file from different workers, causing multiple replicas to be created passively. The number of replicas created this way depends on how many clients have loaded the file. Since a client reads the file from only one worker at a time, if the file is requested by a single client, only one replica will be created on one worker. If the number of clients is large enough, the number of passively created replicas will likely reach the replication level specified in the replica rules. This differs from replicas created manually via a distributed load, which actively creates as many replicas as the replica rule specifies.
Reading from Multiple Replicas Concurrently
A client can read from multiple replicas if the file has more than one replica according to the effective replica rule. For a multi-replicated file, which replica a client chooses is controlled by the replica selection policy.
A replica selection policy determines which worker the client contacts to read the file, given a list of all candidate workers. Alluxio offers three replica selection policies:
Global fixed: all clients try workers in the same order as listed in the candidate worker list.
Client fixed: each client sees the same set of candidate workers, but in a per-client order. Therefore, different clients may choose different workers for reading the same file.
Random: each client chooses a random replica from the candidate workers every time the file is read.
To set the replica selection policy, set the configuration alluxio.user.replica.selection.policy to one of GLOBAL_FIXED, CLIENT_FIXED, or RANDOM.
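For example, to use the client fixed policy, in alluxio-site.properties:

```properties
alluxio.user.replica.selection.policy=CLIENT_FIXED
```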
If the worker chosen by the replica selection policy is unavailable due to temporary failures, the client will use the next worker in the candidate list until it exhausts all options and fails with an error.
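The selection and failover behavior described above can be sketched as follows. This is an illustrative model only, not Alluxio client code; the function names and worker identifiers are invented for the example.

```python
import random

def order_candidates(workers, policy, client_id=None):
    """Model of how each policy orders the candidate worker list."""
    if policy == "GLOBAL_FIXED":
        # Every client uses the same, shared order.
        return list(workers)
    if policy == "CLIENT_FIXED":
        # Same candidate set, but shuffled deterministically per client,
        # so different clients spread reads across different workers.
        rng = random.Random(client_id)
        shuffled = list(workers)
        rng.shuffle(shuffled)
        return shuffled
    if policy == "RANDOM":
        # A fresh random order on every read.
        shuffled = list(workers)
        random.shuffle(shuffled)
        return shuffled
    raise ValueError(f"unknown policy: {policy}")

def read_with_failover(ordered_workers, try_read):
    """Try workers in order until one succeeds, as the client does on
    temporary worker failures; fail only when all candidates are exhausted."""
    for worker in ordered_workers:
        try:
            return try_read(worker)
        except ConnectionError:
            continue  # temporary failure: move on to the next candidate
    raise RuntimeError("all replica workers are unavailable")

workers = ["worker-1", "worker-2", "worker-3"]
print(order_candidates(workers, "GLOBAL_FIXED"))      # identical for all clients
print(order_candidates(workers, "CLIENT_FIXED", 42))  # stable for client 42
```

Note that the deterministic per-client shuffle is just one plausible way to realize "same set, different order per client"; Alluxio's actual implementation may differ.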
Note that, by default, the client does not check whether a worker has actually cached the file before trying to read from it. If the selected worker has not cached the file, the worker will fetch and cache the file from the UFS before sending it to the client. The client won't switch to another replica worker unless the currently selected worker is unavailable. To improve I/O performance by using already cached replicas, see the next section.
Optimized I/O for multi-replicated files
If a file is replicated on multiple workers, Alluxio supports optimized I/O for faster access by prioritizing workers that have already fully cached the data. When an Alluxio client reads a multi-replicated file, it first checks whether the file is cached by any of the replica workers selected by the replica selection policy. If so, the client will try to read from the first worker that has fully cached the file. If none of the workers have fully cached the file, the client will do a cold read using the workers in the same way as described in the previous section.
Note that a worker that has only partially cached a file is not prioritized over a worker that has not cached the file at all.
To enable optimized I/O for multi-replicated files, set the following configuration:
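Assuming the relevant property is alluxio.user.replica.prefer.cached.replicas, which the metrics section below names as the switch for this feature, the setting in alluxio-site.properties would be:

```properties
alluxio.user.replica.prefer.cached.replicas=true
```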
Passive cache for auto replica creation
When optimized I/O is enabled and, during the initial check, a client finds that the file it is trying to read is not cached or only partially cached on a preferred worker, the client will switch to a less preferred data source for the file instead of doing a cold read. There are a number of reasons why a preferred worker may not have fully cached the file while a less preferred one has:
the cache of the file on the preferred worker has been evicted to make room for more recently accessed files
the cluster is expanded and the preferred worker is newly added to the cluster
Since the client will not trigger a cold read on the preferred worker, that worker will not cache the file unless the user runs an explicit load job to load the file into it. Meanwhile, probing whether the preferred worker has cached the file adds a constant, additional I/O latency to every read request. To eliminate this added latency, the user can enable the passive cache feature, which automatically and asynchronously loads a file into the preferred worker on a cache miss. When the client accesses the same file later, the preferred worker will have fully cached the file, so the client will not have to switch to a less preferred worker, improving I/O latency.
To enable passive cache, add the following configuration:
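Assuming the relevant property is alluxio.user.file.passive.cache.enabled, which the metrics section below names as the switch for this feature, the setting in alluxio-site.properties would be:

```properties
alluxio.user.file.passive.cache.enabled=true
```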
Monitoring replicas via metrics
When alluxio.user.replica.prefer.cached.replicas is enabled, the metric alluxio_multi_replica_read_from_workers_bytes_total monitors data read patterns across clusters. This metric provides insight into read traffic distribution and the effectiveness of multi-replica caching under high availability configurations.
Type: counter
Component: client
Description: Tracks the total number of bytes read by a client from Alluxio workers when reading multi-replica files. This metric only takes effect when alluxio.user.replica.prefer.cached.replicas is set to true. It helps analyze how often the client reads from local versus remote clusters and whether the reads come from fully cached replicas.
Labels:
cluster_name: The name of the cluster where the worker serving the read resides.
local_cluster: true if the worker belongs to the same cluster as the client, false otherwise.
hot_read: true if the worker has fully cached the file being read, false if the worker has not cached or has only partially cached the replica.
When passive cache is enabled, the metric alluxio_passive_cache_async_loaded_files monitors the number of files that are loaded by the passive cache feature.
Type: counter
Component: worker
Description: Tracks the total number of files that are loaded by a worker as triggered by passive cache. This metric only takes effect when alluxio.user.file.passive.cache.enabled is set to true. It helps analyze how many files encountered a cache miss on a preferred worker and therefore needed to be loaded by passive cache.
Labels:
result: the result of loading the file, one of "submitted", "success", and "failure". When a file is submitted for async loading, the metric labeled "submitted" is incremented. When the load completes successfully or unsuccessfully, the metric labeled "success" or "failure" is incremented, respectively.
Limitations
Consistency issue for mutable data
Multiple replicas work best with immutable data since there won't be inconsistency between the replicas. If the data in the UFS is mutable and can change over time, replicas loaded from UFS at different times may be different versions of the file. Thus, inconsistency may arise if two clients choose two different replicas of the same file.
To avoid potential inconsistency when the files are mutable, you can manually refresh the file metadata of the replicas and invalidate the existing cached replicas.
This command will launch a distributed job to load the metadata of the file s3://bucket/file.
If there are any existing replicas of the file and Alluxio detects that the file has changed in the UFS, the existing replicas will be invalidated. When the file is subsequently accessed, Alluxio will load the latest data from the UFS.
Alternatively, a maxAge policy can be used to bound how stale a replica can become. For example, if a file is expected to be updated once every 10 minutes on average, then a 10-minute maxAge policy can be used:
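The exact syntax for configuring a maxAge policy is not shown in this page; as a purely hypothetical illustration of the idea, such a policy might be expressed as a fragment like:

```json
{
  "path": "/s3/vip_data",
  "maxAge": "10min"
}
```

Consult the Alluxio documentation for the actual schema and the mechanism that applies it.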
When a replica has a max age of 10 minutes, it will be invalidated and reloaded from the UFS after 10 minutes. Every time an update is made to the file in the UFS, wait at least 10 minutes so that any existing replicas will be invalidated.
Level of replication is not actively maintained
The target replication level specified by the replica rules serves only as an input to the replica selection algorithm. The actual number of replicas present in a cluster at any given moment can be less than, equal to, or greater than the configured value. Alluxio does not actively monitor the number of replicas or try to maintain the target replication level.
The actual number of replicas can fall below the target replication level for a number of reasons. For example, if a file was cached on several workers but was later evicted from some of them because of insufficient cache capacity, its actual replication level will drop below the target. In another case, when replicas are created passively as a result of cache misses, a client that finds the file already cached by a worker will be served from that cache directly, and no new replica is created. The file's replication level will remain unchanged until a client reaches a worker that has not cached the file.