Multiple Replicas
Overview
A copy of a file's data and metadata on an Alluxio worker node is called a replica. File replication allows multiple workers to serve the same file to clients by creating replicas at different workers. File replication is therefore used to increase I/O performance when a dataset is cached and accessed concurrently by a large number of clients.
If file replication is not enabled, a file will only have one replica in the cluster at a time. This means all access from clients to the file's metadata and data will be served by the single worker where the replica is stored. In cases where this file needs to be accessed by a large number of clients concurrently, the service capacity of that worker may become a bottleneck.
Note that for a given file, a worker will hold at most one replica; a worker will never store multiple replicas of the same file.
Configuring the replication level
Before replicas are created, you first need to configure the replication level of files, i.e. how many copies of a file exist in a cluster. The replication level can be managed via a RESTful API endpoint on the coordinator:
Listing the current replica rules
To list all the current replica rules in effect, send a GET request to the endpoint:
The response is a JSON object containing all the replica rules:
This example shows a replica rule set on the root path, with all files having 1 replica.
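The exact response schema is not reproduced here; a hypothetical response matching this example might look like:

```json
{
  "rules": [
    {
      "path": "/",
      "numReplicasPerCluster": 1
    }
  ]
}
```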
Creating a new replica rule
To add a new replica rule, send a POST request with the following command:
The POST data of the request is a JSON object:
The path field specifies the path prefix for which the replica rule takes effect. The numReplicasPerCluster field specifies the level of replication for files under this path prefix.
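Using the two fields described above, a request body that sets 3 replicas for files under a hypothetical path /s3/vip_data might look like:

```json
{
  "path": "/s3/vip_data",
  "numReplicasPerCluster": 3
}
```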
Multiple replica rules can be specified for different path prefixes. For example, the following sets files under /s3/vip_data to have 3 replicas and /s3/temp_data to have 1 replica:
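Assuming one POST request per rule, the two request bodies might look like:

```json
{
  "path": "/s3/vip_data",
  "numReplicasPerCluster": 3
}
```

```json
{
  "path": "/s3/temp_data",
  "numReplicasPerCluster": 1
}
```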
Note that currently, if multiple replica rules are created, the paths of the rules must not be nested. For example, a rule set on /parent and a rule set on /parent/child cannot exist at the same time. Consequently, a rule set on the root path / cannot co-exist with rules set on any other paths. This limitation will be lifted in a future release.
Updating an existing replica rule
Updating an existing rule can be done in the same way as creating a new rule, but on the path of an existing rule:
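Assuming the same request shape as rule creation, an update body lowering the replication level of /s3/vip_data might look like:

```json
{
  "path": "/s3/vip_data",
  "numReplicasPerCluster": 1
}
```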
The above example updates the number of replicas on /s3/vip_data from 3 to 1.
Removing a replica rule
To remove a replica rule, send a DELETE request to the endpoint, with the path prefix of the rule to delete as the request body:
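The exact body format for the DELETE request is not shown here; assuming the same JSON shape as rule creation, deleting the rule on /s3/temp_data might look like:

```json
{
  "path": "/s3/temp_data"
}
```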
Default replication level
If no replica rules have been created, the default behavior is as if a rule of 1 replica is set on the root path:
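Expressed in the same rule format used above, the implied default is equivalent to:

```json
{
  "path": "/",
  "numReplicasPerCluster": 1
}
```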
As a result, all files will have only 1 replica across the Alluxio cluster.
The deprecated configuration property alluxio.user.file.replication.min can be used to change the default replication level, e.g. to 2:
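For example, in alluxio-site.properties (property name and value taken from the text above):

```properties
alluxio.user.file.replication.min=2
```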
We recommend setting an explicit rule on the root path rather than relying on this deprecated property.
Creating Replicas
Replicas can be created either passively as a result of cache misses, or manually via a distributed load.
Creating replicas manually
A distributed load job actively creates as many replicas as the effective replica rule specifies, without waiting for client reads to populate the cache.
Creating replicas passively
When a client reads a file from Alluxio, the file is loaded into Alluxio if it is not already cached by any worker. If file replication is enabled, different clients will load the file from different workers, causing multiple replicas to be created passively. The number of replicas created this way depends on how many clients have loaded the file. Since a client reads the file from only one worker at a time, if the file is requested by a single client, only one replica will be created on one worker. If the number of clients is large enough, the number of passively created replicas will likely reach the replication level specified in the replica rules. This differs from replicas created manually via a distributed load, which actively creates as many replicas as the replica rule specifies.
Reading from Multiple Replicas Concurrently
A client can read from multiple replicas if the file has more than one replica according to the effective replica rule. For a multi-replicated file, which replica a client chooses is controlled by the replica selection policy.
A replica selection policy determines which worker the client contacts to read the file, given a list of all candidate workers. Alluxio offers three replica selection policies:
Global fixed: all clients try workers in the same order as listed in the candidate worker list.
Client fixed: each client sees the same set of candidate workers, but in a per-client order. Therefore, different clients may choose different workers for reading the same file.
Random: each client chooses a random replica from the candidate workers every time the file is read.
To set the replica selection policy, set the configuration alluxio.user.replica.selection.policy to one of GLOBAL_FIXED, CLIENT_FIXED, or RANDOM.
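For example, to use the client fixed policy, in alluxio-site.properties:

```properties
alluxio.user.replica.selection.policy=CLIENT_FIXED
```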
If the worker chosen by the replica selection policy is unavailable due to temporary failures, the client will use the next worker in the candidate list until it exhausts all options and fails with an error.
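The selection and failover behavior described above can be sketched as follows. This is an illustrative model only, not Alluxio client code; the function names and worker identifiers are invented for the example.

```python
import random

def order_candidates(workers, policy, client_id=None):
    """Model of how each policy orders the candidate worker list."""
    if policy == "GLOBAL_FIXED":
        # Every client uses the same, shared order.
        return list(workers)
    if policy == "CLIENT_FIXED":
        # Same candidate set, but shuffled deterministically per client,
        # so different clients spread reads across different workers.
        rng = random.Random(client_id)
        shuffled = list(workers)
        rng.shuffle(shuffled)
        return shuffled
    if policy == "RANDOM":
        # A fresh random order on every read.
        shuffled = list(workers)
        random.shuffle(shuffled)
        return shuffled
    raise ValueError(f"unknown policy: {policy}")

def read_with_failover(ordered_workers, try_read):
    """Try workers in order until one succeeds, as the client does on
    temporary worker failures; fail only when all candidates are exhausted."""
    for worker in ordered_workers:
        try:
            return try_read(worker)
        except ConnectionError:
            continue  # temporary failure: move on to the next candidate
    raise RuntimeError("all replica workers are unavailable")

workers = ["worker-1", "worker-2", "worker-3"]
print(order_candidates(workers, "GLOBAL_FIXED"))      # identical for all clients
print(order_candidates(workers, "CLIENT_FIXED", 42))  # stable for client 42
```

Note that the deterministic per-client shuffle is just one plausible way to realize "same set, different order per client"; Alluxio's actual implementation may differ.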
Note that, by default, the client does not check whether a worker has actually cached the file before trying to read from it. If the selected worker has not cached the file, the worker will fetch and cache the file from the UFS before sending it to the client. The client won't switch to another replica worker unless the currently selected worker is unavailable. To improve I/O performance by using already cached replicas, see the next section.
Optimized I/O for multi-replicated files
If a file is replicated on multiple workers, Alluxio supports optimized I/O for faster access by prioritizing workers that have already fully cached the data. When an Alluxio client reads a multi-replicated file, it first checks whether the file is cached by any of the replica workers selected by the replica selection policy. If so, the client will try to read from the first worker that has fully cached the file. If none of the workers have fully cached the file, the client will do a cold read using the workers in the same way as described in the previous section.
Note that a worker that has only partially cached a file is not prioritized over a worker that has not cached the file at all.
To enable optimized I/O for multi-replicated files, set the following configuration:
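Assuming the relevant property is alluxio.user.replica.prefer.cached.replicas, which the metrics section below names as the switch for this feature, the setting in alluxio-site.properties would be:

```properties
alluxio.user.replica.prefer.cached.replicas=true
```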
Passive cache for auto replica creation
When optimized I/O is enabled and, during the initial check, a client finds that the file it is trying to read is not cached or only partially cached on a preferred worker, the client will switch to a less preferred data source for the file instead of doing a cold read. There are a number of reasons why a preferred worker may not have fully cached the file while a less preferred one has:
the cache of the file on the preferred worker has been evicted to make room for more recently accessed files
the cluster is expanded and the preferred worker is newly added to the cluster
Since the client will not trigger a cold read on the preferred worker, that worker will not cache the file unless the user runs an explicit load job to load the file into it. Meanwhile, probing whether the preferred worker has cached the file adds a constant, additional I/O latency to every read request. To eliminate this added latency, the user can enable the passive cache feature, which automatically and asynchronously loads a file into the preferred worker on a cache miss. When the client accesses the same file later, the preferred worker will have fully cached the file, so the client will not have to switch to a less preferred worker, improving I/O latency.
To enable passive cache, add the following configuration:
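Assuming the relevant property is alluxio.user.file.passive.cache.enabled, which the metrics section below names as the switch for this feature, the setting in alluxio-site.properties would be:

```properties
alluxio.user.file.passive.cache.enabled=true
```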
Monitoring replicas via metrics
When alluxio.user.replica.prefer.cached.replicas is enabled, the metric alluxio_multi_replica_read_from_workers_bytes_total monitors data read patterns across clusters. This metric provides insight into read traffic distribution and the effectiveness of multi-replica caching under high availability configurations.
Type: counter
Component: client
Description: Tracks the total number of bytes read by a client from Alluxio workers when reading multi-replica files. This metric only takes effect when alluxio.user.replica.prefer.cached.replicas is set to true. It helps analyze how often the client reads from local versus remote clusters and whether the reads come from fully cached replicas.
Labels:
cluster_name: The name of the cluster where the worker serving the read resides.
local_cluster: true if the worker belongs to the same cluster as the client, false otherwise.
hot_read: true if the worker has fully cached the file being read, false if the worker has not cached or has only partially cached the replica.
When passive cache is enabled, the metric alluxio_passive_cache_async_loaded_files monitors the number of files that are loaded by the passive cache feature.
Type: counter
Component: worker
Description: Tracks the total number of files that are loaded by a worker as triggered by passive cache. This metric only takes effect when alluxio.user.file.passive.cache.enabled is set to true. It helps analyze how many files encountered a cache miss on a preferred worker and therefore needed to be loaded by passive cache.
Labels:
result: the result of loading the file, one of "submitted", "success", and "failure". When a file is submitted for async loading, the metric labeled "submitted" is incremented. When the load completes successfully or unsuccessfully, the metric labeled "success" or "failure" is incremented, respectively.
Limitations
Consistency issue for mutable data
Multiple replicas work best with immutable data since there won't be inconsistency between the replicas. If the data in the UFS is mutable and can change over time, replicas loaded from UFS at different times may be different versions of the file. Thus, inconsistency may arise if two clients choose two different replicas of the same file.
To avoid potential inconsistency when the files are mutable, you can manually refresh the file metadata of the replicas and invalidate the existing cached replicas.
This command will launch a distributed job to load the metadata of the file s3://bucket/file.
If there are any existing replicas of the file and Alluxio detects that the file has changed in the UFS, the existing replicas will be invalidated. When the file is subsequently accessed, Alluxio will load the latest data from the UFS.
Alternatively, a maxAge policy can be used to bound how stale a replica can become. For example, if a file is expected to be updated once every 10 minutes on average, then a 10-minute maxAge policy can be used:
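The exact syntax for configuring a maxAge policy is not shown in this page; as a purely hypothetical illustration of the idea, such a policy might be expressed as a fragment like:

```json
{
  "path": "/s3/vip_data",
  "maxAge": "10min"
}
```

Consult the Alluxio documentation for the actual schema and the mechanism that applies it.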
When a replica has a max age of 10 minutes, it will be invalidated and reloaded from the UFS after 10 minutes. Every time an update is made to the file in the UFS, wait at least 10 minutes so that any existing replicas will be invalidated.
Level of replication is not actively maintained
The target replication level specified by the replica rules serves only as an input to the replica selection algorithm. The actual number of replicas present in a cluster at any given moment can be less than, equal to, or greater than the configured value. Alluxio does not actively monitor the number of replicas or try to maintain the target replication level.
The actual number of replicas can fall below the target replication level for a number of reasons. For example, if a file was cached on several workers but was later evicted from some of them because of insufficient cache capacity, its actual replication level will drop below the target. In another case, when replicas are created passively as a result of cache misses, a client that finds the file already cached by a worker will be served from that cache directly, and no new replica is created. The file's replication level will remain unchanged until a client reaches a worker that has not cached the file.