Multiple Replicas

Overview

A copy of a file's data and metadata on an Alluxio worker node is called a replica. File replication allows multiple workers to serve the same file to clients by creating replicas of the file on different workers. File replication is therefore used to increase I/O performance when a dataset is cached and accessed concurrently by a large number of clients.

If file replication is not enabled, a file will only have one replica in the cluster at a time. This means all access from clients to the file's metadata and data will be served by the single worker where the replica is stored. In cases where this file needs to be accessed by a large number of clients concurrently, the service capacity of that worker may become a bottleneck.

Note that for a given file, a worker will hold at most 1 replica of the file; a worker never stores multiple replicas of the same file.

Configuring the replication level

Before replicas are created, you first need to configure the replication level of files, i.e. how many copies of a file exist in the cluster. The replication level can be managed via a RESTful API endpoint on the coordinator:

http://<coordinator_host>:<coordinator_api_port>/api/v1/replica-rule

Listing the current replica rules

To list all the current replica rules in effect, send a GET request to the endpoint:

$ curl -G http://<coordinator_host>:<coordinator_api_port>/api/v1/replica-rule

The response is a JSON object containing all the replica rules:

{
  "/": {
    "pathPrefix": "/",
    "clusterReplicaMap": {
      "DefaultAlluxioCluster": 1
    },
    "totalReplicas": 1,
    "type":"UNIFORM"
  }
}

This example shows a single replica rule set on the root path, under which all files have 1 replica.

Creating a new replica rule

To add a new replica rule, send a POST request with the following command:

$ curl -X POST -H 'Content-Type: application/json' -d '{"path": "/", "numReplicasPerCluster": 2}' http://<coordinator_host>:<coordinator_api_port>/api/v1/replica-rule

The POST data of the request is a JSON object:

{
  "path": "/",
  "numReplicasPerCluster": 2
}

The path field specifies the path prefix for which the replica rule takes effect. The numReplicasPerCluster field specifies the level of replication for files under this path prefix.

Multiple replica rules can be specified for different path prefixes. For example, the following commands set files under /s3/vip_data to have 3 replicas and files under /s3/temp_data to have 1 replica:

$ curl -X POST -H 'Content-Type: application/json' -d '{"path": "/s3/vip_data", "numReplicasPerCluster": 3}' http://<coordinator_host>:<coordinator_api_port>/api/v1/replica-rule
$ curl -X POST -H 'Content-Type: application/json' -d '{"path": "/s3/temp_data", "numReplicasPerCluster": 1}' http://<coordinator_host>:<coordinator_api_port>/api/v1/replica-rule

Note that currently if multiple replica rules are created, the paths of the rules must not be nested. For example, a rule set on /parent and a rule set on /parent/child cannot exist at the same time. Consequently, a rule set on the root path / cannot co-exist with rules set on any other paths. This limitation will be lifted in a future release.
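For instance, given the rules above, attempting to create a rule on a nested path such as /s3/vip_data/model (a hypothetical path used only for illustration) is expected to be rejected by the coordinator until this limitation is lifted:

$ curl -X POST -H 'Content-Type: application/json' -d '{"path": "/s3/vip_data/model", "numReplicasPerCluster": 2}' http://<coordinator_host>:<coordinator_api_port>/api/v1/replica-rule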

Updating an existing replica rule

Updating an existing rule can be done in the same way as creating a new rule, but on the path of an existing rule:

$ curl -X POST -H 'Content-Type: application/json' -d '{"path": "/s3/vip_data", "numReplicasPerCluster": 1}' http://<coordinator_host>:<coordinator_api_port>/api/v1/replica-rule

The above example updates the number of replicas on /s3/vip_data from 3 to 1.
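To confirm the change, the rules can be listed again with a GET request; a response along the following lines is expected, with other rules omitted here for brevity (the exact fields may vary):

$ curl -G http://<coordinator_host>:<coordinator_api_port>/api/v1/replica-rule
{
  "/s3/vip_data": {
    "pathPrefix": "/s3/vip_data",
    "clusterReplicaMap": {
      "DefaultAlluxioCluster": 1
    },
    "totalReplicas": 1,
    "type": "UNIFORM"
  }
}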

Removing a replica rule

To remove a replica rule, send a DELETE request to the endpoint, with the path prefix of the rule to delete as the request body:

$ curl -X DELETE -H 'Content-Type: application/json' -d '{"path": "/s3/vip_data"}' http://<coordinator_host>:<coordinator_api_port>/api/v1/replica-rule

Default replication level

If no replica rules have been created, the default behavior is as if a rule of 1 replica is set on the root path:

{
  "path": "/",
  "numReplicasPerCluster": 1
}

As a result, all files will have only 1 replica across the Alluxio cluster.

The deprecated configuration property alluxio.user.file.replication.min can be used to change the default replication level, e.g. to 2:

alluxio.user.file.replication.min=2

We recommend setting an explicit rule on the root path rather than relying on this deprecated property.
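For example, the equivalent explicit rule on the root path can be created via the same endpoint as above:

$ curl -X POST -H 'Content-Type: application/json' -d '{"path": "/", "numReplicasPerCluster": 2}' http://<coordinator_host>:<coordinator_api_port>/api/v1/replica-rule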

Creating Replicas

Replicas can be created either passively as a result of a cache miss, or manually via a distributed load.

Creating replicas manually

Alluxio offers the ability to load files into Alluxio's cache via a distributed load job:

bin/alluxio job load --path s3://bucket/file --submit

If a rule is set on /s3 with the number of replicas set to 3, then when the load finishes, the file located at /s3/file is loaded into Alluxio with 3 replicas, which will reside on 3 different workers. At the conclusion of a load job with multiple replicas, the job status will report as successful even if some replicas did not load successfully. Any failed replicas will be loaded upon request; see creating replicas passively below.
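The status of a submitted load can then be checked via the job's progress option (assuming the --progress flag of the load job is available in this release):

bin/alluxio job load --path s3://bucket/file --progress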

Creating replicas passively

When a client reads a file from Alluxio, the file is loaded into Alluxio if it is not cached by any worker. If file replication is enabled, different clients will try to load the file from different workers, resulting in multiple replicas being created passively. Note that the number of replicas created passively depends on how many clients load the file. Since one client will only read the file from one worker at a time, if the file is requested by only a single client, only one replica will be created on one worker. If the number of clients is large enough, the number of replicas created passively will likely reach the replication level specified in the replica rules. This differs from replicas created manually via a distributed load, which actively loads as many replicas as specified in the replica rule.
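As a sketch, suppose the cluster is mounted via FUSE at /mnt/alluxio (a hypothetical mount point) and the file has a replication level of 3. Running the same read from many client machines concurrently can passively create replicas up to that level:

# run concurrently on many client machines; each client may load the
# file from a different worker, passively creating replicas
cat /mnt/alluxio/s3/file > /dev/null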

Reading from Multiple Replicas Concurrently

A client will be able to read from multiple replicas if the file has more than 1 replica according to the replica rule in effect. For a multi-replicated file, which replica a client chooses is controlled by the replica selection policy.

A replica selection policy determines which worker the client should reach to read the file, given a list of all candidate workers. Alluxio offers three replica selection policies:

  • Global fixed: all clients choose workers by the same order as listed in the candidate workers list.

  • Client fixed: each client sees the same set of candidate workers, but the order of the workers may be different. Therefore, different clients will choose different workers for reading the file.

  • Random: each client will choose a random replica from the candidate workers every time the file is read.

To set the replica selection policy, set the configuration alluxio.user.replica.selection.policy to one of GLOBAL_FIXED, CLIENT_FIXED, or RANDOM.
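For example, the policy can be set in conf/alluxio-site.properties on the client side (exactly where client-side properties are placed depends on your deployment):

alluxio.user.replica.selection.policy=CLIENT_FIXED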

If the worker chosen by the replica selection policy is unavailable due to temporary failures, the client will use the next worker in the candidate list until it exhausts all options and fails with an error.

Note that, by default, the client does not check if a worker actually has cached the file before trying to read from that worker. If the worker selected has not cached the file, the worker will fetch and cache the file from UFS before sending it to the client. The client won't switch to another replica worker unless the currently selected worker is unavailable. To improve I/O performance by using already cached replicas, see the next section.

Optimized I/O for multi-replicated files

If a file is replicated on multiple workers, Alluxio supports optimized I/O for faster access by prioritizing workers that already have fully cached the data. When an Alluxio client reads a multi-replicated file, it will first check if the file is cached by any of the replica workers selected by the replica selection policy. If so, the client will try to read from the first worker that has fully cached the file. If none of the workers have fully cached the file, the client will do a cold read using the workers in the same way as described in the previous section.

Note that a worker that has partially cached a file is not prioritized over a worker that has not cached the file at all.

To enable optimized I/O for multi-replicated files, set the following configuration:

alluxio.user.replica.prefer.cached.replicas=true

Passive cache for auto replica creation

When optimized I/O is enabled and, during the initial check, a client finds that the file it is trying to read is not cached or only partially cached on a preferred worker, the client will switch to a less preferred data source for the file instead of doing a cold read. There are a number of reasons why a preferred worker may not have fully cached the file while a less preferred one has:

  • the cache of the file on the preferred worker has been evicted to make room for more recently accessed files

  • the cluster is expanded and the preferred worker is newly added to the cluster

Since the client will not trigger a cold read on the preferred worker, that worker will not cache the file unless the user runs an explicit load job to load the file into it. Probing whether the preferred worker has cached the file also adds a constant, additional I/O latency to every read request. To eliminate this added latency, the user can enable the passive cache feature, which automatically and asynchronously loads a file on a cache miss. When the client accesses the same file later, the preferred worker will have fully cached the file, so the client will not have to switch to a less preferred worker, improving I/O latency.

To enable passive cache, add the following configuration:

# multi-replica optimized IO must be enabled for passive cache to work
alluxio.user.replica.prefer.cached.replicas=true
alluxio.user.file.passive.cache.enabled=true

Monitoring replicas via metrics

When alluxio.user.replica.prefer.cached.replicas is enabled, the metric alluxio_multi_replica_read_from_workers_bytes_total monitors data read patterns across clusters. This metric provides insight into read traffic distribution and the effectiveness of multi-replica caching under high availability configurations.

  • Type: counter

  • Component: client

  • Description: Tracks the total number of bytes read by a client from Alluxio workers when reading multi-replica files. This metric only takes effect when alluxio.user.replica.prefer.cached.replicas is set to true. It helps analyze how often the client reads from local versus remote clusters and whether the reads come from fully cached replicas.

  • Labels:

    • cluster_name: The name of the cluster where the worker serving the read resides.

    • local_cluster: true if the worker belongs to the same cluster as the client, false otherwise.

    • hot_read: true if the worker has fully cached the file being read, false if the worker has not cached or has only partially cached the replica.
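One quick way to inspect this counter is to grep it from a Prometheus-format metrics endpoint; the host, port, path, and sample values below are deployment-specific assumptions shown only for illustration:

$ curl -s http://<client_host>:<metrics_port>/metrics | grep alluxio_multi_replica_read_from_workers_bytes_total
alluxio_multi_replica_read_from_workers_bytes_total{cluster_name="DefaultAlluxioCluster",local_cluster="true",hot_read="true"} 1.073741824e+09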

When passive cache is enabled, the metric alluxio_passive_cache_async_loaded_files monitors the number of files that are loaded by the passive cache feature.

  • Type: counter

  • Component: worker

  • Description: Tracks the total number of files loaded by a worker as triggered by passive cache. This metric only takes effect when alluxio.user.file.passive.cache.enabled is set to true. It helps analyze how many files encountered a cache miss on a preferred worker and therefore needed to be loaded by passive cache.

  • Labels:

    • result: the result of loading the file, one of "submitted", "success", and "failure". When a file is submitted for async loading, the metric labelled "submitted" is incremented. When the load completes successfully or unsuccessfully, the metric labelled "success" or "failure" is incremented, respectively.
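As with the client metric above, this counter can be inspected on a worker's Prometheus-format metrics endpoint; the host, port, path, and sample values are again illustrative assumptions:

$ curl -s http://<worker_host>:<metrics_port>/metrics | grep alluxio_passive_cache_async_loaded_files
alluxio_passive_cache_async_loaded_files{result="submitted"} 42
alluxio_passive_cache_async_loaded_files{result="success"} 40
alluxio_passive_cache_async_loaded_files{result="failure"} 2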

Limitations

Consistency issue for mutable data

Multiple replicas work best with immutable data since there won't be inconsistency between the replicas. If the data in the UFS is mutable and can change over time, replicas loaded from UFS at different times may be different versions of the file. Thus, inconsistency may arise if two clients choose two different replicas of the same file.

To avoid potential inconsistency when the files are mutable, you can:

  • Manually refresh the file metadata of the replicas and invalidate the existing cached replicas. This can be done with the load job with the --metadataOnly option:

bin/alluxio job load --path s3://bucket/file --metadataOnly --submit

This command will launch a distributed job to load the metadata of the file s3://bucket/file. If there are any existing replicas of the file and Alluxio detects that the file has changed in the UFS, the existing replicas will be invalidated. When the file is subsequently accessed, Alluxio will load the latest data from the UFS.

  • Set a cache filter with an appropriate cache control policy for the files. For example, if a file is expected to be updated once every 10 minutes on average, a 10-minute maxAge policy can be used:

{
  "apiVersion": 1,
  "data": {
    "defaultType": "maxAge",
    "defaultMaxAge": "10min"
  },
  "metadata": {
    "defaultType": "maxAge",
    "defaultMaxAge": "10min"
  }
}

When a replica has a max age of 10 minutes, it will be invalidated and reloaded from the UFS after 10 minutes. Every time an update is made to the file in the UFS, wait at least 10 minutes so that any existing replicas will be invalidated.

Level of replication is not actively maintained

The target replication level specified by the replica rules serves only as an input argument to the replication selection algorithm. The actual number of replicas present in a cluster at any given moment can be less than, equal to, or greater than this value. Alluxio does not actively monitor the number of replicas, nor does it try to maintain the target replication level.

The number of actual replicas can be less than the target replication level for a number of reasons. For example, if a file was cached on some workers but later evicted from some of them because of insufficient cache capacity, its actual replication level will be less than the target. Another case is passive replica creation: if a client finds that a file is already cached by a worker, the file will be served from that cache directly and no new replica is created. The file's replication level will remain unchanged until a client reaches a worker that has not cached the file.

The number of actual replicas can be greater than the target when the worker membership changes. Workers joining or exiting the cluster may alter the result of the replication selection algorithm, such that a worker containing a replica ends up never being selected, leaving an inaccessible replica. This replica will linger in the worker's cache until it is evicted by the worker due to insufficient cache capacity or removed by a manually triggered clear stale cache operation.
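If the actual replication level has fallen below the target, one way to restore it is to re-run the distributed load, which actively creates as many replicas as the effective replica rule specifies (reusing the example path from earlier):

bin/alluxio job load --path s3://bucket/file --submit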
