S3 Write Cache

Background

As cloud-native architectures become ubiquitous, Amazon S3 and S3-compatible protocols have become the de facto standard for durable data persistence. However, standard object storage services are typically optimized for cold data archiving or general-purpose access. In high-throughput compute scenarios, it is often difficult to achieve both high read and write performance.

To address these challenges, we built Alluxio S3 Write Cache (referred to as S3 Write Cache below). Deployed on the compute side and backed by high-performance NVMe/SSD media, it provides an S3-compatible data layer with ultra-low latency and high throughput, along with intelligent data lifecycle management.

Core Values and Use Cases

  • Low-latency writes: By intercepting client PUT requests, data is written directly to NVMe/SSD storage managed by Alluxio Workers, without waiting for uploads to remote object storage. This minimizes the impact of network variability and backend throttling, allowing the system to absorb bursty write traffic with millisecond-level write acknowledgments.

  • Fast read-after-write: For hot data retained in the cache, client GET requests are served directly by the Worker. Read throughput can approach the available network bandwidth, completely avoiding the high latency of reading back from remote object storage.

  • Reliable asynchronous persistence: Data written to the cache is not confined locally. The system asynchronously persists data to the configured Under File System (UFS), such as Amazon S3 or HDFS, in the background. This keeps the write path low-latency while ensuring eventual durability and data safety.

  • Automated space management: To sustain high write performance, the system implements an eviction mechanism tightly coupled with persistence state. Once data has been safely persisted to the UFS, cold data can be automatically evicted to free NVMe/SSD capacity for new hot data, without manual operational intervention.

  • Transparent unified access: Whether data currently resides in the Worker-side cache or has been evicted to the UFS, clients access it through the same S3 endpoint. If requested data has been evicted, Alluxio fetches it from the UFS and re-caches it on demand. This process is fully transparent to applications and requires no code changes.

This component is particularly suitable for read-write-mixed workloads that are sensitive to write latency, including:

  • AI/LLM training checkpoint and resume: Rapidly write large model checkpoints and restore them quickly when training jobs fail or restart.

  • Big data ETL pipelines: Buffer intermediate outputs (such as stage results or shuffle data) so downstream tasks can read with minimal waiting.

  • Hybrid cloud and disaggregated storage architectures: Build a high-performance buffer layer between compute clusters and remote object storage to mask network latency.

Deployment & Usage

Deployment

The S3 Write Cache is integrated into the Alluxio Operator. Set the following additional configurations in alluxio-cluster.yaml.
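The original snippet is not reproduced here. As a minimal sketch, assuming the AlluxioCluster resource exposes a properties map (the field layout may differ across Operator versions; the property keys come from the Configuration Overview section below, and the FDB path shown is only an example):

```yaml
# Sketch only — the surrounding AlluxioCluster schema depends on your Operator version.
spec:
  properties:
    # Enable the S3 Write Cache feature
    alluxio.write.cache.enabled: "true"
    # Needed only when using an external FDB cluster; injected automatically
    # when FDB is deployed via the Operator. The path shown is an example.
    alluxio.foundationdb.cluster.file.path: "/etc/foundationdb/fdb.cluster"
```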

After deployment is complete, you can access Alluxio via the S3 protocol. Reference: S3 API.

Note: The default behavior of S3 Write Cache is WRITE_THROUGH (synchronous write). If you need to configure it as WRITE_BACK (asynchronous write) for higher performance, please refer to the Path Configuration section below.

Path Configuration

Once S3 Write Cache is enabled, Alluxio allows different paths to use different write policies, making it possible to tailor behavior to specific workload requirements.

| Policy | Description |
| --- | --- |
| WRITE_THROUGH | (Default) Synchronous write. Data is written to the cache and UFS at the same time, and the request succeeds only when both complete. Suitable for durability-first workloads, but write latency is bounded by UFS. |
| TRANSIENT | Temporary storage. Data is read and written only within the cache and does not interact with UFS. Suitable for performance-first, non-persistent data. |
| WRITE_BACK | Asynchronous write-back. Writes return success as soon as data lands in the cache, followed by background persistence to UFS. Suitable for low-latency writes with eventual durability. |
| READ_ONLY | Disallows all write operations. |
| NO_CACHE | Bypasses the cache entirely; reads and writes go directly to UFS. |

Choosing a write policy:

  1. Writes must be immediately recoverable and traceable: use WRITE_THROUGH

  2. Low-latency writes with eventual durability: use WRITE_BACK

  3. Purely temporary data (discardable or recomputable): use TRANSIENT


Multi-replica Write

Under the TRANSIENT and WRITE_BACK policies, Alluxio supports multi-replica writes. Increasing the number of data replicas within the cache layer significantly enhances data availability and fault tolerance before the data is successfully persisted to the Under File System (UFS).

In production environments, it is recommended to balance the replica count based on two key dimensions: Data Reliability and System Performance.

Data Reliability

  • WRITE_BACK Mode: Increasing replicas shortens the "risk window." Before asynchronous persistence to the UFS is complete, the multi-replica mechanism effectively mitigates the risk of data loss caused by a single Worker node failure.

  • TRANSIENT Mode: Since data in this mode is never persisted to the UFS, a higher write replica count is the only safeguard against data loss. It prevents the permanent loss of temporary data when a cache node fails, thereby improving the robustness of intermediate computation results.

Performance Trade-off

  • Write Overhead: A higher replica count consumes additional intra-cluster network bandwidth and occupies more NVMe/SSD storage space. Furthermore, because multiple replicas must be written synchronously, write latency will increase slightly as the replica count rises.

  • Read Benefits: More replicas translate to stronger concurrent read capabilities. For "write-once, read-many" hot data, distributing multiple replicas across different Workers enables effective load balancing, significantly boosting read throughput and eliminating single-point access bottlenecks.
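The document does not describe how replicas are assigned to Workers. As an illustration of the kind of deterministic placement that spreads replicas of an object across different Workers for load balancing, here is a rendezvous (highest-random-weight) hashing sketch; all names are hypothetical and this is not Alluxio's actual placement algorithm:

```python
import hashlib

def place_replicas(workers: list[str], key: str, n: int) -> list[str]:
    """Pick n distinct workers for a key via rendezvous (HRW) hashing.

    Illustrative only: scores every (worker, key) pair with a hash and
    takes the n highest, so each object lands on a stable, distinct set
    of workers without any central coordination.
    """
    scored = sorted(
        workers,
        key=lambda w: hashlib.sha256(f"{w}:{key}".encode()).hexdigest(),
        reverse=True,
    )
    return scored[:n]

workers = ["worker-0", "worker-1", "worker-2", "worker-3"]
replicas = place_replicas(workers, "s3://bucket/foo/bar", n=2)
assert len(set(replicas)) == 2  # two distinct workers per object
```

Because the placement is a pure function of the key, any client can recompute where the replicas live, which is what enables the concurrent-read load balancing described above.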

Configuration Management Command

Use the following command to edit path configurations online (similar to kubectl edit):

After the editor opens, you can input the JSON configuration. For example:

  1. Set the global default to WRITE_THROUGH.

  2. Enable WRITE_BACK for the path /foo/bar/write_back and all its subdirectories, setting writeReplicas to 1 (single-replica write).

  3. Enable TRANSIENT for the path /foo/bar/transient and all its subdirectories, setting writeReplicas to 2.
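The JSON entered in the editor is not reproduced on this page. A hypothetical sketch matching steps 1–3 above might look like the following; writeReplicas is the key documented above, while the other key names are assumptions for illustration:

```json
{
  "/": { "writePolicy": "WRITE_THROUGH" },
  "/foo/bar/write_back": { "writePolicy": "WRITE_BACK", "writeReplicas": 1 },
  "/foo/bar/transient": { "writePolicy": "TRANSIENT", "writeReplicas": 2 }
}
```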

Verifying Configuration

After the edit takes effect, use the following command to test the policy hit by a specific path:

The output is similar to the following:

Path Configuration Management REST API

In addition, Alluxio provides a REST API for managing path configurations, which is more suitable for code-level integration with applications. Below is an example of using the REST API to set a path configuration:

Limitations and Dependencies

This section describes the key design constraints of S3 Write Cache related to consistency, reliability, and system behavior. These constraints primarily exist to ensure data safety when using the WRITE_BACK policy.

Dedicated Resource Overhead

S3 Write Cache is a high-performance component that requires dedicated compute resources (CPU and memory) and high-performance storage media (NVMe SSD).

  • For existing Alluxio users: Enabling S3 Write Cache does not significantly increase operational complexity, as it can reuse the existing Alluxio Operator and management framework. However, disk capacity and network bandwidth should be evaluated to ensure they can sustain the expected write and persistence workloads.

  • For new users: Deploying S3 Write Cache introduces an additional storage service layer between applications and the underlying storage system. Proper capacity planning is required for both cache storage and persistence bandwidth.

Metadata Service Dependency

To achieve high-throughput metadata operations with strong consistency guarantees, S3 Write Cache requires FoundationDB (FDB) as its metadata storage engine.

Before deployment, ensure that a functional FDB cluster is available, either:

  • deployed together with Alluxio via the Operator, or

  • provided as an external FDB cluster configured for Alluxio.

The stability and performance of FDB are on the critical path for metadata operations in the cache layer.

Eviction Policy and Persistence Dependency

The data lifecycle and eviction behavior of S3 Write Cache are strictly coupled to the persistence state of data:

  • No persistence, no eviction: To ensure zero data loss, data will never be selected for eviction before it has been successfully persisted to the UFS.

  • LRU after persistence: Only data that has been confirmed as safely stored in the UFS becomes eligible for eviction and follows an LRU (Least Recently Used) policy when cache space is needed.

  • No async persistence, permanent residency: If asynchronous persistence is not configured, all data written to the cache is treated as permanently resident and will not be automatically evicted. In this mode, the Alluxio management command alluxio job free cannot reclaim this space; capacity can only be released by explicitly deleting data.

These constraints are fundamental to the reliability guarantees of the WRITE_BACK policy and should be carefully considered when planning cache capacity and persistence throughput.
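The three rules above can be captured in a toy model (this is a sketch for intuition, not Alluxio's eviction code): eviction proceeds in LRU order, but only pages marked as persisted are eligible, and when every cached page is still dirty the write fails with an out-of-space error.

```python
from collections import OrderedDict

class WriteCacheModel:
    """Toy model of the eviction rules described above (not Alluxio code):
    entries are evicted in LRU order, but only once marked persisted."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.pages = OrderedDict()  # key -> persisted flag; insertion order = LRU

    def write(self, key: str) -> None:
        if key in self.pages:
            self.pages.move_to_end(key)
            return
        while len(self.pages) >= self.capacity:
            # "No persistence, no eviction": only persisted entries are victims.
            victim = next((k for k, persisted in self.pages.items() if persisted), None)
            if victim is None:
                raise RuntimeError("out of space: all cached data is unpersisted")
            del self.pages[victim]
        self.pages[key] = False  # dirty until asynchronously persisted

    def read(self, key: str) -> None:
        self.pages.move_to_end(key)  # refresh LRU position

    def mark_persisted(self, key: str) -> None:
        self.pages[key] = True
```

For example, with a capacity of two, writing a third object succeeds only if at least one cached object has already been persisted; otherwise the model raises the out-of-space error, mirroring the behavior described in Risk of Cache Space Exhaustion below.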

Operations and Tuning

This section describes operational considerations and tuning knobs for running S3 Write Cache in production, with a focus on reliability, capacity management, and performance predictability.

Configuration Overview

The default configuration of S3 Write Cache is suitable for most workloads. For advanced tuning or capacity planning, the following configuration options are commonly adjusted:

| Property | Default | Description |
| --- | --- | --- |
| alluxio.write.cache.enabled | false | Enables the S3 Write Cache feature. |
| alluxio.foundationdb.cluster.file.path | – | Path to the FDB cluster configuration file. Automatically injected when FDB is deployed via the Operator; must be set manually when using an external FDB cluster. |
| alluxio.write.cache.async.check.orphan.timeout | 1 hour | Timeout for orphan files. If a write has not been committed within this period, it is treated as abandoned data and cleaned up. |
| alluxio.write.cache.async.file.check.period | 10 min | Scan interval for orphan file detection. Shorter intervals increase FDB load. |
| alluxio.write.cache.async.persist.thread.pool.size | 16 | Concurrency of asynchronous persistence threads per Worker. Effective only for WRITE_BACK. |

Asynchronous Persistence Retry Policy

For paths configured with WRITE_BACK, Alluxio retries uploads to the UFS when failures occur, using an exponential backoff strategy.

  • Retry intervals grow exponentially with each failure (e.g., 1s → 2s → 4s), up to a configured maximum.

  • Retries are executed in background persistence threads and do not block front-end write acknowledgments.

This mechanism prevents excessive retry traffic during UFS outages or congestion, protecting backend storage systems from overload.
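Assuming a doubling factor, which the 1s → 2s → 4s example above suggests, the retry schedule can be sketched as a simple calculation using the documented defaults (1 s initial interval, 1 h maximum):

```python
def backoff_intervals(initial: float, maximum: float, attempts: int) -> list[float]:
    """Exponential backoff sketch: each retry interval doubles (as in the
    1s -> 2s -> 4s example above) until it is capped at the configured
    maximum. All values are in seconds."""
    return [min(initial * 2 ** i, maximum) for i in range(attempts)]

# Documented defaults: 1 s initial interval, 1 h (3600 s) maximum interval
intervals = backoff_intervals(1.0, 3600.0, 14)
print(intervals)  # 1.0, 2.0, 4.0, ... then capped at 3600.0
```

The cap is what protects a recovering UFS: no matter how long an outage lasts, retries settle at one attempt per maximum interval instead of growing without bound.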

| Property | Default | Description |
| --- | --- | --- |
| alluxio.worker.write.cache.async.persist.retry.initial.interval | 1s | Initial retry interval. |
| alluxio.worker.write.cache.async.persist.retry.max.interval | 1h | Maximum retry interval. |

Risk of Cache Space Exhaustion

Unpersisted data is protected from eviction. As a result, if asynchronous persistence is disabled or if persistence bandwidth falls behind incoming write traffic, dirty data will continue to accumulate in the cache.

Once physical storage capacity is exhausted, Alluxio will return an out-of-space error and reject subsequent write requests.

Recommendations:

  • Enable asynchronous persistence for all WRITE_BACK paths.

  • Ensure sufficient persistence throughput relative to write traffic.

  • Allocate adequate cache capacity and monitor pinned space usage.
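A back-of-envelope estimate helps with the second and third recommendations. If sustained write traffic exceeds persistence throughput, dirty data accumulates at the difference of the two rates; the numbers below are hypothetical, for illustration only:

```python
def seconds_until_full(free_gb: float, write_gbs: float, persist_gbs: float) -> float:
    """Back-of-envelope estimate: if sustained writes outpace persistence,
    dirty (pinned) data accumulates at the difference of the two rates
    and eventually exhausts the remaining cache capacity."""
    backlog_rate = write_gbs - persist_gbs
    if backlog_rate <= 0:
        return float("inf")  # persistence keeps up; no accumulation
    return free_gb / backlog_rate

# Hypothetical numbers: 2000 GB free, 4 GB/s writes, 3 GB/s persistence
print(seconds_until_full(2000, 4.0, 3.0) / 3600)  # roughly half an hour to full
```

If the estimate comes out uncomfortably short for your expected burst duration, either persistence bandwidth or cache capacity needs to grow.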

Read and Write Cache Space Division

When S3 Write Cache is enabled, the physical cache space managed by each Worker is logically divided into two regions:

  • Write cache (pinned space): Contains data that is actively being written or waiting to be persisted (dirty data). This space is not evictable.

  • Read cache (evictable space): Contains data loaded from the UFS or data that has already been successfully persisted. This space follows an LRU policy and can be evicted automatically.

Both regions share the same Worker-managed physical storage. By default, pinned space is limited to 30% of total cache capacity. This limit can be adjusted using the following configuration:

| Property | Default | Description |
| --- | --- | --- |
| alluxio.worker.page.store.pinned.file.capacity.limit.ratio | 0.3 | Maximum fraction (0.0–1.0) of cache capacity that may be occupied by pinned (non-evictable) data. For example, 0.3 caps pinned data at 30% of total capacity, leaving 70% for the read cache. |
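The effect of this ratio can be sketched as a simple admission check (a toy model, not Alluxio's actual admission code): a new dirty write is accepted only while it keeps pinned data within the configured fraction of total capacity.

```python
def admit_write(pinned_bytes: int, incoming_bytes: int,
                total_capacity: int, pinned_ratio: float = 0.3) -> bool:
    """Toy model of the pinned-space limit described above: a new (dirty,
    pinned) write is admitted only if pinned data stays within the
    configured fraction of total cache capacity."""
    return pinned_bytes + incoming_bytes <= pinned_ratio * total_capacity

# With a 1000 GB cache and the default ratio of 0.3, pinned data may grow to 300 GB
assert admit_write(250, 40, 1000) is True    # 290 GB <= 300 GB
assert admit_write(250, 60, 1000) is False   # 310 GB > 300 GB
```

Raising the ratio trades read-cache headroom for more dirty-data buffering, so it should move together with the persistence-throughput planning discussed above.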

Performance Reference

This section provides reference performance ranges for S3 Write Cache (WRITE_BACK) under representative workloads. The purpose is to help users build intuition for performance expectations and capacity planning. Actual results may vary depending on hardware configuration, network conditions, object size, and concurrency.

Test Environment Overview

  • Client: AWS c5n.metal (100 Gbps network)

  • Worker: AWS i3en.metal (local NVMe SSD)

  • Tools: Warp / COSBench

  • Object sizes: 10 KB / 1 MB / 10 MB

  • Concurrency: 1 → 256

Small Object Write Latency (10 KB PUT)

For metadata-heavy and high-frequency small object workloads, S3 Write Cache significantly reduces write latency:

  • WRITE_BACK latency:

    • Low concurrency: 3–5 ms

    • Medium concurrency: 4–9 ms

  • Direct S3 latency:

    • Typically 30–60 ms

Takeaway: At low to moderate concurrency, small-object write latency is reduced by ~10×, which is critical for latency-sensitive pipelines.

Large Object Write Throughput (10 MB PUT)

For large, sequential write workloads, S3 Write Cache efficiently utilizes local NVMe bandwidth:

  • Single Worker throughput:

    • Sustained 3–6 GB/s

  • Performance characteristics:

    • Write latency remains within tens of milliseconds

    • Throughput scales near-linearly with the number of Workers

Comparison note: Direct S3 throughput is often constrained by bucket partitioning and service-side throttling, leading to less predictable bandwidth and tail latency under load.

Read-After-Write Performance

For pipeline-style workloads where data is consumed immediately after being produced (e.g., AI training, ETL stage handoff):

  • GET latency:

    • S3 Write Cache: 3–7 ms

    • Direct S3: 90–130 ms

  • GET throughput:

    • Improvement of approximately 4–8×

Takeaway: S3 Write Cache effectively turns newly written data into instant hot data, eliminating remote read round trips and read amplification.

Asynchronous Persistence Characteristics (WRITE_BACK)

With WRITE_BACK, front-end writes are fully decoupled from backend persistence:

  • Additional write-path overhead: ~0 ms

  • Background persistence capacity:

    • ~2000 objects/s per Worker

  • System behavior:

    • Front-end write performance is isolated from UFS fluctuations

    • Persistence throughput can be tuned independently via concurrency and bandwidth

Summary

  • S3 Write Cache prioritizes low write latency and predictable throughput

  • Performance is primarily bounded by local disk and network capabilities

  • Horizontal scaling with multiple Workers provides stable aggregate bandwidth

  • The goal is not to replace the absolute throughput limits of object storage, but to mask latency and performance jitter behind a fast, compute-side cache layer
