S3-API Write Optimization

This guide shows how to enable Write Cache on top of the S3 API, buffering PUT requests in local NVMe cache and persisting to UFS asynchronously for millisecond-level write latency.

Architecture Overview

The S3 API supports two deployment modes:

| | Standard Mode (Read Cache) | Write Cache Mode |
| --- | --- | --- |
| Use case | Accelerate reads from remote object storage | Low-latency writes with async persistence |
| FoundationDB | Not required | Required |
| Write policies | WRITE_THROUGH | WRITE_THROUGH, WRITE_BACK, TRANSIENT |
| Deployment complexity | Low | Medium (requires FDB cluster and path-level policy configuration) |
| Typical workloads | AI model loading, data analytics, S3-based reads | Training checkpoints, ETL pipelines, hybrid-cloud write buffering |

If your workload is read-heavy with occasional writes, the standard read-cache mode is sufficient — see S3 API.

How Write Cache Works

Write Cache adds FoundationDB (FDB) to the standard S3 API deployment to provide strong consistency under concurrent writes. FDB is on the critical path for all metadata operations.

  • Write path: PUT requests and MPU uploads land in FDB (metadata) and then on the Worker's local NVMe (data). A background persistence thread uploads to UFS asynchronously.

  • Read path: GET requests query FDB to locate the owning Worker, then read from local NVMe. On a cache miss, the Worker fetches from UFS and caches locally.
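The two paths above can be pictured with a small in-memory model. This is only an illustrative sketch: `ToyCluster`, `Worker`, and every method name are hypothetical, plain dictionaries stand in for FDB, NVMe, and UFS, and nothing here reflects the real internals.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    nvme: dict = field(default_factory=dict)  # path -> bytes held in local cache


class ToyCluster:
    """Toy model: `metadata` plays the role of FDB, `ufs` the remote store."""

    def __init__(self, worker_names):
        self.workers = {n: Worker(n) for n in worker_names}
        self.metadata = {}       # path -> name of the owning worker
        self.ufs = {}            # path -> bytes already persisted
        self.persist_queue = []  # paths waiting for background upload

    def put(self, path, data, worker_name):
        # Write path: record metadata first, then store data on the worker.
        self.metadata[path] = worker_name
        self.workers[worker_name].nvme[path] = data
        self.persist_queue.append(path)

    def persist_pending(self):
        # Stand-in for the background persistence thread.
        while self.persist_queue:
            path = self.persist_queue.pop(0)
            owner = self.workers[self.metadata[path]]
            self.ufs[path] = owner.nvme[path]

    def get(self, path, fallback_worker):
        # Read path: look up the owner, then read from its local cache.
        owner_name = self.metadata.get(path, fallback_worker)
        worker = self.workers[owner_name]
        if path not in worker.nvme:
            # Cache miss: fetch from UFS (KeyError if absent) and keep a copy.
            worker.nvme[path] = self.ufs[path]
            self.metadata[path] = owner_name
        return worker.nvme[path]
```

In this model a `get` issued right after a `put` is answered from the worker's cache even though `persist_pending` has not run yet, which is the read-after-write behavior the guide relies on.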

Before You Start

Run these checks before starting. Skipping this step is the most common cause of deployment failures.

Deployment Steps

1. Install FDB CRDs

Enable the FDB Operator in your alluxio-operator.yaml before installing or upgrading:

If upgrading an existing operator installation, manually apply the FDB CRDs (Helm does not install CRDs on upgrade):

Verify the FDB operator is running:

✅ Success: FDB operator pod shows READY 1/1, STATUS = Running.

If the pod is not found, reinstall the operator with fdb-operator.enabled=true in alluxio-operator.yaml and re-run helm upgrade.

2. Enable Write Cache

Add the following to your alluxio-cluster.yaml, in addition to the base setup of S3 API:

Apply the configuration:

✅ Success: Helm/kubectl prints no errors and the cluster enters a reconciling state.

If you see unable to recognize "alluxio-cluster.yaml": no matches for kind "AlluxioCluster", the Alluxio Operator CRDs are not installed — reinstall the operator first.

3. Verify Deployment

Wait for workers to be ready (startup typically takes 2–3 minutes):

✅ Success: Output shows all worker pods reached Ready condition.

Confirm FDB pods are running:

✅ Success: cluster_controller, log, and storage pods all show Running.

If FDB pods are stuck in Pending, check PVC availability: kubectl get pvc -n <NAMESPACE>. FDB requires a StorageClass with dynamic provisioning.

Confirm write cache is active on the coordinator:

✅ Success: Returns true.

4. Test Write and Read-After-Write

✅ Success:

Read back immediately (served from local cache, not UFS):

✅ Success: diff produces no output (files are identical).

If the write returns NoSuchBucket, verify the mount is active with kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- alluxio mount list.

Write Policies

By default, Write Cache uses WRITE_THROUGH (synchronous write to both cache and UFS). For low-latency writes, switch specific paths to WRITE_BACK.

| Policy | Behavior | When to Use |
| --- | --- | --- |
| WRITE_THROUGH | (Default) Write to cache and UFS simultaneously. Succeeds only when both complete. | Durability-first; write latency is bounded by UFS |
| WRITE_BACK | Write to cache immediately, persist to UFS in background. | Low-latency writes with eventual durability |
| TRANSIENT | Cache only; never persisted to UFS. | Temporary/recomputable data (e.g., shuffle outputs) |
| READ_ONLY | Disallow all writes on this path. | Protect paths from accidental writes |
| NO_CACHE | Bypass cache; reads and writes go directly to UFS. | Paths that should not be cached |
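As a compact restatement of the policy table, the sketch below encodes where data lives at the moment a write is acknowledged. The enum values are the policy names from the table; `state_at_ack` and the use of `PermissionError` for READ_ONLY are illustrative assumptions, not the product's API or actual error type.

```python
from enum import Enum


class WritePolicy(Enum):
    WRITE_THROUGH = "WRITE_THROUGH"
    WRITE_BACK = "WRITE_BACK"
    TRANSIENT = "TRANSIENT"
    READ_ONLY = "READ_ONLY"
    NO_CACHE = "NO_CACHE"


def state_at_ack(policy):
    """Describe where the data lives when the client gets its write ack.

    Returns a dict of three booleans: in_cache, in_ufs, persists_later.
    """
    if policy is WritePolicy.READ_ONLY:
        # Hypothetical stand-in for however the service rejects the write.
        raise PermissionError("writes are disallowed on READ_ONLY paths")
    table = {
        WritePolicy.WRITE_THROUGH: dict(in_cache=True, in_ufs=True, persists_later=False),
        WritePolicy.WRITE_BACK: dict(in_cache=True, in_ufs=False, persists_later=True),
        WritePolicy.TRANSIENT: dict(in_cache=True, in_ufs=False, persists_later=False),
        WritePolicy.NO_CACHE: dict(in_cache=False, in_ufs=True, persists_later=False),
    }
    return table[policy]
```

Only WRITE_BACK and TRANSIENT leave data outside UFS at acknowledgment time, which is why the replica settings later in this guide apply to those two policies.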

Path-Level Configuration

Write policies are configured per path, allowing different policies for different workloads within the same cluster.
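To make per-path resolution concrete, here is a minimal sketch that picks a policy for a path. It assumes the most specific (longest) matching prefix wins and that unmatched paths fall back to a global default; the guide does not state the actual matching rule, and `resolve_policy` plus the example paths are hypothetical.

```python
def resolve_policy(path, rules, default="WRITE_THROUGH"):
    """Return the policy of the longest rule prefix that contains `path`.

    `rules` maps a directory prefix to a policy name. Longest-prefix
    matching is an assumption made for this sketch.
    """
    candidate = path.rstrip("/") + "/"
    best_prefix, best_policy = None, default
    for prefix, policy in rules.items():
        normalized = prefix.rstrip("/") + "/"
        # Trailing slashes stop "/data/ckpt" from matching "/data/ckpt2".
        if candidate.startswith(normalized):
            if best_prefix is None or len(normalized) > len(best_prefix):
                best_prefix, best_policy = normalized, policy
    return best_policy


# Hypothetical rule set: WRITE_BACK for checkpoints, TRANSIENT for shuffle.
EXAMPLE_RULES = {
    "/data/checkpoints": "WRITE_BACK",
    "/data/shuffle": "TRANSIENT",
}
```

For instance, `resolve_policy("/data/checkpoints/run1/model.pt", EXAMPLE_RULES)` yields the checkpoint rule, while a path outside both prefixes falls back to the default.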

Edit the policy configuration interactively (run inside the coordinator pod):

Example configuration — global default WRITE_THROUGH, with WRITE_BACK for checkpoint paths and TRANSIENT for shuffle:

Verify a path resolves to the expected policy:

✅ Success: Output includes "policyMode": "WRITE_BACK".

Via REST API (for programmatic integration):

Multi-Replica Write

For WRITE_BACK and TRANSIENT paths, setting writeReplicas > 1 keeps multiple copies of unpersisted data across different workers. This reduces the risk of data loss during the window before UFS persistence completes.

Trade-off: Higher replica count improves fault tolerance and read concurrency but increases intra-cluster network usage and write latency slightly.

Recommended settings:

  • WRITE_BACK: writeReplicas = 2 for production; 1 for maximum write throughput

  • TRANSIENT: writeReplicas = 2 or higher, since this data is never persisted to UFS
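A rough way to reason about these recommendations: if each worker holding a replica fails independently with some probability during the unpersisted window, the data is lost only when every replica is lost. The sketch below computes that under the independence assumption, which is a simplification (correlated failures such as a rack outage break it); `loss_probability` and the sample numbers are illustrative.

```python
def loss_probability(worker_failure_prob, write_replicas):
    """Probability that all replicas are lost, assuming independent failures."""
    if not 0.0 <= worker_failure_prob <= 1.0:
        raise ValueError("worker_failure_prob must be between 0 and 1")
    if write_replicas < 1:
        raise ValueError("write_replicas must be at least 1")
    return worker_failure_prob ** write_replicas


# Hypothetical 1% chance of losing a given worker before persistence completes.
single = loss_probability(0.01, 1)
double = loss_probability(0.01, 2)
```

Under these assumed numbers, going from one replica to two cuts the loss probability by a factor of 100, at the cost of the extra intra-cluster traffic noted above.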

Operations & Tuning

Key Configuration

| Property | Default | Description |
| --- | --- | --- |
| alluxio.write.cache.enabled | false | Enables Write Cache. |
| alluxio.foundationdb.cluster.file.path | ${alluxio.conf.dir}/fdb.cluster | Path to the FDB cluster file. Auto-injected when FDB is deployed via Operator; set manually for external FDB. |
| alluxio.write.cache.async.check.orphan.timeout | 1h | Uncommitted writes older than this threshold are treated as abandoned and cleaned up. |
| alluxio.write.cache.async.file.check.period | 10min | Scan interval for orphan detection. Shorter intervals increase FDB load. |
| alluxio.worker.page.store.pinned.file.capacity.limit.ratio | 0.3 | Maximum fraction of cache capacity for unpersisted (pinned) data. The remaining capacity is available for read cache (LRU-evictable). |
| alluxio.worker.mark.writing.files.duration | 10min | If a file is open for write but receives no new data for this duration, the worker treats it as a dangling write eligible for cleanup. Timer resets on every write. |
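The orphan timeout and the scan period interact: a write only gets cleaned up when a scan runs after it has crossed the timeout. The sketch below works out the worst case, assuming scans run at a fixed period and each scan removes everything older than the timeout. That scheduling model is an assumption, and both function names are hypothetical.

```python
from datetime import timedelta

ORPHAN_TIMEOUT = timedelta(hours=1)   # alluxio.write.cache.async.check.orphan.timeout
CHECK_PERIOD = timedelta(minutes=10)  # alluxio.write.cache.async.file.check.period


def is_orphan(age, timeout=ORPHAN_TIMEOUT):
    """An uncommitted write counts as abandoned once it is older than the timeout."""
    return age > timeout


def worst_case_cleanup_delay(timeout=ORPHAN_TIMEOUT, period=CHECK_PERIOD):
    """Longest wait before cleanup: the write crosses the timeout just after
    a scan finished, so it is only seen one full period later."""
    return timeout + period
```

With the defaults, an abandoned write could linger for up to 70 minutes in this model; shortening the period tightens that bound at the cost of the extra FDB load mentioned in the table.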

Async Persistence Retry

For WRITE_BACK paths, failed UFS uploads are retried with exponential backoff. Retries run in background threads and do not block front-end write acknowledgments.

| Property | Default | Description |
| --- | --- | --- |
| alluxio.worker.write.cache.async.persist.retry.initial.interval | 1s | Initial retry wait. |
| alluxio.worker.write.cache.async.persist.retry.max.interval | 1h | Maximum retry wait (caps exponential growth). |
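To see how the two retry settings shape the wait times, the following sketch generates a capped exponential schedule from the defaults above. The growth factor of 2 is an assumption (the guide only says "exponential"), and `backoff_schedule` is a hypothetical helper, not part of any CLI or API.

```python
def backoff_schedule(initial_s=1.0, max_s=3600.0, factor=2.0, attempts=15):
    """Return the wait (in seconds) before each retry attempt.

    initial_s and max_s mirror the 1s and 1h defaults; factor=2.0 is an
    assumed multiplier for illustration.
    """
    waits = []
    wait = initial_s
    for _ in range(attempts):
        waits.append(min(wait, max_s))
        wait = min(wait * factor, max_s)  # grow, but never past the cap
    return waits


schedule = backoff_schedule()
first_capped_retry = schedule.index(3600.0) + 1  # 1-based retry number
```

Under the doubling assumption the wait reaches the 1h cap on the 13th retry, after roughly 68 minutes of cumulative waiting, so a long UFS outage can leave uploads idle for up to an hour after the UFS recovers.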

Monitoring Async Persistence (15.1.3+)

Two CLI commands let you inspect in-flight persist operations:

Use async-persist stat when alluxio fs ls shows a file stuck in NOT_PERSISTED to determine whether the issue is in the queue or the upload itself.

Cache Space Management

Worker cache space is divided into two logical regions:

  • Pinned space (write cache): unpersisted dirty data — not evictable. Capped at 30% of total capacity by default (alluxio.worker.page.store.pinned.file.capacity.limit.ratio).

  • Evictable space (read cache): persisted or UFS-loaded data, evicted in LRU order when space is needed.

If persistence throughput falls behind write traffic, pinned space fills up and Alluxio returns out-of-space errors. To prevent this:

  • Ensure alluxio.write.cache.async.persist.thread.pool.size is sufficient for your write rate

  • Monitor pinned space usage and adjust alluxio.worker.page.store.pinned.file.capacity.limit.ratio if needed

  • Allocate adequate NVMe capacity for both regions
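For capacity planning, a back-of-the-envelope estimate of how long pinned space lasts when writes outpace persistence can help. The sketch below is plain arithmetic under the assumption of constant rates; the function names and the sample figures (10 TB of cache, 3 GB/s writes, 2 GB/s persistence) are hypothetical and not measurements.

```python
def pinned_capacity_bytes(total_cache_bytes, ratio=0.3):
    """Space available for unpersisted data; 0.3 is the documented default ratio."""
    return total_cache_bytes * ratio


def seconds_until_full(pinned_bytes, write_bytes_per_s, persist_bytes_per_s):
    """Time until pinned space is exhausted, or None if persistence keeps up."""
    net_growth = write_bytes_per_s - persist_bytes_per_s
    if net_growth <= 0:
        return None  # backlog never grows, so pinned space does not fill
    return pinned_bytes / net_growth


# Hypothetical worker: 10 TB cache, 3 GB/s incoming writes, 2 GB/s persisted.
pinned = pinned_capacity_bytes(10 * 10**12)
eta = seconds_until_full(pinned, 3 * 10**9, 2 * 10**9)
```

With these assumed figures the pinned region would fill in about 50 minutes of sustained writing, which gives a sense of how large a burst the cache can absorb before writes start failing.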

Performance Reference

Reference numbers for WRITE_BACK on AWS c5n.metal clients + i3en.metal workers (100 Gbps network, NVMe SSD). Actual results vary by hardware, object size, and concurrency.

| Workload | Write Cache | Direct S3 |
| --- | --- | --- |
| Small object PUT (10 KB), low concurrency | 3–5 ms | 30–60 ms |
| Small object PUT (10 KB), medium concurrency | 4–9 ms | 30–60 ms |
| Large object PUT (10 MB), single worker | 3–6 GB/s sustained | Variable (throttled) |
| GET after write (read-after-write latency) | 3–7 ms | 90–130 ms |
| Async persistence throughput | ~2,000 objects/s per worker | |

Front-end write latency for WRITE_BACK is bounded by local NVMe, not UFS. Throughput scales near-linearly with additional workers.

Uninstall

To remove the Write Cache configuration and FDB resources (reverse order of setup):

1. Delete the AlluxioCluster (removes workers, coordinator, and FDB pods):

2. Verify all Alluxio pods are removed:

✅ Success: No resources found in <NAMESPACE> namespace.

To disable Write Cache without deleting the cluster, set alluxio.write.cache.enabled: "false" in alluxio-cluster.yaml and re-apply:

Troubleshooting

FDB connection failure on startup — FDB pods are not reachable from the workers.

Verify alluxio.foundationdb.cluster.file.path points to a valid FDB cluster file. When deployed via Operator, this is auto-injected.


FDB operator OOM / high memory usage: setting globalMode.enabled: true (the default) causes the FDB operator to watch all Pods, PVCs, ConfigMaps, and Services across the entire cluster, which can spike memory to several GB in large clusters.

Fix: move the Alluxio Operator, FDB Operator, and AlluxioCluster into the same namespace, set globalMode.enabled: false in alluxio-operator.yaml, and restart the FDB operator pod.


Out-of-space errors on write — pinned (unpersisted) data has filled the write cache.

Fix: increase alluxio.worker.page.store.pinned.file.capacity.limit.ratio, add NVMe capacity, or increase alluxio.write.cache.async.persist.thread.pool.size so persistence keeps up with writes.


WRITE_BACK data not appearing in UFS — verify async persistence threads are running:

Also check alluxio.worker.write.cache.async.persist.retry.max.interval — if UFS is unreachable, retries may be in a long backoff cycle.


Orphan files accumulating — uncommitted writes left by crashed clients. Reduce alluxio.write.cache.async.check.orphan.timeout to clean them up faster, or run:


Directory deletion returns DEADLINE_EXCEEDED — running alluxio fs rm -R on a WRITE_BACK path may time out with DEADLINE_EXCEEDED. Despite the error, files may have already been deleted from UFS before the timeout. Verify UFS state directly before retrying:

If the files are gone from S3, the deletion succeeded. Re-running rm -R on the Alluxio path will confirm with Path does not exist. Pagestore disk space may not shrink immediately — orphaned pages are reclaimed on the next eviction cycle.


pathconfig not taking effect — verify the policy resolved correctly:

If the path still shows the old policy, check coordinator logs for config reload activity.


Data not evicted after persistence — eviction only triggers when cache is under pressure. To proactively free space:

See Also

  • FUSE Write Optimization — use the same Write Cache backend via POSIX filesystem interface

  • S3 API — base endpoint, auth, load balancer, and client compatibility (required before enabling Write Cache)

  • S3 UFS Integration — tuning the underlying S3 persistence layer (upload threads, multipart settings)

  • S3 API Benchmarks — reference baselines, tool selection (COSBench / Warp / httpbench), and tuning for S3 API workloads
