S3-API Write Optimization
This feature has been experimental since AI-3.8.
This guide shows how to enable Write Cache on top of the S3 API, buffering PUT requests in local NVMe cache and persisting to UFS asynchronously for millisecond-level write latency.
Architecture Overview
The S3 API supports two deployment modes:

| | Read cache (standard) | Write Cache |
|---|---|---|
| Use case | Accelerate reads from remote object storage | Low-latency writes with async persistence |
| FoundationDB | Not required | Required |
| Write policies | WRITE_THROUGH | WRITE_THROUGH, WRITE_BACK, TRANSIENT |
| Deployment complexity | Low | Medium: requires an FDB cluster and path-level policy configuration |
| Typical workloads | AI model loading, data analytics, S3-based reads | Training checkpoints, ETL pipelines, hybrid-cloud write buffering |
If your workload is read-heavy with occasional writes, the standard read-cache mode is sufficient — see S3 API.
How Write Cache Works
Write Cache adds FoundationDB (FDB) to the standard S3 API deployment to provide strong consistency under concurrent writes. FDB is on the critical path for all metadata operations.
Write path: PUT requests and multipart (MPU) uploads land in FDB (metadata), then on the Worker's local NVMe (data). A background persistence thread uploads to UFS asynchronously.
Read path: GET requests query FDB to locate the owning Worker, then read from local NVMe. On a cache miss, the Worker fetches from UFS and caches locally.
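As a rough mental model of the two paths above, the following toy sketch uses in-memory dictionaries in place of FDB, the Worker's NVMe cache, and the UFS. The class and method names (ToyWriteBackCache, put, get, flush) are hypothetical and are not part of Alluxio; the sketch only illustrates the ordering of the metadata write, the local write, and background persistence.

```python
import queue
import threading


class ToyWriteBackCache:
    """Toy model of the write and read paths. Not Alluxio code."""

    def __init__(self):
        self.metadata = {}   # stands in for FDB: key -> owning worker
        self.local = {}      # stands in for the Worker's local NVMe cache
        self.ufs = {}        # stands in for the under file system (UFS)
        self._pending = queue.Queue()
        # Background persistence thread, started once per cache instance.
        threading.Thread(target=self._persist_loop, daemon=True).start()

    def put(self, key, data):
        # Write path: metadata first, then local data, then queue the upload.
        self.metadata[key] = "worker-0"
        self.local[key] = data
        self._pending.put(key)

    def get(self, key):
        # Read path: look up the owner, then serve from the local cache.
        if key in self.metadata and key in self.local:
            return self.local[key]
        # Cache miss: fetch from UFS and cache locally.
        data = self.ufs[key]
        self.local[key] = data
        self.metadata[key] = "worker-0"
        return data

    def _persist_loop(self):
        # Upload queued objects to UFS without blocking callers of put().
        while True:
            key = self._pending.get()
            self.ufs[key] = self.local[key]
            self._pending.task_done()

    def flush(self):
        # Block until every queued object has been persisted.
        self._pending.join()
```

In this model, put() returns before the object reaches ufs, which mirrors why the acknowledged write latency does not depend on the UFS.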
Before You Start
Run these checks before starting. Skipping this step is the most common cause of deployment failures.
Deployment Steps
1. Install FDB CRDs
Enable the FDB Operator in your alluxio-operator.yaml before installing or upgrading:
If upgrading an existing operator installation, manually apply the FDB CRDs (Helm does not install CRDs on upgrade):
Verify the FDB operator is running:
✅ Success: FDB operator pod shows READY 1/1, STATUS = Running.
If the pod is not found, reinstall the operator with fdb-operator.enabled=true in alluxio-operator.yaml and re-run helm upgrade.
2. Enable Write Cache
Add the following to your alluxio-cluster.yaml, in addition to the base setup of S3 API:
Apply the configuration:
✅ Success: Helm/kubectl prints no errors and the cluster enters a reconciling state.
If you see unable to recognize "alluxio-cluster.yaml": no matches for kind "AlluxioCluster", the Alluxio Operator CRDs are not installed; reinstall the operator first.
3. Verify Deployment
Wait for workers to be ready (startup typically takes 2–3 minutes):
✅ Success: Output shows all worker pods reached Ready condition.
Confirm FDB pods are running:
✅ Success: cluster_controller, log, and storage pods all show Running.
If FDB pods are stuck in Pending, check PVC availability: kubectl get pvc -n <NAMESPACE>. FDB requires a StorageClass with dynamic provisioning.
Confirm write cache is active on the coordinator:
✅ Success: Returns true.
4. Test Write and Read-After-Write
✅ Success:
Read back immediately (served from local cache, not UFS):
✅ Success: diff produces no output (files are identical).
If the write returns NoSuchBucket, verify the mount is active with kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- alluxio mount list.
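The same write and read-after-write check can be scripted from any S3 client. The sketch below uses boto3's put_object and get_object; the endpoint URL, credentials, bucket, and key you pass in are placeholders for your own deployment, and the helper names make_client and read_after_write are hypothetical.

```python
def make_client(endpoint_url, access_key, secret_key):
    """Build an S3 client pointed at the Alluxio S3 endpoint (values are placeholders)."""
    import boto3  # imported lazily so read_after_write works with any S3-like client

    return boto3.client(
        "s3",
        endpoint_url=endpoint_url,
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )


def read_after_write(client, bucket, key, payload):
    """PUT an object, GET it back immediately, and report whether the bytes match."""
    client.put_object(Bucket=bucket, Key=key, Body=payload)
    body = client.get_object(Bucket=bucket, Key=key)["Body"].read()
    return body == payload
```

A True result corresponds to the empty diff in the manual check. It does not by itself show whether the read was served from cache or from UFS.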
Write Policies
By default, Write Cache uses WRITE_THROUGH (synchronous write to both cache and UFS). For low-latency writes, switch specific paths to WRITE_BACK.
| Policy | Behavior | Use case |
|---|---|---|
| WRITE_THROUGH | (Default) Write to cache and UFS simultaneously. Succeeds only when both complete. | Durability-first; write latency is bounded by UFS |
| WRITE_BACK | Write to cache immediately, persist to UFS in background. | Low-latency writes with eventual durability |
| TRANSIENT | Cache only; never persisted to UFS. | Temporary/recomputable data (e.g., shuffle outputs) |
| READ_ONLY | Disallow all writes on this path. | Protect paths from accidental writes |
| NO_CACHE | Bypass cache; reads and writes go directly to UFS. | Paths that should not be cached |
Path-Level Configuration
Write policies are configured per path, allowing different policies for different workloads within the same cluster.
Edit the policy configuration interactively (run inside the coordinator pod):
Example configuration — global default WRITE_THROUGH, with WRITE_BACK for checkpoint paths and TRANSIENT for shuffle:
Verify a path resolves to the expected policy:
✅ Success: Output includes "policyMode": "WRITE_BACK".
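To make the idea of per-path policies concrete, here is a minimal sketch of how a path could be matched against a set of rules. It assumes that the most specific (longest) matching prefix wins and that unmatched paths fall back to the global default; this matching rule is an assumption made for illustration, not a description of Alluxio's resolver. The function name and the example paths are hypothetical.

```python
def resolve_policy(path, rules, default="WRITE_THROUGH"):
    """Return the policy of the longest rule prefix that contains `path`.

    `rules` maps a path prefix to a policy name. Longest-prefix matching is
    an assumption for this sketch, not confirmed Alluxio behavior.
    """
    best = None
    for prefix, policy in rules.items():
        p = prefix.rstrip("/")
        # Match the prefix itself or anything below it, but not siblings
        # that merely share leading characters (e.g. /checkpoints-old).
        if path == p or path.startswith(p + "/"):
            if best is None or len(p) > len(best[0]):
                best = (p, policy)
    return best[1] if best else default


# Hypothetical rule set mirroring the example described above.
rules = {
    "/checkpoints": "WRITE_BACK",
    "/shuffle": "TRANSIENT",
}
```

With these rules, resolve_policy("/checkpoints/run1/model.pt", rules) yields WRITE_BACK, while a path outside both prefixes falls back to WRITE_THROUGH.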
Via REST API (for programmatic integration):
Multi-Replica Write
For WRITE_BACK and TRANSIENT paths, setting writeReplicas > 1 keeps multiple copies of unpersisted data across different workers. This reduces the risk of data loss during the window before UFS persistence completes.
Trade-off: A higher replica count improves fault tolerance and read concurrency, but slightly increases intra-cluster network usage and write latency.
Recommended settings:
WRITE_BACK: set writeReplicas to 2 for production, or 1 for maximum write throughput
TRANSIENT: set writeReplicas to 2 or higher, since this data is never persisted to UFS
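The effect of the replica count on data-loss risk can be estimated with a back-of-the-envelope model. The sketch below assumes that workers fail independently and that unpersisted data is lost only if every worker holding a replica fails before persistence completes. Both are simplifying assumptions, and the failure probability is an illustrative input, not a measured value.

```python
def loss_probability(worker_failure_prob, write_replicas):
    """Probability that all replicas of an unpersisted object are lost.

    Assumes independent worker failures during the persistence window.
    """
    if not 0.0 <= worker_failure_prob <= 1.0:
        raise ValueError("worker_failure_prob must be between 0 and 1")
    if write_replicas < 1:
        raise ValueError("write_replicas must be at least 1")
    return worker_failure_prob ** write_replicas
```

Under this model, moving from one replica to two squares the loss probability, which is why the gain from the second replica is much larger than from any later one.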
Operations & Tuning
Key Configuration
| Property | Default | Description |
|---|---|---|
| alluxio.write.cache.enabled | false | Enables Write Cache. |
| alluxio.foundationdb.cluster.file.path | ${alluxio.conf.dir}/fdb.cluster | Path to the FDB cluster file. Auto-injected when FDB is deployed via Operator; set manually for external FDB. |
| alluxio.write.cache.async.check.orphan.timeout | 1h | Uncommitted writes older than this threshold are treated as abandoned and cleaned up. |
| alluxio.write.cache.async.file.check.period | 10min | Scan interval for orphan detection. Shorter intervals increase FDB load. |
| alluxio.worker.page.store.pinned.file.capacity.limit.ratio | 0.3 | Maximum fraction of cache capacity for unpersisted (pinned) data. The remaining capacity is available for read cache (LRU-evictable). |
| alluxio.worker.mark.writing.files.duration | 10min | If a file is open for write but receives no new data for this duration, the worker treats it as a dangling write eligible for cleanup. Timer resets on every write. |
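The orphan timeout and the scan period interact: an uncommitted write can only be cleaned up by a scan that runs after it has passed the timeout. The sketch below estimates the worst-case age of an orphan under the assumption that a write which crosses the threshold just after one scan waits a full period for the next. That scan behavior is an assumption for illustration. The duration parser handles only the units that appear in this guide (s, min, h), and both function names are hypothetical.

```python
import re
from datetime import timedelta

# Units used by the defaults in this guide; not a full duration grammar.
_UNITS = {"s": "seconds", "min": "minutes", "h": "hours"}


def parse_duration(text):
    """Parse values such as '1s', '10min', or '1h' into a timedelta."""
    match = re.fullmatch(r"(\d+)(s|min|h)", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(**{_UNITS[match.group(2)]: int(match.group(1))})


def worst_case_orphan_age(timeout="1h", period="10min"):
    """Upper bound on an orphan's age at cleanup: timeout plus one scan period."""
    return parse_duration(timeout) + parse_duration(period)
```

With the defaults, the estimate is 70 minutes. Shortening the scan period tightens the bound at the cost of more FDB load.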
Async Persistence Retry
For WRITE_BACK paths, failed UFS uploads are retried with exponential backoff. Retries run in background threads and do not block front-end write acknowledgments.
| Property | Default | Description |
|---|---|---|
| alluxio.worker.write.cache.async.persist.retry.initial.interval | 1s | Initial retry wait. |
| alluxio.worker.write.cache.async.persist.retry.max.interval | 1h | Maximum retry wait (caps exponential growth). |
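To see how the two intervals shape the retry timeline, the sketch below generates a capped exponential backoff schedule from the default values. The doubling multiplier and the absence of jitter are assumptions; the guide states only that growth is exponential and capped at the maximum interval.

```python
def backoff_schedule(initial_s=1.0, max_s=3600.0, multiplier=2.0, attempts=15):
    """Return the wait (in seconds) before each of the first `attempts` retries.

    multiplier=2.0 is an assumed growth factor, not a documented default.
    """
    delays = []
    delay = initial_s
    for _ in range(attempts):
        delays.append(min(delay, max_s))
        delay = min(delay * multiplier, max_s)
    return delays
```

Under these assumptions the wait reaches the 1h cap on the 13th retry, after roughly 68 minutes of cumulative backoff (1 + 2 + ... + 2048 seconds). From then on, a recovered UFS may go unnoticed for up to an hour per retry.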
Monitoring Async Persistence (15.1.3+)
Two CLI commands let you inspect in-flight persist operations:
Use async-persist stat when alluxio fs ls shows a file stuck in NOT_PERSISTED to determine whether the issue is in the queue or the upload itself.
Cache Space Management
Worker cache space is divided into two logical regions:
Pinned space (write cache): unpersisted dirty data, not evictable. Capped at 30% of total capacity by default (alluxio.worker.page.store.pinned.file.capacity.limit.ratio).
Evictable space (read cache): persisted or UFS-loaded data, evicted in LRU order when space is needed.
If persistence throughput falls behind write traffic, pinned space fills up and Alluxio returns out-of-space errors. To prevent this:
Ensure alluxio.write.cache.async.persist.thread.pool.size is sufficient for your write rate
Monitor pinned space usage and adjust alluxio.worker.page.store.pinned.file.capacity.limit.ratio if needed
Allocate adequate NVMe capacity for both regions
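A quick sizing calculation helps judge whether pinned space is at risk. The sketch below computes the pinned capacity from the limit ratio and estimates how long a worker can sustain a given write rate before pinned space fills, assuming constant write and persistence throughput. The function names and the throughput figures in the usage note are illustrative only.

```python
def pinned_capacity_bytes(cache_bytes, ratio=0.3):
    """Bytes available for unpersisted (pinned) data on one worker."""
    return cache_bytes * ratio


def seconds_until_pinned_full(cache_bytes, write_bps, persist_bps, ratio=0.3):
    """Time until pinned space fills, or None if persistence keeps up.

    Assumes constant write and persist rates (bytes per second) and an
    initially empty pinned region.
    """
    backlog_rate = write_bps - persist_bps
    if backlog_rate <= 0:
        return None
    return pinned_capacity_bytes(cache_bytes, ratio) / backlog_rate
```

For example, a 1 TB cache at the default ratio, written at 2 GB/s and persisted at 1 GB/s, would fill its pinned space in about 300 seconds under this model.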
Performance Reference
Reference numbers for WRITE_BACK on AWS c5n.metal clients + i3en.metal workers (100 Gbps network, NVMe SSD). Actual results vary by hardware, object size, and concurrency.
| Scenario | Write Cache (WRITE_BACK) | Baseline |
|---|---|---|
| Small object PUT (10 KB), low concurrency | 3–5 ms | 30–60 ms |
| Small object PUT (10 KB), medium concurrency | 4–9 ms | 30–60 ms |
| Large object PUT (10 MB), single worker | 3–6 GB/s sustained | Variable (throttled) |
| GET after write (read-after-write latency) | 3–7 ms | 90–130 ms |
| Async persistence throughput | ~2,000 objects/s per worker | n/a |
Front-end write latency for WRITE_BACK is bounded by local NVMe, not UFS. Throughput scales near-linearly with additional workers.
Uninstall
To remove the Write Cache configuration and FDB resources (reverse order of setup):
1. Delete the AlluxioCluster (removes workers, coordinator, and FDB pods):
2. Verify all Alluxio pods are removed:
✅ Success: No resources found in <NAMESPACE> namespace.
To disable Write Cache without deleting the cluster, set alluxio.write.cache.enabled: "false" in alluxio-cluster.yaml and re-apply:
Troubleshooting
FDB connection failure on startup — FDB pods are not reachable from the workers.
Verify alluxio.foundationdb.cluster.file.path points to a valid FDB cluster file. When deployed via Operator, this is auto-injected.
FDB operator OOM / high memory usage — globalMode: enabled: true (the default) causes the FDB operator to watch all Pods, PVCs, ConfigMaps, and Services across the entire cluster, which can spike memory to several GBs in large clusters.
Fix: move the Alluxio Operator, FDB Operator, and AlluxioCluster into the same namespace, set globalMode.enabled: false in alluxio-operator.yaml, and restart the FDB operator pod.
Out-of-space errors on write — pinned (unpersisted) data has filled the write cache.
Fix: increase alluxio.worker.page.store.pinned.file.capacity.limit.ratio, add NVMe capacity, or increase alluxio.write.cache.async.persist.thread.pool.size so persistence keeps up with writes.
WRITE_BACK data not appearing in UFS — verify async persistence threads are running:
Also check alluxio.worker.write.cache.async.persist.retry.max.interval — if UFS is unreachable, retries may be in a long backoff cycle.
Orphan files accumulating — uncommitted writes left by crashed clients. Reduce alluxio.write.cache.async.check.orphan.timeout to clean them up faster, or run:
Directory deletion returns DEADLINE_EXCEEDED — running alluxio fs rm -R on a WRITE_BACK path may time out with DEADLINE_EXCEEDED. Despite the error, files may have already been deleted from UFS before the timeout. Verify UFS state directly before retrying:
If the files are gone from S3, the deletion succeeded. Re-running rm -R on the Alluxio path will confirm with Path does not exist. Pagestore disk space may not shrink immediately — orphaned pages are reclaimed on the next eviction cycle.
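The UFS check can also be scripted. The sketch below asks S3 directly (not the Alluxio endpoint) whether any object remains under a prefix, using boto3's list_objects_v2 with MaxKeys=1. The bucket and prefix are placeholders for your own UFS location, and the helper name is hypothetical.

```python
def prefix_is_empty(client, bucket, prefix):
    """Return True if no object in `bucket` has a key starting with `prefix`.

    `client` is an S3 client for the underlying store, for example one
    created with boto3.client("s3") using your UFS credentials.
    """
    # One key is enough to prove the prefix is not empty.
    response = client.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    return response.get("KeyCount", 0) == 0
```

If this returns True for the deleted directory's prefix, the deletion reached the UFS even though the client saw DEADLINE_EXCEEDED.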
pathconfig not taking effect — verify the policy resolved correctly:
If the path still shows the old policy, check coordinator logs for config reload activity.
Data not evicted after persistence — eviction only triggers when cache is under pressure. To proactively free space:
See Also
FUSE Write Optimization — use the same Write Cache backend via POSIX filesystem interface
S3 API — base endpoint, auth, load balancer, and client compatibility (required before enabling Write Cache)
S3 UFS Integration — tuning the underlying S3 persistence layer (upload threads, multipart settings)
S3 API Benchmarks — reference baselines, tool selection (COSBench / Warp / httpbench), and tuning for S3 API workloads