# S3-API Write Optimization

{% hint style="warning" %}
This feature is experimental as of AI-3.8.
{% endhint %}

This guide shows how to enable Write Cache on top of the [S3 API](/ee-ai-en/ai-3.8-15.1.x/data-access/s3-api.md). Write Cache buffers `PUT` requests in the local NVMe cache and persists them to the under file system (UFS) asynchronously, giving millisecond-level write latency.

## Architecture Overview

The S3 API supports two deployment modes:

|                           | Standard Mode (Read Cache)                       | Write Cache Mode                                                  |
| ------------------------- | ------------------------------------------------ | ----------------------------------------------------------------- |
| **Use case**              | Accelerate reads from remote object storage      | Low-latency writes with async persistence                         |
| **FoundationDB**          | Not required                                     | Required                                                          |
| **Write policies**        | WRITE\_THROUGH                                   | WRITE\_THROUGH, WRITE\_BACK, TRANSIENT                            |
| **Deployment complexity** | Low                                              | Medium — requires FDB cluster and path-level policy configuration |
| **Typical workloads**     | AI model loading, data analytics, S3-based reads | Training checkpoints, ETL pipelines, hybrid-cloud write buffering |

> If your workload is read-heavy with occasional writes, the standard read-cache mode is sufficient — see [S3 API](/ee-ai-en/ai-3.8-15.1.x/data-access/s3-api.md).

### How Write Cache Works

Write Cache adds **FoundationDB (FDB)** to the standard S3 API deployment to provide strong consistency under concurrent writes. FDB is on the critical path for all metadata operations.

* **Write path**: `PUT` requests and multipart uploads (MPU) record their metadata in FDB first, then write the data to the Worker's local NVMe. A background persistence thread uploads the data to UFS asynchronously.
* **Read path**: `GET` requests query FDB to locate the owning Worker, then read from that Worker's local NVMe. On a cache miss, the Worker fetches the object from UFS and caches it locally.

## Before You Start

Run these checks before starting. Skipping this step is the most common cause of deployment failures.

* [ ] **S3 API is already set up and working** — Write Cache builds on top of it:

  ```shell
  kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
    alluxio conf get alluxio.worker.s3.api.enabled
  ```

  Expected: `true`
* [ ] **Alluxio Operator is running**:

  ```shell
  kubectl get pods -n alluxio-operator
  ```

  Expected: all pods `Running`
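
Both checks can also be wrapped in a single function for use in a deployment script. This is a minimal sketch: the function name `preflight` is hypothetical, and it assumes that `alluxio conf get` prints only the property value and that `kubectl get pods --no-headers` shows the pod status in the third column.

```shell
# Run both pre-deployment checks and stop at the first failure.
# Usage: preflight <NAMESPACE>
preflight() {
  pf_ns=$1

  # Check 1: the S3 API must already be enabled.
  pf_enabled=$(kubectl exec -n "$pf_ns" alluxio-cluster-coordinator-0 -- \
    alluxio conf get alluxio.worker.s3.api.enabled)
  if [ "$pf_enabled" != "true" ]; then
    echo "S3 API is not enabled (got: $pf_enabled)" >&2
    return 1
  fi

  # Check 2: at least one operator pod exists and all of them are Running.
  # awk prints "<total pods> <pods not Running>".
  pf_summary=$(kubectl get pods -n alluxio-operator --no-headers |
    awk '$3 != "Running" { bad++ } END { print NR, bad + 0 }')
  pf_total=${pf_summary% *}
  pf_bad=${pf_summary#* }
  if [ "$pf_total" -eq 0 ] || [ "$pf_bad" -ne 0 ]; then
    echo "operator pods: $pf_total found, $pf_bad not Running" >&2
    return 1
  fi

  echo "preflight OK"
}

# Example:
#   preflight <NAMESPACE>
```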

## Deployment Steps

### 1. Install FDB CRDs

Enable the FDB Operator in your `alluxio-operator.yaml` before installing or upgrading:

```yaml
fdb-operator:
  enabled: true
```

If upgrading an existing operator installation, manually apply the FDB CRDs (Helm does not install CRDs on upgrade):

```shell
# Idempotent
kubectl apply -f alluxio-operator/charts/fdb-operator/crds/
```

Verify the FDB operator is running:

```shell
kubectl get pods -n alluxio-operator -l app.kubernetes.io/name=fdb-operator
```

**✅ Success:** FDB operator pod shows `READY 1/1`, `STATUS = Running`.

```console
NAME                                          READY   STATUS    RESTARTS   AGE
alluxio-fdb-controller-6cbd5c7c45-xk2pq      1/1     Running   0          2m
```

> If the pod is not found, reinstall the operator with `fdb-operator.enabled=true` in `alluxio-operator.yaml` and re-run `helm upgrade`.

### 2. Enable Write Cache

Add the following to your `alluxio-cluster.yaml`, in addition to the base setup of [S3 API](/ee-ai-en/ai-3.8-15.1.x/data-access/s3-api.md#step-1-enable-the-s3-api):

```yaml
apiVersion: k8s-operator.alluxio.com/v1
kind: AlluxioCluster
spec:
  properties:
    alluxio.worker.s3.api.enabled: "true"
    alluxio.write.cache.enabled: "true"
  fdb:
    enabled: true
```

Apply the configuration:

```shell
# Idempotent
kubectl apply -f alluxio-cluster.yaml
```

**✅ Success:** `kubectl apply` prints no errors and the cluster enters a reconciling state.

> If you see `unable to recognize "alluxio-cluster.yaml": no matches for kind "AlluxioCluster"`, the Alluxio Operator CRDs are not installed — reinstall the operator first.

### 3. Verify Deployment

Wait for workers to be ready (startup typically takes 2–3 minutes):

```shell
kubectl wait --for=condition=Ready pod \
  -l app.kubernetes.io/component=worker \
  -n <NAMESPACE> --timeout=300s
```

**✅ Success:** Output shows that every worker pod reached the `Ready` condition.

```console
pod/alluxio-cluster-worker-0 condition met
```

Confirm FDB pods are running:

```shell
kubectl get pods -n <NAMESPACE> -l foundationdb.org/fdb-cluster-name=alluxio-cluster-fdb-meta
```

**✅ Success:** `cluster_controller`, `log`, and `storage` pods all show `Running`.

```console
NAME                                                   READY   STATUS    RESTARTS   AGE
alluxio-cluster-fdb-meta-cluster-controller-1-...      2/2     Running   0          2m
alluxio-cluster-fdb-meta-log-1-...                     2/2     Running   0          2m
alluxio-cluster-fdb-meta-storage-1-...                 2/2     Running   0          2m
```

> If FDB pods are stuck in `Pending`, check PVC availability: `kubectl get pvc -n <NAMESPACE>`. FDB requires a StorageClass with dynamic provisioning.

Confirm write cache is active on the coordinator:

```shell
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio conf get alluxio.write.cache.enabled
```

**✅ Success:** Returns `true`.
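
For automated rollouts, the three verification commands in this step can be chained so that a script stops at the first failure. The sketch below reuses the labels and pod names shown above; the function name `wait_for_write_cache` and the default timeout of `300s` (chosen to cover a 2 to 3 minute startup) are assumptions.

```shell
# Block until worker pods and FDB pods are Ready, then print the
# Write Cache flag as reported by the coordinator.
# Usage: wait_for_write_cache <NAMESPACE> [TIMEOUT]
wait_for_write_cache() {
  wwc_ns=$1
  wwc_timeout=${2:-300s}

  kubectl wait --for=condition=Ready pod \
      -l app.kubernetes.io/component=worker \
      -n "$wwc_ns" --timeout="$wwc_timeout" &&
    kubectl wait --for=condition=Ready pod \
      -l foundationdb.org/fdb-cluster-name=alluxio-cluster-fdb-meta \
      -n "$wwc_ns" --timeout="$wwc_timeout" &&
    kubectl exec -n "$wwc_ns" alluxio-cluster-coordinator-0 -- \
      alluxio conf get alluxio.write.cache.enabled
}

# Example:
#   wait_for_write_cache <NAMESPACE> 300s
```

The function returns a non-zero status as soon as either `kubectl wait` times out, so the final `conf get` only runs against a cluster whose pods are ready.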

### 4. Test Write and Read-After-Write

```shell
# Create a test file
echo "write cache test" > /tmp/test.txt

# Write an object via Alluxio S3 API
# Replace <LOAD_BALANCER_ADDRESS> with your load balancer hostname or IP.
# For testing without a load balancer, use port-forward instead:
#   kubectl port-forward -n <NAMESPACE> svc/alluxio-cluster-worker 29998:29998
aws s3 cp /tmp/test.txt s3://<BUCKET>/test.txt \
  --endpoint-url http://<LOAD_BALANCER_ADDRESS>:29998 \
  --no-sign-request
```

**✅ Success:**

```console
upload: /tmp/test.txt to s3://<BUCKET>/test.txt
```

Read back immediately (served from local cache, not UFS):

```shell
aws s3 cp s3://<BUCKET>/test.txt /tmp/verify.txt \
  --endpoint-url http://<LOAD_BALANCER_ADDRESS>:29998 \
  --no-sign-request

diff /tmp/test.txt /tmp/verify.txt
```

**✅ Success:** `diff` produces no output (files are identical).

> If the write returns `NoSuchBucket`: verify the mount is active with `kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- alluxio mount list`.
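
The same round trip can be repeated with a larger object so that the AWS CLI switches to multipart upload (its default `multipart_threshold` is 8 MB). The sketch below is one way to script this. The function name `roundtrip_check` and the object key `roundtrip-test.bin` are hypothetical, and the test object is left in the bucket afterwards.

```shell
# Upload a random file of SIZE_MB megabytes through the Alluxio S3 endpoint,
# download it again, and compare the two copies byte for byte.
# Usage: roundtrip_check <BUCKET> <ENDPOINT_URL> [SIZE_MB]
roundtrip_check() {
  rt_bucket=$1
  rt_endpoint=$2
  rt_size_mb=${3:-16}
  rt_src=$(mktemp)
  rt_dst=$(mktemp)

  # bs is given in bytes so the command works with both GNU and BSD dd.
  dd if=/dev/urandom of="$rt_src" bs=1048576 count="$rt_size_mb" 2>/dev/null

  aws s3 cp "$rt_src" "s3://$rt_bucket/roundtrip-test.bin" \
      --endpoint-url "$rt_endpoint" --no-sign-request >/dev/null &&
    aws s3 cp "s3://$rt_bucket/roundtrip-test.bin" "$rt_dst" \
      --endpoint-url "$rt_endpoint" --no-sign-request >/dev/null &&
    cmp -s "$rt_src" "$rt_dst"
  rt_status=$?

  rm -f "$rt_src" "$rt_dst"
  return "$rt_status"
}

# Example (requires a reachable endpoint):
#   roundtrip_check <BUCKET> http://<LOAD_BALANCER_ADDRESS>:29998 && echo "round trip OK"
```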

## Write Policies

By default, Write Cache uses `WRITE_THROUGH` (synchronous write to both cache and UFS). For low-latency writes, switch specific paths to `WRITE_BACK`.

| Policy          | Behavior                                                                           | When to Use                                         |
| --------------- | ---------------------------------------------------------------------------------- | --------------------------------------------------- |
| `WRITE_THROUGH` | (Default) Write to cache and UFS simultaneously. Succeeds only when both complete. | Durability-first; write latency is bounded by UFS   |
| `WRITE_BACK`    | Write to cache immediately, persist to UFS in background.                          | Low-latency writes with eventual durability         |
| `TRANSIENT`     | Cache only — never persisted to UFS.                                               | Temporary/recomputable data (e.g., shuffle outputs) |
| `READ_ONLY`     | Disallow all writes on this path.                                                  | Protect paths from accidental writes                |
| `NO_CACHE`      | Bypass cache; reads and writes go directly to UFS.                                 | Paths that should not be cached                     |

### Path-Level Configuration

Write policies are configured per path, allowing different policies for different workloads within the same cluster.

Edit the policy configuration interactively (run inside the coordinator pod):

```shell
kubectl exec -it -n <NAMESPACE> alluxio-cluster-coordinator-0 -- alluxio pathconfig edit
```

Example configuration — global default `WRITE_THROUGH`, with `WRITE_BACK` for checkpoint paths and `TRANSIENT` for shuffle:

```json
{
  "apiVersion": "v1.0",
  "defaultRule": {
    "description": "Global default",
    "policyMode": "WRITE_THROUGH",
    "properties": {
      "writeReplicas": 1
    }
  },
  "pathRules": [
    {
      "alluxioPath": "/checkpoints/**",
      "description": "Low-latency checkpoint writes",
      "policyMode": "WRITE_BACK",
      "properties": {
        "writeReplicas": 2
      }
    },
    {
      "alluxioPath": "/shuffle/**",
      "description": "Temporary shuffle data",
      "policyMode": "TRANSIENT",
      "properties": {
        "writeReplicas": 2
      }
    }
  ]
}
```

Verify a path resolves to the expected policy:

```shell
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio pathconfig test --path /checkpoints/epoch-1/model.pt
```

**✅ Success:** Output includes `"policyMode": "WRITE_BACK"`.

Via REST API (for programmatic integration):

```shell
curl -X PUT \
  -H "Content-Type: application/json" \
  -d @pathconfig.json \
  http://<COORDINATOR_HOST>:<COORDINATOR_PORT>/api/v1/conf
```
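
Before sending `pathconfig.json` to the coordinator, a quick local check can catch a misspelled policy name. The sketch below is a plain-text check built on `grep` and `sed`, not a JSON validator, and it assumes the five policy names in the Write Policies table are the complete set. The function name `check_policy_modes` is hypothetical.

```shell
# Fail if any "policyMode" value in the given file is not one of the five
# policies from the Write Policies table.
# Usage: check_policy_modes <FILE>
check_policy_modes() {
  cpm_file=$1
  cpm_status=0

  # Extract every "policyMode": "<VALUE>" pair and keep only <VALUE>.
  for cpm_mode in $(grep -o '"policyMode"[[:space:]]*:[[:space:]]*"[^"]*"' "$cpm_file" |
      sed 's/.*"\([^"]*\)"$/\1/'); do
    case "$cpm_mode" in
      WRITE_THROUGH | WRITE_BACK | TRANSIENT | READ_ONLY | NO_CACHE) ;;
      *)
        echo "unknown policyMode: $cpm_mode" >&2
        cpm_status=1
        ;;
    esac
  done

  return "$cpm_status"
}

# Example:
#   check_policy_modes pathconfig.json && echo "policy modes OK"
```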

### Multi-Replica Write

For `WRITE_BACK` and `TRANSIENT` paths, setting `writeReplicas > 1` keeps multiple copies of unpersisted data across different workers. This reduces the risk of data loss during the window before UFS persistence completes.

**Trade-off**: A higher replica count improves fault tolerance and read concurrency, but slightly increases intra-cluster network usage and write latency.

Recommended settings:

* `WRITE_BACK` — `writeReplicas: 2` for production; `1` for maximum write throughput
* `TRANSIENT` — `writeReplicas: 2` or higher, since this data is never persisted to UFS

## Operations & Tuning

### Key Configuration

| Property                                                     | Default                           | Description                                                                                                                                                         |
| ------------------------------------------------------------ | --------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `alluxio.write.cache.enabled`                                | `false`                           | Enables Write Cache.                                                                                                                                                |
| `alluxio.foundationdb.cluster.file.path`                     | `${alluxio.conf.dir}/fdb.cluster` | Path to the FDB cluster file. Auto-injected when FDB is deployed via Operator; set manually for external FDB.                                                       |
| `alluxio.write.cache.async.check.orphan.timeout`             | `1h`                              | Uncommitted writes older than this threshold are treated as abandoned and cleaned up.                                                                               |
| `alluxio.write.cache.async.file.check.period`                | `10min`                           | Scan interval for orphan detection. Shorter intervals increase FDB load.                                                                                            |
| `alluxio.worker.page.store.pinned.file.capacity.limit.ratio` | `0.3`                             | Maximum fraction of cache capacity for unpersisted (pinned) data. The remaining capacity is available for read cache (LRU-evictable).                               |
| `alluxio.worker.mark.writing.files.duration`                 | `10min`                           | If a file is open for write but receives no new data for this duration, the worker treats it as a dangling write eligible for cleanup. Timer resets on every write. |

### Async Persistence Retry

For `WRITE_BACK` paths, failed UFS uploads are retried with exponential backoff. Retries run in background threads and do not block front-end write acknowledgments.

| Property                                                          | Default | Description                                   |
| ----------------------------------------------------------------- | ------- | --------------------------------------------- |
| `alluxio.worker.write.cache.async.persist.retry.initial.interval` | `1s`    | Initial retry wait.                           |
| `alluxio.worker.write.cache.async.persist.retry.max.interval`     | `1h`    | Maximum retry wait (caps exponential growth). |
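
To get a feel for how these two settings interact, the sketch below prints the wait before each retry. It assumes the interval doubles after every failure; the actual multiplier is not stated on this page, so treat the exact numbers as illustrative. The function name `backoff_schedule` is hypothetical.

```shell
# Print the wait before each of the first COUNT retries.
# Usage: backoff_schedule <INITIAL_SECONDS> <MAX_SECONDS> <COUNT>
backoff_schedule() {
  bs_wait=$1
  bs_max=$2
  bs_count=$3
  bs_i=1
  while [ "$bs_i" -le "$bs_count" ]; do
    echo "retry $bs_i: ${bs_wait}s"
    # Assumed policy: double the wait, then cap it at the maximum.
    bs_wait=$((bs_wait * 2))
    if [ "$bs_wait" -gt "$bs_max" ]; then
      bs_wait=$bs_max
    fi
    bs_i=$((bs_i + 1))
  done
}

# Defaults from the table: 1s initial interval, 1h (3600s) maximum.
backoff_schedule 1 3600 14
```

Under the doubling assumption, the wait reaches the one-hour cap at the 13th retry, after which a file that still cannot be uploaded is retried roughly once an hour.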

### Monitoring Async Persistence (15.1.3+)

Two CLI commands let you inspect in-flight persist operations:

```shell
# List all files pending or in-progress on a specific worker
kubectl exec -i -n <NAMESPACE> alluxio-cluster-worker-0 -- \
  alluxio async-persist list

# Check the persist state and retry count for a specific path
kubectl exec -i -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio async-persist stat --path /checkpoints/epoch-1/model.pt
```

When `alluxio fs ls` shows a file stuck in `NOT_PERSISTED`, use `async-persist stat` to determine whether the problem lies in the queue or in the upload itself.

### Cache Space Management

Worker cache space is divided into two logical regions:

* **Pinned space** (write cache): unpersisted dirty data that cannot be evicted. Capped at 30% of total capacity by default (`alluxio.worker.page.store.pinned.file.capacity.limit.ratio`).
* **Evictable space** (read cache): persisted or UFS-loaded data, evicted in least-recently-used (LRU) order when space is needed.

If persistence throughput falls behind write traffic, pinned space fills up and Alluxio returns `out-of-space` errors. To prevent this:

* Ensure `alluxio.write.cache.async.persist.thread.pool.size` is sufficient for your write rate
* Monitor pinned space usage and adjust `alluxio.worker.page.store.pinned.file.capacity.limit.ratio` if needed
* Allocate adequate NVMe capacity for both regions
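
As a rough sizing aid, the time until pinned space fills can be estimated from the cache capacity, the pinned ratio, and the gap between write and persistence throughput. The sketch below is back-of-the-envelope arithmetic, not an Alluxio tool. The function name `pinned_fill_seconds` and the example rates are hypothetical.

```shell
# Estimate how many seconds of sustained writes fit into pinned space.
# Usage: pinned_fill_seconds <CACHE_GIB> <PINNED_RATIO> <WRITE_GIB_PER_S> <PERSIST_GIB_PER_S>
pinned_fill_seconds() {
  awk -v cap="$1" -v ratio="$2" -v wr="$3" -v pr="$4" 'BEGIN {
    pinned = cap * ratio   # GiB available for unpersisted data
    net = wr - pr          # GiB/s by which the backlog grows
    if (net <= 0) { print "never"; exit }
    printf "%.0f\n", pinned / net
  }'
}

# 1000 GiB cache, default ratio 0.3, writes at 3 GiB/s, persistence at 2 GiB/s.
pinned_fill_seconds 1000 0.3 3 2
```

With these example inputs the pinned region is 300 GiB and the backlog grows by 1 GiB/s, so the estimate is 300 seconds. When persistence keeps up with writes, the function prints `never`.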

## Performance Reference

Reference numbers for `WRITE_BACK` on AWS `c5n.metal` clients + `i3en.metal` workers (100 Gbps network, NVMe SSD). Actual results vary by hardware, object size, and concurrency.

| Workload                                     | Write Cache                  | Direct S3            |
| -------------------------------------------- | ---------------------------- | -------------------- |
| Small object PUT (10 KB), low concurrency    | 3–5 ms                       | 30–60 ms             |
| Small object PUT (10 KB), medium concurrency | 4–9 ms                       | 30–60 ms             |
| Large object PUT (10 MB), single worker      | 3–6 GB/s sustained           | Variable (throttled) |
| GET after write (read-after-write latency)   | 3–7 ms                       | 90–130 ms            |
| Async persistence throughput                 | \~2,000 objects/s per worker | —                    |

Front-end write latency for `WRITE_BACK` is bounded by **local NVMe**, not UFS. Throughput scales near-linearly with additional workers.

## Uninstall

To remove the Write Cache configuration and FDB resources (reverse order of setup):

**1. Delete the AlluxioCluster** (removes workers, coordinator, and FDB pods):

```shell
kubectl delete -f alluxio-cluster.yaml
```

**2. Verify all Alluxio pods are removed:**

```shell
kubectl get pods -n <NAMESPACE>
```

**✅ Success:** `No resources found in <NAMESPACE> namespace.`

To disable Write Cache without deleting the cluster, set `alluxio.write.cache.enabled: "false"` in `alluxio-cluster.yaml` and re-apply:

```shell
kubectl apply -f alluxio-cluster.yaml  # Idempotent
```

## Troubleshooting

**FDB connection failure on startup** — FDB pods are not reachable from the workers.

```shell
kubectl get pods -n <NAMESPACE> -l foundationdb.org/fdb-cluster-name=alluxio-cluster-fdb-meta
```

Verify `alluxio.foundationdb.cluster.file.path` points to a valid FDB cluster file. When deployed via Operator, this is auto-injected.

***

**FDB operator OOM / high memory usage**: with `globalMode.enabled: true` (the default), the FDB operator watches all Pods, PVCs, ConfigMaps, and Services across the entire cluster, which can push memory usage to several GB in large clusters.

Fix: move the Alluxio Operator, FDB Operator, and AlluxioCluster into the **same namespace**, set `globalMode.enabled: false` in `alluxio-operator.yaml`, and restart the FDB operator pod.

***

**Out-of-space errors on write** — pinned (unpersisted) data has filled the write cache.

Fix: increase `alluxio.worker.page.store.pinned.file.capacity.limit.ratio`, add NVMe capacity, or increase `alluxio.write.cache.async.persist.thread.pool.size` so persistence keeps up with writes.

***

**WRITE\_BACK data not appearing in UFS** — verify async persistence threads are running:

```shell
kubectl logs -n <NAMESPACE> -l app.kubernetes.io/component=worker --tail=100 | grep -i persist
```

Also check `alluxio.worker.write.cache.async.persist.retry.max.interval` — if UFS is unreachable, retries may be in a long backoff cycle.

***

**Orphan files accumulating** — uncommitted writes left by crashed clients. Reduce `alluxio.write.cache.async.check.orphan.timeout` to clean them up faster, or run:

```shell
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- alluxio job free
```

***

**Directory deletion returns DEADLINE\_EXCEEDED** — running `alluxio fs rm -R` on a `WRITE_BACK` path may time out with `DEADLINE_EXCEEDED`. Despite the error, files **may have already been deleted from UFS** before the timeout. Verify UFS state directly before retrying:

```shell
aws s3 ls s3://<BUCKET>/<path>/ --recursive | head -20
```

If the files are gone from S3, the deletion succeeded. Re-running `rm -R` on the Alluxio path will confirm with `Path does not exist`. Pagestore disk space may not shrink immediately — orphaned pages are reclaimed on the next eviction cycle.

***

**pathconfig not taking effect** — verify the policy resolved correctly:

```shell
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio pathconfig test --path <YOUR_PATH>
```

If the path still shows the old policy, check coordinator logs for config reload activity.

***

**Data not evicted after persistence** — eviction only triggers when cache is under pressure. To proactively free space:

```shell
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- alluxio job free
```

## See Also

* [FUSE Write Optimization](/ee-ai-en/ai-3.8-15.1.x/performance/fuse-write-cache.md) — use the same Write Cache backend via POSIX filesystem interface
* [S3 API](/ee-ai-en/ai-3.8-15.1.x/data-access/s3-api.md) — base endpoint, auth, load balancer, and client compatibility (required before enabling Write Cache)
* [S3 UFS Integration](/ee-ai-en/ai-3.8-15.1.x/ufs/s3.md) — tuning the underlying S3 persistence layer (upload threads, multipart settings)
* [S3 API Benchmarks](/ee-ai-en/ai-3.8-15.1.x/benchmark/s3-api.md) — reference baselines, tool selection (COSBench / Warp / httpbench), and tuning for S3 API workloads

