# S3-API Write Optimization

{% hint style="warning" %}
This feature is experimental as of AI-3.8.
{% endhint %}

This guide shows how to enable Write Cache on top of the [S3 API](https://documentation.alluxio.io/ee-ai-en/data-access/s3-api). Write Cache buffers `PUT` requests in the Worker's local NVMe cache and persists them to UFS asynchronously, delivering millisecond-level write latency.

## Architecture Overview

The S3 API supports two deployment modes:

|                           | Standard Mode (Read Cache)                       | Write Cache Mode                                                  |
| ------------------------- | ------------------------------------------------ | ----------------------------------------------------------------- |
| **Use case**              | Accelerate reads from remote object storage      | Low-latency writes with async persistence                         |
| **FoundationDB**          | Not required                                     | Required                                                          |
| **Write policies**        | WRITE\_THROUGH                                   | WRITE\_THROUGH, WRITE\_BACK, TRANSIENT                            |
| **Deployment complexity** | Low                                              | Medium — requires FDB cluster and path-level policy configuration |
| **Typical workloads**     | AI model loading, data analytics, S3-based reads | Training checkpoints, ETL pipelines, hybrid-cloud write buffering |

> If your workload is read-heavy with occasional writes, the standard read-cache mode is sufficient — see [S3 API](https://documentation.alluxio.io/ee-ai-en/data-access/s3-api).

### How Write Cache Works

Write Cache adds **FoundationDB (FDB)** to the standard S3 API deployment to provide strong consistency under concurrent writes. FDB is on the critical path for all metadata operations.

* **Write path** — `PUT` requests and multipart (MPU) uploads write metadata to FDB and data to the Worker's local NVMe. A background persistence thread uploads the data to UFS asynchronously.
* **Read path** — `GET` requests query FDB to locate the owning Worker, then read from that Worker's local NVMe. On a cache miss, the Worker fetches the object from UFS and caches it locally.

## Before You Start

Run these checks before starting. Skipping this step is the most common cause of deployment failures.

* [ ] **S3 API is already set up and working** — Write Cache builds on top of it:

  ```shell
  kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
    alluxio conf get alluxio.worker.s3.api.enabled
  ```

  Expected: `true`
* [ ] **Alluxio Operator is running**:

  ```shell
  kubectl get pods -n alluxio-operator
  ```

  Expected: all pods `Running`

## Deployment Steps

### 1. Install FDB CRDs

Enable the FDB Operator in your `alluxio-operator.yaml` before installing or upgrading:

```yaml
fdb-operator:
  enabled: true
```
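
If you manage the operator with Helm, re-applying the values looks roughly like the following. The release name, chart path, and namespace are assumptions; substitute the values from your installation:

```shell
# Upgrade (or install) the Alluxio Operator with the FDB sub-operator enabled
helm upgrade --install alluxio-operator ./alluxio-operator \
  -f alluxio-operator.yaml \
  -n alluxio-operator
```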

If upgrading an existing operator installation, manually apply the FDB CRDs (Helm does not install CRDs on upgrade):

```shell
# Idempotent
kubectl apply -f alluxio-operator/charts/fdb-operator/crds/
```

Verify the FDB operator is running:

```shell
kubectl get pods -n alluxio-operator -l app.kubernetes.io/name=fdb-operator
```

**✅ Success:** FDB operator pod shows `READY 1/1`, `STATUS = Running`.

```console
NAME                                          READY   STATUS    RESTARTS   AGE
alluxio-fdb-controller-6cbd5c7c45-xk2pq      1/1     Running   0          2m
```

> If the pod is not found, reinstall the operator with `fdb-operator.enabled=true` in `alluxio-operator.yaml` and re-run `helm upgrade`.

### 2. Enable Write Cache

Add the following to your `alluxio-cluster.yaml`, in addition to the base setup of [S3 API](https://documentation.alluxio.io/ee-ai-en/data-access/s3-api#step-1-enable-the-s3-api):

```yaml
apiVersion: k8s-operator.alluxio.com/v1
kind: AlluxioCluster
spec:
  properties:
    alluxio.worker.s3.api.enabled: "true"
    alluxio.write.cache.enabled: "true"
  fdb:
    enabled: true
```

Apply the configuration:

```shell
# Idempotent
kubectl apply -f alluxio-cluster.yaml
```

**✅ Success:** `kubectl apply` prints no errors and the cluster enters a reconciling state.

> If you see `unable to recognize "alluxio-cluster.yaml": no matches for kind "AlluxioCluster"`, the Alluxio Operator CRDs are not installed — reinstall the operator first.

### 3. Verify Deployment

Wait for workers to be ready (startup typically takes 2–3 minutes):

```shell
kubectl wait --for=condition=Ready pod \
  -l app.kubernetes.io/component=worker \
  -n <NAMESPACE> --timeout=300s
```

**✅ Success:** Output shows all worker pods reached `Ready` condition.

```console
pod/alluxio-cluster-worker-0 condition met
```

Confirm FDB pods are running:

```shell
kubectl get pods -n <NAMESPACE> -l foundationdb.org/fdb-cluster-name=alluxio-cluster-fdb-meta
```

**✅ Success:** `cluster-controller`, `log`, and `storage` pods all show `Running`.

```console
NAME                                                   READY   STATUS    RESTARTS   AGE
alluxio-cluster-fdb-meta-cluster-controller-1-...      2/2     Running   0          2m
alluxio-cluster-fdb-meta-log-1-...                     2/2     Running   0          2m
alluxio-cluster-fdb-meta-storage-1-...                 2/2     Running   0          2m
```

> If FDB pods are stuck in `Pending`, check PVC availability: `kubectl get pvc -n <NAMESPACE>`. FDB requires a StorageClass with dynamic provisioning.

Confirm write cache is active on the coordinator:

```shell
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio conf get alluxio.write.cache.enabled
```

**✅ Success:** Returns `true`.

### 4. Test Write and Read-After-Write

```shell
# Create a test file
echo "write cache test" > /tmp/test.txt

# Write an object via Alluxio S3 API
# Replace <LOAD_BALANCER_ADDRESS> with your load balancer hostname or IP.
# For testing without a load balancer, port-forward instead and use
# http://localhost:29998 as the endpoint:
#   kubectl port-forward -n <NAMESPACE> svc/alluxio-cluster-worker 29998:29998
aws s3 cp /tmp/test.txt s3://<BUCKET>/test.txt \
  --endpoint-url http://<LOAD_BALANCER_ADDRESS>:29998 \
  --no-sign-request
```

**✅ Success:**

```console
upload: /tmp/test.txt to s3://<BUCKET>/test.txt
```

Read back immediately (served from local cache, not UFS):

```shell
aws s3 cp s3://<BUCKET>/test.txt /tmp/verify.txt \
  --endpoint-url http://<LOAD_BALANCER_ADDRESS>:29998 \
  --no-sign-request

diff /tmp/test.txt /tmp/verify.txt
```

**✅ Success:** `diff` produces no output (files are identical).

> If the write returns `NoSuchBucket`: verify the mount is active with `kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- alluxio mount list`.

## Write Policies

By default, Write Cache uses `WRITE_THROUGH` (synchronous write to both cache and UFS). For low-latency writes, switch specific paths to `WRITE_BACK`.

| Policy          | Behavior                                                                           | When to Use                                         |
| --------------- | ---------------------------------------------------------------------------------- | --------------------------------------------------- |
| `WRITE_THROUGH` | (Default) Write to cache and UFS simultaneously. Succeeds only when both complete. | Durability-first; write latency is bounded by UFS   |
| `WRITE_BACK`    | Write to cache immediately, persist to UFS in background.                          | Low-latency writes with eventual durability         |
| `TRANSIENT`     | Cache only — never persisted to UFS.                                               | Temporary/recomputable data (e.g., shuffle outputs) |
| `READ_ONLY`     | Disallow all writes on this path.                                                  | Protect paths from accidental writes                |
| `NO_CACHE`      | Bypass cache; reads and writes go directly to UFS.                                 | Paths that should not be cached                     |

### Path-Level Configuration

Write policies are configured per path, allowing different policies for different workloads within the same cluster.

Edit the policy configuration interactively (run inside the coordinator pod):

```shell
kubectl exec -it -n <NAMESPACE> alluxio-cluster-coordinator-0 -- alluxio pathconfig edit
```

Example configuration — global default `WRITE_THROUGH`, with `WRITE_BACK` for checkpoint paths and `TRANSIENT` for shuffle:

```json
{
  "apiVersion": "v1.0",
  "defaultRule": {
    "description": "Global default",
    "policyMode": "WRITE_THROUGH",
    "properties": {
      "writeReplicas": 1
    }
  },
  "pathRules": [
    {
      "alluxioPath": "/checkpoints/**",
      "description": "Low-latency checkpoint writes",
      "policyMode": "WRITE_BACK",
      "properties": {
        "writeReplicas": 2
      }
    },
    {
      "alluxioPath": "/shuffle/**",
      "description": "Temporary shuffle data",
      "policyMode": "TRANSIENT",
      "properties": {
        "writeReplicas": 2
      }
    }
  ]
}
```

Verify a path resolves to the expected policy:

```shell
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio pathconfig test --path /checkpoints/epoch-1/model.pt
```

**✅ Success:** Output includes `"policyMode": "WRITE_BACK"`.

Via REST API (for programmatic integration):

```shell
curl -X PUT \
  -H "Content-Type: application/json" \
  -d @pathconfig.json \
  http://<COORDINATOR_HOST>:<COORDINATOR_PORT>/api/v1/conf
```
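
The `pathconfig.json` payload is expected to use the same JSON format as the interactive example above.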

### Multi-Replica Write

For `WRITE_BACK` and `TRANSIENT` paths, setting `writeReplicas > 1` keeps multiple copies of unpersisted data across different workers. This reduces the risk of data loss during the window before UFS persistence completes.

**Trade-off**: Higher replica count improves fault tolerance and read concurrency but increases intra-cluster network usage and write latency slightly.

Recommended settings:

* `WRITE_BACK` — `writeReplicas: 2` for production; `1` for maximum write throughput
* `TRANSIENT` — `writeReplicas: 2` or higher, since this data is never persisted to UFS

## Operations & Tuning

### Key Configuration

| Property                                                     | Default                           | Description                                                                                                                           |
| ------------------------------------------------------------ | --------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
| `alluxio.write.cache.enabled`                                | `false`                           | Enables Write Cache.                                                                                                                  |
| `alluxio.foundationdb.cluster.file.path`                     | `${alluxio.conf.dir}/fdb.cluster` | Path to the FDB cluster file. Auto-injected when FDB is deployed via Operator; set manually for external FDB.                         |
| `alluxio.write.cache.async.persist.thread.pool.size`         | `16`                              | Async persistence thread concurrency per Worker. Increase if persistence falls behind write traffic. Effective only for `WRITE_BACK`. |
| `alluxio.write.cache.async.check.orphan.timeout`             | `1h`                              | Uncommitted writes older than this threshold are treated as abandoned and cleaned up.                                                 |
| `alluxio.write.cache.async.file.check.period`                | `10min`                           | Scan interval for orphan detection. Shorter intervals increase FDB load.                                                              |
| `alluxio.worker.page.store.pinned.file.capacity.limit.ratio` | `0.3`                             | Maximum fraction of cache capacity for unpersisted (pinned) data. The remaining capacity is available for read cache (LRU-evictable). |
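
These properties go in the `properties` block of `alluxio-cluster.yaml`, just like the enablement flags. A sketch with illustrative values (not recommendations; tune to your workload):

```yaml
spec:
  properties:
    alluxio.write.cache.enabled: "true"
    # Illustrative values only
    alluxio.write.cache.async.persist.thread.pool.size: "32"
    alluxio.worker.page.store.pinned.file.capacity.limit.ratio: "0.4"
```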

### Async Persistence Retry

For `WRITE_BACK` paths, failed UFS uploads are retried with exponential backoff. Retries run in background threads and do not block front-end write acknowledgments.

| Property                                                          | Default | Description                                   |
| ----------------------------------------------------------------- | ------- | --------------------------------------------- |
| `alluxio.worker.write.cache.async.persist.retry.initial.interval` | `1s`    | Initial retry wait.                           |
| `alluxio.worker.write.cache.async.persist.retry.max.interval`     | `1h`    | Maximum retry wait (caps exponential growth). |
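
Assuming a doubling backoff, the defaults produce retry waits of roughly 1 s, 2 s, 4 s, 8 s, and so on, capped at 1 h between attempts.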

### Cache Space Management

Worker cache space is divided into two logical regions:

* **Pinned space** (write cache): unpersisted dirty data — not evictable. Capped at 30% of total capacity by default (`alluxio.worker.page.store.pinned.file.capacity.limit.ratio`).
* **Evictable space** (read cache): persisted or UFS-loaded data — evicted LRU when space is needed.
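
For example, a worker with 2 TiB of NVMe cache and the default ratio of `0.3` can pin at most about 600 GiB of unpersisted data, leaving roughly 1.4 TiB for the evictable read cache.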

If persistence throughput falls behind write traffic, pinned space fills up and Alluxio returns `out-of-space` errors. To prevent this:

* Ensure `alluxio.write.cache.async.persist.thread.pool.size` is sufficient for your write rate
* Monitor pinned space usage and adjust `alluxio.worker.page.store.pinned.file.capacity.limit.ratio` if needed
* Allocate adequate NVMe capacity for both regions

## Performance Reference

Reference numbers for `WRITE_BACK` on AWS `c5n.metal` clients + `i3en.metal` workers (100 Gbps network, NVMe SSD). Actual results vary by hardware, object size, and concurrency.

| Workload                                     | Write Cache                  | Direct S3            |
| -------------------------------------------- | ---------------------------- | -------------------- |
| Small object PUT (10 KB), low concurrency    | 3–5 ms                       | 30–60 ms             |
| Small object PUT (10 KB), medium concurrency | 4–9 ms                       | 30–60 ms             |
| Large object PUT (10 MB), single worker      | 3–6 GB/s sustained           | Variable (throttled) |
| GET after write (read-after-write latency)   | 3–7 ms                       | 90–130 ms            |
| Async persistence throughput                 | \~2,000 objects/s per worker | —                    |

Front-end write latency for `WRITE_BACK` is bounded by **local NVMe**, not UFS. Throughput scales near-linearly with additional workers.
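
To get a rough front-end latency number in your own environment, you can time a small `PUT` once the deployment-step test passes. This is a quick sketch, not a benchmark; the endpoint details match the test step above:

```shell
# Create a 10 KB object and time the PUT against the Alluxio S3 endpoint
dd if=/dev/urandom of=/tmp/obj-10k bs=1024 count=10
time aws s3 cp /tmp/obj-10k s3://<BUCKET>/obj-10k \
  --endpoint-url http://<LOAD_BALANCER_ADDRESS>:29998 \
  --no-sign-request
```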

## Uninstall

To remove the Write Cache configuration and FDB resources (reverse order of setup):

**1. Delete the AlluxioCluster** (removes workers, coordinator, and FDB pods):

```shell
kubectl delete -f alluxio-cluster.yaml
```

**2. Verify all Alluxio pods are removed:**

```shell
kubectl get pods -n <NAMESPACE>
```

**✅ Success:** `No resources found in <NAMESPACE> namespace.`

To disable Write Cache without deleting the cluster, set `alluxio.write.cache.enabled: "false"` in `alluxio-cluster.yaml`. The relevant fragment of the cluster spec shown earlier becomes:
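
```yaml
spec:
  properties:
    alluxio.write.cache.enabled: "false"
```

Then re-apply: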

```shell
kubectl apply -f alluxio-cluster.yaml  # Idempotent
```

## Troubleshooting

**FDB connection failure on startup** — FDB pods are not reachable from the workers.

```shell
kubectl get pods -n <NAMESPACE> -l foundationdb.org/fdb-cluster-name=alluxio-cluster-fdb-meta
```

Verify `alluxio.foundationdb.cluster.file.path` points to a valid FDB cluster file. When deployed via Operator, this is auto-injected.

***

**FDB operator OOM / high memory usage** — `globalMode.enabled: true` (the default) causes the FDB operator to watch all Pods, PVCs, ConfigMaps, and Services across the entire Kubernetes cluster, which can drive memory usage to several GB in large clusters.

Fix: move the Alluxio Operator, FDB Operator, and AlluxioCluster into the **same namespace**, set `globalMode.enabled: false` in `alluxio-operator.yaml`, and restart the FDB operator pod.
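
A sketch of the relevant fragment of `alluxio-operator.yaml`, assuming `globalMode` is exposed under the `fdb-operator` block of the chart values (verify the key path against your chart version):

```yaml
fdb-operator:
  enabled: true
  globalMode:
    enabled: false   # restrict the FDB operator to its own namespace
```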

***

**Out-of-space errors on write** — pinned (unpersisted) data has filled the write cache.

Fix: increase `alluxio.worker.page.store.pinned.file.capacity.limit.ratio`, add NVMe capacity, or increase `alluxio.write.cache.async.persist.thread.pool.size` so persistence keeps up with writes.

***

**WRITE\_BACK data not appearing in UFS** — verify async persistence threads are running:

```shell
kubectl logs -n <NAMESPACE> -l app.kubernetes.io/component=worker --tail=100 | grep -i persist
```

Also check `alluxio.worker.write.cache.async.persist.retry.max.interval` — if UFS is unreachable, retries may be in a long backoff cycle.

***

**Orphan files accumulating** — uncommitted writes left by crashed clients. Reduce `alluxio.write.cache.async.check.orphan.timeout` to clean them up faster, or run:

```shell
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- alluxio job free
```

***

**pathconfig not taking effect** — verify the policy resolved correctly:

```shell
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio pathconfig test --path <YOUR_PATH>
```

If the path still shows the old policy, check coordinator logs for config reload activity.
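
For example, a quick scan of the recent coordinator log (the `pathconfig` and `reload` grep terms are guesses at the relevant log lines, not documented output):

```shell
kubectl logs -n <NAMESPACE> alluxio-cluster-coordinator-0 --tail=200 | grep -iE "pathconfig|reload"
```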

***

**Data not evicted after persistence** — eviction only triggers when cache is under pressure. To proactively free space:

```shell
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- alluxio job free
```

## See Also

* [S3 API](https://documentation.alluxio.io/ee-ai-en/data-access/s3-api) — base endpoint, auth, load balancer, and client compatibility (required before enabling Write Cache)
* [S3 UFS Integration](https://documentation.alluxio.io/ee-ai-en/ufs/s3) — tuning the underlying S3 persistence layer (upload threads, multipart settings)
* [Benchmarking S3 API Performance](https://documentation.alluxio.io/ee-ai-en/benchmark/benchmarking-s3-api-performance) — performance baselines and tuning for S3 API workloads
