# POSIX API Benchmarks

Fio (Flexible I/O Tester) is a powerful open-source tool for benchmarking storage systems. Because Alluxio can be mounted as a POSIX-compliant filesystem using FUSE, `fio` is an excellent tool for measuring its read/write IOPS and throughput.

**How to read this guide:**

* **Single-node evaluation** — start at [Prerequisites](#prerequisites) and stop at [Cleanup](#cleanup). Covers one FUSE client against one Alluxio worker.
* **Multi-node / large-cluster tuning** — after completing the single-node flow, continue into [Advanced: Scale-out Benchmarking](#advanced-scale-out-benchmarking). Covers the two file-access patterns, `fio --server/--client`, multi-worker scaling, and 200 Gbps reference numbers.

## Before You Start

The client node is the machine (or pod) where `fio` runs and the Alluxio FUSE filesystem is mounted. In Kubernetes deployments this is an application pod that mounts the FUSE PVC — see [POSIX API](/ee-ai-en/data-access/fuse-based-posix-api.md) for pod setup. Confirm all of the items below before running any benchmark command.

* [ ] **`fio` is installed** on the client node:

  ```shell
  sudo apt-get install fio   # Debian/Ubuntu
  sudo yum install fio       # RHEL/CentOS (or: sudo dnf install fio)
  ```
* [ ] **FUSE mount is accessible.** The examples below use `/mnt/alluxio` as the mount point and `/s3` as the Alluxio path for the backing store — adjust to match your deployment. Verify:

  ```shell
  ls /mnt/alluxio/s3/
  ```
* [ ] **Paths are consistent** between `fio --filename` and any `alluxio job load/free --path` commands. The Alluxio path `/s3/testdata` maps to `/mnt/alluxio/s3/testdata` on the FUSE mount.
* [ ] **Know the actual bandwidth between your nodes.** Run `iperf3` between every pair of FUSE and worker nodes before the sweep (a quick check is sketched below). Numbers noticeably below NIC line rate will cap benchmark throughput regardless of Alluxio tuning. On AWS with ≤3 nodes, use a cluster placement group to avoid \~20% bandwidth loss; on larger clusters (10+), modern VPC spine bandwidth within a single AZ is usually sufficient.
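
A minimal `iperf3` check, assuming `iperf3` is installed on both ends (the IP placeholder, stream count, and duration are illustrative):

```shell
# On each Alluxio worker node, start a server:
iperf3 -s

# From the FUSE client node, test toward each worker with 8 parallel streams for 30 s:
iperf3 -c <WORKER_PRIVATE_IP> -P 8 -t 30
```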

## Preparing for the Benchmark

### Generating Test Data

Use `fio` itself to write the test file through the FUSE mount. This persists the file to UFS. Note that Alluxio's default write type is `THROUGH` (no cache on write), so this step does **not** populate the Alluxio cache — run the `alluxio job load` step in [Preparing Cache State](#preparing-cache-state) before any warm-cache read benchmark.

```shell
fio --ioengine=libaio --direct=1 --group_reporting --runtime=120 \
    --rw=write --bs=1M --numjobs=16 --size=100G \
    --filename=/mnt/alluxio/s3/testdata/100gb --name=data_prep
```
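
To confirm the write-through completed, list the file on the mount — the size should match `--size`:

```shell
ls -lh /mnt/alluxio/s3/testdata/100gb
```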

### Preparing Cache State

Before each read benchmark, control the cache state to isolate what you are measuring.

{% hint style="warning" %}
**Always drop the FUSE-side kernel page cache before every read run.**

```shell
echo 3 | sudo tee /proc/sys/vm/drop_caches
```

Run this on the client node before each warm-cache and cold-cache benchmark below.
{% endhint %}

**Warm cache** (measures Alluxio cache hit performance):

Because fio writes go through to UFS without populating the Alluxio cache, you must explicitly load the data. Submit a load job and poll `--progress` until the Job State reports `SUCCEEDED` (typically takes seconds to minutes depending on dataset size):

{% tabs %}
{% tab title="Kubernetes (Operator)" %}

```shell
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job load --path /s3/testdata --submit
# Poll until SUCCEEDED (re-run this command every ~10s)
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job load --path /s3/testdata --progress
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}

```shell
bin/alluxio job load --path /s3/testdata --submit
# Poll until SUCCEEDED (re-run this command every ~10s)
bin/alluxio job load --path /s3/testdata --progress
```

{% endtab %}
{% endtabs %}
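
If you prefer not to re-run the progress command by hand, a minimal polling loop (bare-metal form shown; prefix with the `kubectl exec` wrapper from the Operator tab as needed) that assumes the `--progress` output contains `SUCCEEDED` once the job completes:

```shell
until bin/alluxio job load --path /s3/testdata --progress | grep -q "SUCCEEDED"; do
  sleep 10
done
echo "cache load complete"
```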

**Cold cache** (measures backend fetch + cache-fill performance):

{% tabs %}
{% tab title="Kubernetes (Operator)" %}

```shell
# Free the Alluxio cache entry (runs against the coordinator)
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job free --path /s3/testdata --submit
# Drop the kernel page cache on the FUSE client node (not the coordinator)
echo 3 | sudo tee /proc/sys/vm/drop_caches
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}

```shell
# Free the Alluxio cache entry
bin/alluxio job free --path /s3/testdata --submit
# Drop the kernel page cache on the client node
echo 3 | sudo tee /proc/sys/vm/drop_caches
```

{% endtab %}
{% endtabs %}

## Parameters and Test Design

### Fixed Parameters

Keep these constant across a sweep so results are comparable.

* **`ioengine`** (recommended: `io_uring`, or `libaio` as fallback) — Linux async I/O backend. `io_uring` on kernel 5.1+ delivers higher throughput and lower latency.
* **`direct`** (recommended: `1`) — requests `O_DIRECT` at open time to bypass the client's page cache.
* **`group_reporting`** (recommended: always set) — aggregates per-thread results into a single summary.
* **`runtime`** (recommended: `60`) — 60 seconds per step is long enough for steady-state without stretching sweeps too long.

### Control Parameters

Adjust these based on what you are measuring.

* **`rw`** — I/O pattern. `read` for sequential throughput; `randread` for random IOPS.
* **`bs`** — block size. `256K` to `1M` for throughput tests, `4k` for IOPS tests. Don't exceed `1M` — oversized blocks (`4M`+) reduce operation count, introduce head-of-line blocking, and mask the stack's real throughput characteristics.
* **`numjobs`** — number of concurrent I/O threads. Start at `1` and sweep upward (1, 2, 4, 8, 16, 32, 64, 128, 256, 512). Stop at the p99-latency knee, not at the numeric BW peak — p99 degrades before throughput does.
* **`iodepth`** — outstanding I/Os per thread. `1` is typical; `32` combined with pinning each thread to its own file is used to maximize single-thread throughput. See [Two File-Access Patterns](#two-file-access-patterns) for the trade-off.
* **`size`** — size of each test file.

### Choosing File Layout

The number of test files is one of the most important design choices — it determines which ceiling you actually measure.

**Single large file** (e.g. one 100 GiB file). Measures per-file throughput, bounded by Alluxio FUSE's per-file serialization. Use for quick sanity checks and establishing the per-file ceiling.

**Multiple files** (e.g. 10–20 files of 5–10 GiB each) — **recommended for headline benchmarks**. Measures the cluster/worker throughput ceiling, with each file served by an independent stream. Peak is typically 10–15% higher than the single-file case.

**Many small files** (e.g. 100+ files). Approximates AI/ML training I/O, where handle management and metadata operations factor in. For peak-throughput measurement, prefer the multi-file option above.

To benchmark against multiple files, create them first with `fio`, then point `fio` at the set via a colon-separated list:

```shell
# Create 20 files of 5 GiB each
for i in $(seq -f "%02g" 0 19); do
  fio --ioengine=libaio --direct=1 --rw=write --bs=1M --numjobs=1 \
      --size=5G --filename=/mnt/alluxio/s3/testdata/file${i} \
      --name=prep_${i}
done

# Read across all 20 files
FILES=$(seq -f "/mnt/alluxio/s3/testdata/file%02g" 0 19 | paste -sd:)
fio --ioengine=libaio --direct=1 --group_reporting --runtime=60 \
    --rw=read --bs=1M --numjobs=64 --size=5G --readonly \
    --filename="$FILES" --name=multi_read
```

`fio` spreads each thread's I/O across the listed files round-robin. If `numjobs` climbs far past the file count, per-file lock contention can return — at that point grow the file set rather than the thread count alone (see [Scaling Across Multiple Workers](#scaling-across-multiple-workers)).

> For multi-worker or multi-client setups, two complementary file-access patterns are worth running — see [Two File-Access Patterns](#two-file-access-patterns) in the Advanced section.

## Benchmark Commands

### Sequential Read Throughput

```shell
fio --ioengine=libaio --direct=1 --group_reporting --runtime=60 \
    --rw=read --bs=1M --numjobs=32 --size=100G --readonly \
    --filename=/mnt/alluxio/s3/testdata/100gb --name=seq_read
```

**✅ Success:** Look at the `BW` (bandwidth) value in the output. With warm cache on the hardware below, expect \~9 GiB/s at 32 threads.

### Random Read IOPS

```shell
fio --ioengine=libaio --direct=1 --group_reporting --runtime=60 \
    --rw=randread --bs=4k --numjobs=32 --size=100G --readonly \
    --filename=/mnt/alluxio/s3/testdata/100gb --name=rand_read
```

**✅ Success:** Look at the `IOPS` value. With warm cache on the hardware below, expect \~70k IOPS at 32 threads.

### Scaling Concurrency

To find the throughput ceiling, run the same command with increasing `--numjobs`:

```shell
for j in 1 4 16 32 64 128; do
  echo "=== numjobs=$j ==="
  fio --ioengine=libaio --direct=1 --group_reporting --runtime=60 \
      --rw=read --bs=1M --numjobs=$j --size=100G --readonly \
      --filename=/mnt/alluxio/s3/testdata/100gb --name=scale_$j
done
```

**Interpreting the scan:**

* If `BW` does not increase when doubling `numjobs`, you have hit a bottleneck (network, disk, or CPU).
* **P99 latency degrades before BW does.** Once BW stops scaling linearly, check `clat 99.00th` — if it has already jumped significantly, you are past the sweet spot. Pick the `numjobs` value at the knee, not at the numeric peak. A JSON-based extraction of these two numbers is sketched below.
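
To make the knee easier to spot, capture each step as JSON and pull out the two numbers that matter. A sketch assuming `jq` is installed; the field names (`jobs[0].read.bw` in KiB/s, `clat_ns.percentile`) match recent fio versions but are worth verifying against your build:

```shell
for j in 1 4 16 32 64 128; do
  fio --ioengine=libaio --direct=1 --group_reporting --runtime=60 \
      --rw=read --bs=1M --numjobs=$j --size=100G --readonly \
      --filename=/mnt/alluxio/s3/testdata/100gb --name=scale_$j \
      --output-format=json --output=scale_${j}.json
  # Print: numjobs, read BW in MiB/s, p99 completion latency in ms
  jq -r --arg j "$j" \
     '[$j, (.jobs[0].read.bw/1024|round), (.jobs[0].read.clat_ns.percentile."99.000000"/1e6)] | @tsv' \
     scale_${j}.json
done
```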

## Interpreting Results

A typical `fio` output looks like:

```console
seq_read: (groupid=0, jobs=32): err= 0: pid=12345: ...
  read: IOPS=36.2k, BW=9056MiB/s (9496MB/s)(531GiB/60001msec)
    clat (usec): min=12, max=8934, avg=879.23, stdev=312.45
     lat (usec): min=12, max=8935, avg=879.89, stdev=312.51
    clat percentiles (usec):
     | 50.00th=[  816], 90.00th=[ 1254], 95.00th=[ 1434],
     | 99.00th=[ 2008], 99.90th=[ 3556]
```

How to read it:

| Field          | Meaning                                                    | What to check                                                  |
| -------------- | ---------------------------------------------------------- | -------------------------------------------------------------- |
| `BW`           | Aggregated throughput across all threads                   | Primary metric for throughput tests                            |
| `IOPS`         | I/O operations per second                                  | Primary metric for IOPS tests                                  |
| `clat avg`     | Average completion latency                                 | Lower is better; high values suggest contention                |
| `clat 99.00th` | P99 tail latency                                           | Spikes here indicate inconsistent performance                  |
| CPU % per GB/s | CPU efficiency of the FUSE stack (compute `top` CPU% ÷ BW) | High values indicate FUSE thread or JVM overhead on the worker |
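
For example, if `top` shows the fio and FUSE processes consuming a combined \~800% CPU (8 cores) while fio reports 8 GiB/s, that works out to \~100% per GiB/s — roughly one core consumed per GiB/s delivered. Ratios far above that point at FUSE thread or JVM overhead.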

**Rules of thumb:**

* If `clat P99` is >10x the `avg`, investigate tail latency sources (GC pauses, UFS fallback, FUSE thread starvation).
* Cold cache throughput is typically 10-50x lower than warm cache. If the gap is smaller, the "warm" test may be hitting UFS due to insufficient cache loading.

## Baseline Results

The reference numbers below are for a single-node NVMe setup. For high-throughput multi-node results on 200 Gbps instances, see [200 Gbps DRAM Scale-out Baseline](#id-200-gbps-dram-scale-out-baseline) in the Advanced section.

Only read results are shown. Alluxio's default write-through policy bypasses the cache on writes, so write throughput measured by `fio` reflects UFS performance rather than Alluxio cache performance. To benchmark Alluxio-accelerated writes, enable and tune the write cache first — see [S3 Write Cache](/ee-ai-en/performance/s3-write-cache.md).

### Single-node NVMe Baseline

Collected with:

* **Alluxio Worker Node**: AWS `i3en.metal`, 8 NVMe SSDs in RAID 0, Ubuntu 24.04
* **Alluxio Client Node**: AWS `c5n.metal`, Ubuntu 24.04, FUSE 3.16.2+
* Single worker, single client, warm cache, single 100 GiB file, all instances in the same AWS availability zone.

#### Throughput (1M block size)

| Bandwidth/Threads | Single Thread | 32 Threads | 128 Threads |
| ----------------- | ------------- | ---------- | ----------- |
| Sequential Read   | 2101 MiB/s    | 9519 MiB/s | 8089 MiB/s  |
| Random Read       | 202 MiB/s     | 6684 MiB/s | 8276 MiB/s  |

#### IOPS (4k block size)

| IOPS/Threads    | Single Thread | 32 Threads | 128 Threads |
| --------------- | ------------- | ---------- | ----------- |
| Sequential Read | 55.9k         | 253k       | 192k        |
| Random Read     | 2.3k          | 70.1k      | 162k        |

{% hint style="info" %}
Throughput dropping at 128 threads (sequential read) indicates CPU saturation on the client node. This is expected behavior — the optimal concurrency depends on your hardware.
{% endhint %}
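
To confirm where the saturation point sits on your own hardware, watch per-core utilization on the client during the highest-concurrency step — a quick sketch, assuming the `sysstat` package is available for `mpstat`:

```shell
# Per-core utilization, refreshed every 5 seconds
mpstat -P ALL 5
# Or a per-thread view of fio and the FUSE process
top -H
```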

## Troubleshooting

Most read-path issues that show up during a benchmark — UFS fallback, cache not warm, FUSE mount-option tuning, tail-latency spikes — are generic FUSE concerns rather than benchmark-specific. The authoritative troubleshooting lives in the POSIX API doc:

* [POSIX API → Troubleshooting → Slow read performance](/ee-ai-en/data-access/fuse-based-posix-api.md#slow-read-performance)
* [POSIX API → Performance → Diagnosing Where You Are Bottlenecked](/ee-ai-en/data-access/fuse-based-posix-api.md#diagnosing-where-you-are-bottlenecked)

### Throughput does not scale with `numjobs`

**Likely cause:** Network or CPU bottleneck on the client host.

**How to diagnose:** See [POSIX API → Performance → Diagnosing Where You Are Bottlenecked](/ee-ai-en/data-access/fuse-based-posix-api.md#diagnosing-where-you-are-bottlenecked) for the symptom-to-layer mapping.

## Cleanup

Remove the test data and free the cache entry after benchmarking. The glob below covers both the single-file layout (`100gb`) and the multi-file layout (`file00`..`file19`).

{% tabs %}
{% tab title="Kubernetes (Operator)" %}

```shell
# Delete the test data through FUSE
kubectl exec -it <client-pod> -- sh -c 'rm /mnt/alluxio/s3/testdata/*'

# Free the cache entry
kubectl exec -n <NAMESPACE> alluxio-cluster-coordinator-0 -- \
  alluxio job free --path /s3/testdata --submit
```

{% endtab %}

{% tab title="Docker / Bare-Metal" %}

```shell
# Delete the test data through FUSE
rm /mnt/alluxio/s3/testdata/*

# Free the cache entry
bin/alluxio job free --path /s3/testdata --submit
```

{% endtab %}
{% endtabs %}

## Advanced: Scale-out Benchmarking

This section covers what changes when you go from "one FUSE client, one worker" to "multiple workers, multiple clients". Complete the single-node flow above before continuing.

### Two File-Access Patterns

Two complementary ways to drive concurrency, each representing a different real-world access pattern. Run both and take the higher peak as the cluster's headline throughput ceiling. Whichever pattern matches your production workload is the one to optimize against.

* **Shared-files pattern** — many threads read the same set of files concurrently. Matches workloads where files are shared across clients or data locality is unknown (multi-tenant shared reads).
* **Distinct-files pattern** — each thread (or small thread group) is pinned to its own file with a deep I/O queue. Matches sharded workloads — a canonical example is an AI training data pipeline where each GPU worker reads its own data shard.

| Pattern            | `iodepth` | File distribution                    | Peak `numjobs`  |
| ------------------ | --------- | ------------------------------------ | --------------- |
| **Shared-files**   | `1`       | All threads read all files           | Higher (32–256) |
| **Distinct-files** | `32`      | Each file served by one thread group | Lower (4–16)    |

With `iodepth=1` each thread issues one outstanding I/O at a time; concurrency scales by adding threads, so the shared-files pattern reaches peak at high `numjobs`. With `iodepth=32` a single thread pinned to a file can already saturate a large fraction of network bandwidth; the distinct-files pattern reaches peak at low `numjobs`, and beyond that point compounding queue depth drives latency up sharply without adding throughput.

**Shared-files command** — one fio section, all threads share the same colon-separated file list:

```shell
FILES=$(seq -f "/mnt/alluxio/s3/testdata/file%02g" 0 19 | paste -sd:)
fio --ioengine=io_uring --direct=1 --iodepth=1 --group_reporting \
    --runtime=60 --time_based --rw=read --bs=256k --readonly \
    --numjobs=256 --filename="$FILES" --name=shared
```

**Distinct-files command** — use a multi-section fio job file so each file is owned by a distinct section. Generate the job file based on how `numjobs` compares to file count:

* **`numjobs ≤ num_files`** → `numjobs` sections, each with `numjobs=1` and files assigned round-robin.
* **`numjobs > num_files`** → `num_files` sections, each with `numjobs=numjobs/num_files` pinned to one file.

Example generator for the common case (`nj ≤ num_files`):

```python
num_files, nj = 20, 8
base_dir = "/mnt/alluxio/s3/testdata"
all_files = [f"{base_dir}/file{i:02d}" for i in range(num_files)]
for s in range(nj):
    files = ":".join(all_files[s::nj])        # round-robin stripe
    print(f"[section-{s}]\nioengine=io_uring\ndirect=1\niodepth=32")
    print(f"rw=read\nbs=256k\nnumjobs=1\nruntime=60\ntime_based")
    print(f"group_reporting\nfilename={files}\n")
```
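
The generator writes the job file to stdout; a typical invocation (the script name here is illustrative):

```shell
python3 gen_distinct_jobfile.py > /tmp/job.fio
fio --readonly /tmp/job.fio
```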

### Scaling Across Multiple Clients

The typical scale-out progression: start with a single FUSE client, add FUSE client nodes once one client can no longer drive the measured single-worker ceiling, then add Alluxio workers once the existing workers are saturated.

`fio` has a built-in distributed mode. Start `fio --server` on each FUSE client node, then dispatch jobs from a controller:

```shell
# On each FUSE client node — start BEFORE dispatching jobs. Detach with nohup so
# the server keeps running after your SSH session closes.
nohup fio --server </dev/null >/dev/null 2>&1 &

# On the controller node (host_list.txt contains one client PRIVATE IP per line)
fio --client=host_list.txt /tmp/job.fio --readonly --output-format=json
```

Results from all clients are aggregated into a single summary. Parsing note: the JSON from `fio --client` puts per-host results under `client_stats[]` — not `jobs[]` as in single-node mode. Aggregate with `sum(bw)` per client and a `total_ios`-weighted average for latency.
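
A minimal `jq` sketch of that aggregation, assuming the controller output was saved to `result.json` (e.g. via `--output=result.json`) and that the field names (`bw` in KiB/s, `total_ios`, `clat_ns.mean`) match your fio version; some versions append an `All clients` summary entry, which is excluded here to avoid double counting:

```shell
# Total read bandwidth across clients, in MiB/s
jq '[.client_stats[] | select(.jobname != "All clients") | .read.bw] | add / 1024' result.json

# total_ios-weighted mean completion latency, in ms
jq '([.client_stats[] | select(.jobname != "All clients") | .read.total_ios * .read.clat_ns.mean] | add)
    / ([.client_stats[] | select(.jobname != "All clients") | .read.total_ios] | add) / 1e6' result.json
```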

**Three pitfalls with `fio --server` / `--client`:**

* **`pkill fio` will kill your SSH session too** if you are SSH'd in as the same user. Clean up with `pkill -f 'fio --server'` or kill by PID.
* **`--daemonize` mode hangs** in practice. Use `nohup ... &` as shown above.
* **Use private IPs, not public DNS** in `host_list.txt` — public-path routing can cap bandwidth below the VPC spine.
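
With many client nodes it is easier to start the servers in a loop over the same host list — a sketch assuming passwordless SSH to each client:

```shell
while read -r host; do
  # nohup + full redirection lets ssh return immediately and keeps the server alive
  ssh "$host" 'nohup fio --server </dev/null >/dev/null 2>&1 &'
done < host_list.txt
```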

### Scaling Across Multiple Workers

When adding Alluxio workers, expect near-linear throughput scaling at low-to-mid concurrency and progressively lower efficiency as `numjobs` grows. The notation **`xW/yC`** below means *x Alluxio workers × y FUSE client nodes* (e.g., `3W/3C` = 3 workers + 3 clients):

| Topology | Shared-files efficiency                                          | Distinct-files efficiency      |
| -------: | ---------------------------------------------------------------- | ------------------------------ |
|    3W/3C | \~96% across all `numjobs`                                       | \~96% across all `numjobs`     |
|  10W/10C | 99–121% at `numjobs ≤ 16`, **drops to 48–80% at `numjobs ≥ 32`** | \~86–99% across most `numjobs` |

The sharp efficiency drop for the shared-files pattern at 10W/10C high concurrency is FUSE **file-level lock contention**: too many threads on too few shared files. If you are running the shared-files pattern at 10+ workers, **increase the file count to 100+** (from the 20 files used in this guide) to push the ceiling higher. The distinct-files pattern is less sensitive because each thread group owns a distinct file.
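
A sketch of regenerating the dataset at a higher file count — the file size and naming here are illustrative (2 GiB × 100 keeps the total near 200 GiB); size the set to fit your cache tier and reload it with `alluxio job load` before benchmarking:

```shell
# Create 100 files of 2 GiB each
for i in $(seq -f "%03g" 0 99); do
  fio --ioengine=libaio --direct=1 --rw=write --bs=1M --numjobs=1 \
      --size=2G --filename=/mnt/alluxio/s3/testdata/file${i} \
      --name=prep_${i}
done

# Rebuild the colon-separated file list for the read jobs
FILES=$(seq -f "/mnt/alluxio/s3/testdata/file%03g" 0 99 | paste -sd:)
```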

{% hint style="info" %}
**Long-running benchmarks.** A full sweep over both patterns across 1W → 3W → 10W can take hours and will almost certainly outlast your SSH session. Wrap your driver script with `nohup` and `trap '' HUP` and run it in the background so it survives disconnects:

```shell
nohup bash -c 'trap "" HUP; ./run-fio-sweep.sh' > run.log 2>&1 &
```

{% endhint %}

### 200 Gbps DRAM Scale-out Baseline

Collected with:

* **Worker / Client / Coordinator**: AWS `r6in.32xlarge` (128 vCPU, 1024 GiB RAM, 200 Gbps ENA), Ubuntu 22.04, Alluxio AI-3.8-15.1.2
* **Page store**: DRAM on independent tmpfs (300–500 GiB per worker)
* **Dataset**: 20 × 10 GiB files (200 GiB total), S3-backed, pre-loaded via `alluxio job load`
* **FIO**: 256 KiB block size, `io_uring` engine, `direct=1`, 60 s per step

Peak sequential read throughput:

| Topology | Shared-files peak (GiB/s / Gbps) | Distinct-files peak (GiB/s / Gbps) |
| -------: | -------------------------------- | ---------------------------------- |
|    1W/1C | 14.0 / 111.8                     | 14.4 / 115.3                       |
|    3W/3C | 37.6 / 300.4                     | 41.5 / 331.6                       |
|  10W/10C | 89.7 / 717.3                     | **131.3 / 1050.0**                 |

Shared-files peaks at `numjobs=256` (1W/1C, 3W/3C) or `numjobs=32` (10W/10C). Distinct-files peaks at `numjobs=8` across all topologies.

**Expected peak vs NIC bandwidth.** A well-tuned FUSE client with a warm DRAM page store typically peaks at **55–65% of the per-node NIC** — on a 200 Gbps NIC that is roughly 14 GiB/s per client, before adding more clients. If your single-client numbers are substantially below this, check the JVM heap / direct memory sizing and FUSE mount options (`max_idle_threads`, `max_background`) before assuming worker or UFS bottlenecks. The 10W/10C numbers in the table above were measured **without** a cluster placement group — modern VPC spine bandwidth within a single AZ is usually sufficient at that scale. Basic network-placement guidance applies to every run and is covered in [Before You Start](#before-you-start).

## See Also

* [S3 API Benchmarks](/ee-ai-en/benchmark/s3-api.md)
* [Benchmarking ML Training Performance with MLPerf](/ee-ai-en/benchmark/benchmarking-ml-training-performance-with-mlperf.md)
* [POSIX API via FUSE](/ee-ai-en/data-access/fuse-based-posix-api.md) — FUSE mount setup and tuning (including mount options table)
* [fio documentation](https://fio.readthedocs.io/en/latest/fio_doc.html) — full reference for every fio flag and ioengine

