POSIX API Benchmarks

Fio (Flexible I/O Tester) is a powerful open-source tool for benchmarking storage systems. Because Alluxio can be mounted as a POSIX-compliant filesystem using FUSE, fio is an excellent tool for measuring its read/write IOPS and throughput.

How to read this guide:

  • Single-node evaluation — start at Prerequisites and stop at Cleanup. Covers one FUSE client against one Alluxio worker.

  • Multi-node / large-cluster tuning — after completing the single-node flow, continue into Advanced: Scale-out Benchmarking. Covers the two file-access patterns, fio --server/--client, multi-worker scaling, and 200 Gbps reference numbers.

Before You Start

The client node is the machine (or pod) where fio runs and the Alluxio FUSE filesystem is mounted. In Kubernetes deployments this is an application pod that mounts the FUSE PVC — see POSIX API for pod setup. Confirm all of the items below before running any benchmark command.

Preparing for the Benchmark

Generating Test Data

Use fio itself to write the test file through the FUSE mount. This persists the file to UFS. Note that Alluxio's default write type is THROUGH (no cache on write), so this step does not populate the Alluxio cache — run the alluxio job load step in Preparing Cache State before any warm-cache read benchmark.

Preparing Cache State

Before each read benchmark, control the cache state to isolate what you are measuring.

Warm cache (measures Alluxio cache hit performance):

Because fio writes go through to UFS without populating the Alluxio cache, you must explicitly load the data. Submit a load job and poll --progress until the Job State reports SUCCEEDED (typically takes seconds to minutes depending on dataset size):

Cold cache (measures backend fetch + cache-fill performance):

Parameters and Test Design

Fixed Parameters

Keep these constant across a sweep so results are comparable.

  • ioengine (recommended: io_uring, or libaio as fallback) — Linux async I/O backend. io_uring on kernel 5.1+ delivers higher throughput and lower latency.

  • direct (recommended: 1) — requests O_DIRECT at open time to bypass the client's page cache.

  • group_reporting (recommended: always set) — aggregates per-thread results into a single summary.

  • runtime (recommended: 60) — 60 seconds per step is long enough for steady-state without stretching sweeps too long.

Control Parameters

Adjust these based on what you are measuring.

  • rw — I/O pattern. read for sequential throughput; randread for random IOPS.

  • bs — block size. 256K to 1M for throughput tests, 4k for IOPS tests. Don't exceed 1M — oversized blocks (4M+) reduce operation count, introduce head-of-line blocking, and mask the stack's real throughput characteristics.

  • numjobs — number of concurrent I/O threads. Start at 1 and sweep upward (1, 2, 4, 8, 16, 32, 64, 128, 256, 512). Stop at the p99-latency knee, not at the numeric BW peak — p99 degrades before throughput does.

  • iodepth — outstanding I/Os per thread. 1 is typical; 32 combined with pinning each thread to its own file is used to maximize single-thread throughput. See Two File-Access Patterns for the trade-off.

  • size — size of each test file.
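The fixed and control parameters above can be combined into a single fio invocation. The sketch below only assembles the argument list; it does not run fio. The job name `alluxio-bench` and the mount path `/mnt/alluxio/fuse` are illustrative assumptions, not values defined by this guide.

```python
import shlex

# Fixed parameters from the list above; hold these constant across a sweep.
FIXED = {
    "ioengine": "io_uring",  # switch to "libaio" if io_uring is unavailable
    "direct": "1",
    "runtime": "60",
}


def fio_command(filename, rw, bs, numjobs, iodepth=1, size=None):
    """Return the fio argument list for one benchmark step."""
    args = ["fio", "--name=alluxio-bench", f"--filename={filename}"]
    args += [f"--{key}={value}" for key, value in FIXED.items()]
    args.append("--group_reporting")
    args += [f"--rw={rw}", f"--bs={bs}",
             f"--numjobs={numjobs}", f"--iodepth={iodepth}"]
    if size is not None:
        args.append(f"--size={size}")
    return args


# Hypothetical mount point; substitute your own FUSE mount path.
cmd = fio_command("/mnt/alluxio/fuse/100gb", rw="read", bs="1M", numjobs=32)
print(shlex.join(cmd))
```

Keeping the fixed values in one dictionary makes it harder to change them by accident between sweep steps.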

Choosing File Layout

The number of test files is one of the most important design choices — it determines which ceiling you actually measure.

Single large file (e.g. one 100 GiB file). Measures per-file throughput, bounded by Alluxio FUSE's per-file serialization. Use for quick sanity checks and establishing the per-file ceiling.

Multiple files (e.g. 10–20 files of 5–10 GiB each) — recommended for headline benchmarks. Measures the cluster/worker throughput ceiling, with each file served by an independent stream. Peak is typically 10–15% higher than the single-file case.

Many small files (e.g. 100+ files). Approximates AI/ML training I/O, where handle management and metadata operations factor in. For peak-throughput measurement, prefer the multi-file option above.

To benchmark against multiple files, create them first with fio, then point fio at the set via a colon-separated list:

fio distributes jobs across the files round-robin — keep the file count ≥ numjobs to avoid reintroducing per-file lock contention.
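As a rough illustration of the colon-separated list and the file-count rule, the helper below builds the `--filename` value and reports whether there are enough files for a given numjobs. The directory is a hypothetical path; the `file00`..`file19` naming follows the multi-file layout used in this guide.

```python
def filename_arg(directory, count):
    """Build fio's colon-separated --filename value for `count` test files."""
    paths = [f"{directory}/file{i:02d}" for i in range(count)]
    # fio treats ':' as the separator, so the paths themselves must not contain it.
    if any(":" in p for p in paths):
        raise ValueError("paths must not contain ':'")
    return "--filename=" + ":".join(paths)


def enough_files(count, numjobs):
    """True when every job can be given its own file (file count >= numjobs)."""
    return count >= numjobs


# Hypothetical directory under the FUSE mount.
arg = filename_arg("/mnt/alluxio/fuse/bench", 20)
print(arg)
print(enough_files(20, 16), enough_files(20, 32))
```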

For multi-worker or multi-client setups, two complementary file-access patterns are worth running — see Two File-Access Patterns in the Advanced section.

Benchmark Commands

Sequential Read Throughput

✅ Success: Look at the BW (bandwidth) value in the output. With warm cache on the hardware below, expect ~9 GiB/s at 32 threads.

Random Read IOPS

✅ Success: Look at the IOPS value. With warm cache on the hardware below, expect ~70k IOPS at 32 threads.

Scaling Concurrency

To find the throughput ceiling, run the same command with increasing --numjobs:

Interpreting the scan:

  • If BW does not increase when doubling numjobs, you have hit a bottleneck (network, disk, or CPU).

  • P99 latency degrades before BW does. Once BW stops scaling linearly, check clat 99.00th — if it has already jumped significantly, you are past the sweet spot. Pick the numjobs value at the knee, not at the numeric peak.
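One possible way to pick the knee programmatically is sketched below. It walks the sweep in order and stops at the last numjobs value before P99 latency jumps by more than a chosen factor in a single step. The 2x threshold and the sample numbers are assumptions for illustration, not measurements from this guide.

```python
def pick_knee(sweep, p99_jump=2.0):
    """Return the last numjobs before P99 grows by more than `p99_jump`x.

    sweep: list of (numjobs, bw_mib_s, p99_us) tuples sorted by numjobs.
    """
    knee = sweep[0][0]
    for (_, _, prev_p99), (numjobs, _, p99) in zip(sweep, sweep[1:]):
        if p99 > prev_p99 * p99_jump:
            break  # latency has jumped; the previous step is the knee
        knee = numjobs
    return knee


# Hypothetical sweep results: (numjobs, BW in MiB/s, clat 99.00th in microseconds).
sample = [
    (1, 2000, 600),
    (2, 3800, 620),
    (4, 6500, 700),
    (8, 8800, 900),
    (16, 9400, 2500),
    (32, 9500, 6000),
]
print(pick_knee(sample))
```

In this sample the numeric BW peak is at 32 jobs, but P99 has already jumped at 16, so the function selects 8.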

Interpreting Results

A typical fio output looks like:

How to read it:

| Field | Meaning | What to check |
| --- | --- | --- |
| BW | Aggregated throughput across all threads | Primary metric for throughput tests |
| IOPS | I/O operations per second | Primary metric for IOPS tests |
| clat avg | Average completion latency | Lower is better; high values suggest contention |
| clat 99.00th | P99 tail latency | Spikes here indicate inconsistent performance |
| CPU % per GB/s | CPU efficiency of the FUSE stack (compute top CPU% ÷ BW) | High values indicate FUSE thread or JVM overhead on the worker |

Rules of thumb:

  • If clat P99 is >10x the avg, investigate tail latency sources (GC pauses, UFS fallback, FUSE thread starvation).

  • Cold cache throughput is typically 10-50x lower than warm cache. If the gap is smaller, the "warm" test may be hitting UFS due to insufficient cache loading.
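The two rules of thumb, plus the CPU % per GB/s metric from the table, reduce to simple ratios. The sketch below applies them to one run; the function names and the sample values are hypothetical.

```python
def check_run(clat_avg_us, clat_p99_us, warm_bw, cold_bw):
    """Apply the two rules of thumb and return a list of findings."""
    findings = []
    if clat_p99_us > 10 * clat_avg_us:
        findings.append("P99 is more than 10x avg: investigate tail latency")
    if warm_bw / cold_bw < 10:
        findings.append("warm/cold gap under 10x: warm run may be hitting UFS")
    return findings


def cpu_pct_per_gb_s(cpu_percent, bw_gb_s):
    """CPU efficiency: top CPU% divided by bandwidth in GB/s."""
    return cpu_percent / bw_gb_s


# Hypothetical run: 300 us average, 4500 us P99, 9000 MiB/s warm, 1500 MiB/s cold.
for finding in check_run(300, 4500, 9000, 1500):
    print(finding)
print(cpu_pct_per_gb_s(800, 10))
```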

Baseline Results

The reference numbers below are for a single-node NVMe setup. For high-throughput multi-node results on 200 Gbps instances, see 200 Gbps DRAM Scale-out Baseline in the Advanced section.

Only read results are shown. Alluxio's default write-through policy bypasses the cache on writes, so write throughput measured by fio reflects UFS performance rather than Alluxio cache performance. To benchmark Alluxio-accelerated writes, enable and tune the write cache first — see S3 Write Cache.

Single-node NVMe Baseline

Collected with:

  • Alluxio Worker Node: AWS i3en.metal, 8 NVMe SSDs in RAID 0, Ubuntu 24.04

  • Alluxio Client Node: AWS c5n.metal, Ubuntu 24.04, FUSE 3.16.2+

  • Single worker, single client, warm cache, single 100 GiB file, all instances in the same AWS availability zone.

Throughput (1M block size)

| Access pattern | Single Thread | 32 Threads | 128 Threads |
| --- | --- | --- | --- |
| Sequential Read | 2101 MiB/s | 9519 MiB/s | 8089 MiB/s |
| Random Read | 202 MiB/s | 6684 MiB/s | 8276 MiB/s |

IOPS (4k block size)

| Access pattern | Single Thread | 32 Threads | 128 Threads |
| --- | --- | --- | --- |
| Sequential Read | 55.9k | 253k | 192k |
| Random Read | 2.3k | 70.1k | 162k |

Throughput dropping at 128 threads (sequential read) indicates CPU saturation on the client node. This is expected behavior — the optimal concurrency depends on your hardware.

Troubleshooting

Most read-path issues that show up during a benchmark (UFS fallback, cache not warm, FUSE mount-option tuning, tail-latency spikes) are generic FUSE concerns rather than benchmark-specific ones. The authoritative troubleshooting lives in the POSIX API doc.

Throughput does not scale with numjobs

Likely cause: Network or CPU bottleneck on the client host.

How to diagnose: See POSIX API → Performance → Diagnosing Where You Are Bottlenecked for the symptom-to-layer mapping.

Cleanup

Remove the test data and free the cache entry after benchmarking. The glob below covers both the single-file layout (100gb) and the multi-file layout (file00..file19).

Advanced: Scale-out Benchmarking

This section covers what changes when you go from "one FUSE client, one worker" to "multiple workers, multiple clients". Complete the single-node flow above before continuing.

Two File-Access Patterns

There are two complementary ways to drive concurrency, each representing a different real-world access pattern. Run both and take the higher peak as the cluster's headline throughput ceiling. Whichever pattern matches your production workload is the one to optimize against.

  • Shared-files pattern — many threads read the same set of files concurrently. Matches workloads where files are shared across clients or data locality is unknown (multi-tenant shared reads).

  • Distinct-files pattern — each thread (or small thread group) is pinned to its own file with a deep I/O queue. Matches sharded workloads — a canonical example is an AI training data pipeline where each GPU worker reads its own data shard.

| Pattern | iodepth | File distribution | Peak numjobs |
| --- | --- | --- | --- |
| Shared-files | 1 | All threads read all files | Higher (32–256) |
| Distinct-files | 32 | Each file served by one thread group | Lower (4–16) |

With iodepth=1 each thread issues one outstanding I/O at a time; concurrency scales by adding threads, so the shared-files pattern reaches peak at high numjobs. With iodepth=32 a single thread pinned to a file can already saturate a large fraction of network bandwidth; the distinct-files pattern reaches peak at low numjobs, and beyond that point compounding queue depth drives latency up sharply without adding throughput.

Shared-files command — one fio section, all threads share the same colon-separated file list:

Distinct-files command — use a multi-section fio job file so each file is owned by a distinct section. Generate the job file based on how numjobs compares to file count:

  • numjobs ≤ num_files: numjobs sections, each with numjobs=1 and files assigned round-robin.

  • numjobs > num_files: num_files sections, each with numjobs=numjobs/num_files pinned to one file.

Example generator for the common case (numjobs ≤ num_files):
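A minimal sketch of such a generator is shown below. It emits fio job-file text and, beyond the common case, also handles numjobs > num_files using the second rule. The section names (`shard0`, `shard1`, ...) and the file paths are assumptions; the global settings mirror the parameters recommended in this guide.

```python
def distinct_files_job(files, numjobs, bs="256k", iodepth=32, runtime=60):
    """Return fio job-file text in which each file is owned by one section."""
    lines = [
        "[global]", "ioengine=io_uring", "direct=1", "rw=read",
        f"bs={bs}", f"iodepth={iodepth}", f"runtime={runtime}",
        "group_reporting", "",
    ]
    if numjobs <= len(files):
        # One single-job section per requested job; files dealt round-robin.
        sections = [(1, files[i::numjobs]) for i in range(numjobs)]
    else:
        # One section per file; integer division rounds the job count down.
        sections = [(numjobs // len(files), [f]) for f in files]
    for idx, (jobs, owned) in enumerate(sections):
        lines += [f"[shard{idx}]", f"numjobs={jobs}",
                  "filename=" + ":".join(owned), ""]
    return "\n".join(lines)


# Hypothetical file set under the FUSE mount.
files = [f"/mnt/alluxio/fuse/bench/file{i:02d}" for i in range(20)]
job8 = distinct_files_job(files, 8)
job40 = distinct_files_job(files, 40)
print(job8)
```

Write the returned text to a file and pass that file to fio as the job file. When numjobs is not a multiple of the file count, the second branch rounds down, so the total thread count can be lower than requested.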

Scaling Across Multiple Clients

The typical scale-out progression: a single FUSE client first, then additional FUSE clients to push the cluster to saturation, then additional Alluxio workers once one worker's ceiling is reached. Move to multiple FUSE client nodes when one client can no longer drive the measured single-worker ceiling.

fio has a built-in distributed mode. Start fio --server on each FUSE client node, then dispatch jobs from a controller:

Results from all clients are aggregated into a single summary. Parsing note: the JSON from fio --client puts per-host results under client_stats[] — not jobs[] as in single-node mode. Aggregate with sum(bw) per client and a total_ios-weighted average for latency.
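The aggregation described in the parsing note might look like the sketch below. The field names (`bw` in KiB/s, `total_ios`, `clat_ns.mean`) and the roll-up entry named "All clients" reflect fio 3.x JSON output as recalled here; treat them as assumptions and confirm them against your own fio output before relying on the numbers.

```python
def aggregate_clients(report, direction="read"):
    """Sum bandwidth and compute a total_ios-weighted mean latency."""
    total_bw_kib = 0
    total_ios = 0
    weighted_lat_ns = 0.0
    for entry in report["client_stats"]:
        # Assumed: fio may append its own roll-up entry; skip it to avoid
        # counting every client twice.
        if entry.get("jobname") == "All clients":
            continue
        stats = entry[direction]
        total_bw_kib += stats["bw"]
        total_ios += stats["total_ios"]
        weighted_lat_ns += stats["clat_ns"]["mean"] * stats["total_ios"]
    mean_clat_ns = weighted_lat_ns / total_ios if total_ios else 0.0
    return {"bw_kib_s": total_bw_kib, "mean_clat_ns": mean_clat_ns}


# Hypothetical two-client report, trimmed to the fields used above.
report = {"client_stats": [
    {"jobname": "bench", "hostname": "10.0.0.1",
     "read": {"bw": 1000, "total_ios": 100, "clat_ns": {"mean": 1000.0}}},
    {"jobname": "bench", "hostname": "10.0.0.2",
     "read": {"bw": 3000, "total_ios": 300, "clat_ns": {"mean": 2000.0}}},
    {"jobname": "All clients",
     "read": {"bw": 4000, "total_ios": 400, "clat_ns": {"mean": 1750.0}}},
]}
print(aggregate_clients(report))
```

Weighting latency by total_ios keeps a lightly loaded client from skewing the average.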

Three pitfalls with fio --server / --client:

  • pkill fio will kill your SSH session too if you are SSH'd in as the same user. Clean up with pkill -f 'fio --server' or kill by PID.

  • --daemonize mode hangs in practice. Use nohup ... & as shown above.

  • Use private IPs, not public DNS in host_list.txt — public-path routing can cap bandwidth below the VPC spine.

Scaling Across Multiple Workers

When adding Alluxio workers, expect near-linear throughput scaling at low-to-mid concurrency and progressively lower efficiency as numjobs grows. The notation xW/yC below means x Alluxio workers × y FUSE client nodes (e.g., 3W/3C = 3 workers + 3 clients):

| Topology | Shared-files efficiency | Distinct-files efficiency |
| --- | --- | --- |
| 3W/3C | ~96% across all numjobs | ~96% across all numjobs |
| 10W/10C | 99–121% at numjobs ≤ 16, drops to 48–80% at numjobs ≥ 32 | ~86–99% across most numjobs |

The sharp efficiency drop for the shared-files pattern at 10W/10C high concurrency is FUSE file-level lock contention: too many threads on too few shared files. If you are running the shared-files pattern at 10+ workers, increase the file count to 100+ (from the 20 files used in this guide) to push the ceiling higher. The distinct-files pattern is less sensitive because each thread group owns a distinct file.
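This guide does not spell out the efficiency formula. A reasonable reading, assumed here, is multi-worker throughput divided by N times the single-worker throughput. The example applies that to the peak figures from the 200 Gbps baseline later in this guide; because those peaks occur at different numjobs values, the result is a peak-to-peak ratio and will not exactly match the per-numjobs figures in the table above.

```python
def scaling_efficiency(multi_bw, single_bw, workers):
    """Fraction of ideal linear scaling achieved with `workers` workers."""
    return multi_bw / (single_bw * workers)


# Peak GiB/s from the 200 Gbps baseline: 1W/1C versus 10W/10C.
shared = scaling_efficiency(89.7, 14.0, 10)
distinct = scaling_efficiency(131.3, 14.4, 10)
print(f"shared-files: {shared:.0%}, distinct-files: {distinct:.0%}")
```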

Long-running benchmarks. A full sweep over both patterns across 1W → 3W → 10W can take hours and will span multiple SSH timeouts. Wrap your driver script with nohup and trap '' HUP and run it in the background so it survives disconnects:

200 Gbps DRAM Scale-out Baseline

Collected with:

  • Worker / Client / Coordinator: AWS r6in.32xlarge (128 vCPU, 1024 GiB RAM, 200 Gbps ENA), Ubuntu 22.04, Alluxio AI-3.8-15.1.2

  • Page store: DRAM on independent tmpfs (300–500 GiB per worker)

  • Dataset: 20 × 10 GiB files (200 GiB total), S3-backed, pre-loaded via alluxio job load

  • FIO: 256 KiB block size, io_uring engine, direct=1, 60 s per step

Peak sequential read throughput:

| Topology | Shared-files peak (GiB/s / Gbps) | Distinct-files peak (GiB/s / Gbps) |
| --- | --- | --- |
| 1W/1C | 14.0 / 111.8 | 14.4 / 115.3 |
| 3W/3C | 37.6 / 300.4 | 41.5 / 331.6 |
| 10W/10C | 89.7 / 717.3 | 131.3 / 1050.0 |

Shared-files peaks at numjobs=256 (1W/1C, 3W/3C) or numjobs=32 (10W/10C). Distinct-files peaks at numjobs=8 across all topologies.

Expected peak vs NIC bandwidth. A well-tuned FUSE client with a warm DRAM page store typically peaks at 55–65% of the per-node NIC — on a 200 Gbps NIC that is roughly 14 GiB/s per client, before adding more clients. If your single-client numbers are substantially below this, check the JVM heap / direct memory sizing and FUSE mount options (max_idle_threads, max_background) before assuming worker or UFS bottlenecks. The 10W/10C numbers in the table above were measured without a cluster placement group — modern VPC spine bandwidth within a single AZ is usually sufficient at that scale. Basic network-placement guidance applies to every run and is covered in Before You Start.
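To compare your own single-client result against the 55–65% guideline, convert fio's GiB/s figure to Gbps and divide by the NIC line rate. The sketch below uses the strict binary-to-decimal conversion. The Gbps column in the table above appears to be the GiB/s figure multiplied by 8 without that factor, so a strict conversion comes out about 7% higher than the tabulated values.

```python
GIB = 2 ** 30  # bytes per GiB


def gib_s_to_gbps(gib_s):
    """Convert GiB/s to decimal gigabits per second."""
    return gib_s * GIB * 8 / 1e9


def nic_fraction(gib_s, nic_gbps):
    """Fraction of the NIC line rate used by a given throughput."""
    return gib_s_to_gbps(gib_s) / nic_gbps


print(f"{gib_s_to_gbps(14.0):.1f} Gbps")
print(f"{nic_fraction(14.0, 200):.0%} of a 200 Gbps NIC")
```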
