POSIX API Benchmarks
Fio (Flexible I/O Tester) is a powerful open-source tool for benchmarking storage systems. Because Alluxio can be mounted as a POSIX-compliant filesystem using FUSE, fio is an excellent tool for measuring its read/write IOPS and throughput.
How to read this guide:
Single-node evaluation — start at Prerequisites and stop at Cleanup. Covers one FUSE client against one Alluxio worker.
Multi-node / large-cluster tuning — after completing the single-node flow, continue into Advanced: Scale-out Benchmarking. Covers the two file-access patterns,
fio --server/--client, multi-worker scaling, and 200 Gbps reference numbers.
Before You Start
The client node is the machine (or pod) where fio runs and the Alluxio FUSE filesystem is mounted. In Kubernetes deployments this is an application pod that mounts the FUSE PVC — see POSIX API for pod setup. Confirm all of the items below before running any benchmark command.
Preparing for the Benchmark
Generating Test Data
Use fio itself to write the test file through the FUSE mount. This persists the file to UFS. Note that Alluxio's default write type is THROUGH (no cache on write), so this step does not populate the Alluxio cache — run the alluxio job load step in Preparing Cache State before any warm-cache read benchmark.
Preparing Cache State
Before each read benchmark, control the cache state to isolate what you are measuring.
Always drop the FUSE-side kernel page cache before every read run.
Run this on the client node before each warm-cache and cold-cache benchmark below.
Warm cache (measures Alluxio cache hit performance):
Because fio writes go through to UFS without populating the Alluxio cache, you must explicitly load the data. Submit a load job and poll --progress until the Job State reports SUCCEEDED (typically takes seconds to minutes depending on dataset size):
Cold cache (measures backend fetch + cache-fill performance):
Parameters and Test Design
Fixed Parameters
Keep these constant across a sweep so results are comparable.
ioengine(recommended:io_uring, orlibaioas fallback) — Linux async I/O backend.io_uringon kernel 5.1+ delivers higher throughput and lower latency.direct(recommended:1) — requestsO_DIRECTat open time to bypass the client's page cache.group_reporting(recommended: always set) — aggregates per-thread results into a single summary.runtime(recommended:60) — 60 seconds per step is long enough for steady-state without stretching sweeps too long.
Control Parameters
Adjust these based on what you are measuring.
rw— I/O pattern.readfor sequential throughput;randreadfor random IOPS.bs— block size.256Kto1Mfor throughput tests,4kfor IOPS tests. Don't exceed1M— oversized blocks (4M+) reduce operation count, introduce head-of-line blocking, and mask the stack's real throughput characteristics.numjobs— number of concurrent I/O threads. Start at1and sweep upward (1, 2, 4, 8, 16, 32, 64, 128, 256, 512). Stop at the p99-latency knee, not at the numeric BW peak — p99 degrades before throughput does.iodepth— outstanding I/Os per thread.1is typical;32combined with pinning each thread to its own file is used to maximize single-thread throughput. See Two File-Access Patterns for the trade-off.size— size of each test file.
Choosing File Layout
The number of test files is one of the most important design choices — it determines which ceiling you actually measure.
Single large file (e.g. one 100 GiB file). Measures per-file throughput, bounded by Alluxio FUSE's per-file serialization. Use for quick sanity checks and establishing the per-file ceiling.
Multiple files (e.g. 10–20 files of 5–10 GiB each) — recommended for headline benchmarks. Measures the cluster/worker throughput ceiling, with each file served by an independent stream. Peak is typically 10–15% higher than the single-file case.
Many small files (e.g. 100+ files). Approximates AI/ML training I/O, where handle management and metadata operations factor in. For peak-throughput measurement, prefer the multi-file option above.
To benchmark against multiple files, create them first with fio, then point fio at the set via a colon-separated list:
fio distributes jobs across the files round-robin — keep the file count ≥ numjobs to avoid reintroducing per-file lock contention.
For multi-worker or multi-client setups, two complementary file-access patterns are worth running — see Two File-Access Patterns in the Advanced section.
Benchmark Commands
Sequential Read Throughput
✅ Success: Look at the BW (bandwidth) value in the output. With warm cache on the hardware below, expect ~9 GiB/s at 32 threads.
Random Read IOPS
✅ Success: Look at the IOPS value. With warm cache on the hardware below, expect ~70k IOPS at 32 threads.
Scaling Concurrency
To find the throughput ceiling, run the same command with increasing --numjobs:
Interpreting the scan:
If
BWdoes not increase when doublingnumjobs, you have hit a bottleneck (network, disk, or CPU).P99 latency degrades before BW does. Once BW stops scaling linearly, check
clat 99.00th— if it has already jumped significantly, you are past the sweet spot. Pick thenumjobsvalue at the knee, not at the numeric peak.
Interpreting Results
A typical fio output looks like:
How to read it:
BW
Aggregated throughput across all threads
Primary metric for throughput tests
IOPS
I/O operations per second
Primary metric for IOPS tests
clat avg
Average completion latency
Lower is better; high values suggest contention
clat 99.00th
P99 tail latency
Spikes here indicate inconsistent performance
CPU % per GB/s
CPU efficiency of the FUSE stack (compute top CPU% ÷ BW)
High values indicate FUSE thread or JVM overhead on the worker
Rules of thumb:
If
clat P99is >10x theavg, investigate tail latency sources (GC pauses, UFS fallback, FUSE thread starvation).Cold cache throughput is typically 10-50x lower than warm cache. If the gap is smaller, the "warm" test may be hitting UFS due to insufficient cache loading.
Baseline Results
The reference numbers below are for a single-node NVMe setup. For high-throughput multi-node results on 200 Gbps instances, see 200 Gbps DRAM Scale-out Baseline in the Advanced section.
Only read results are shown. Alluxio's default write-through policy bypasses the cache on writes, so write throughput measured by fio reflects UFS performance rather than Alluxio cache performance. To benchmark Alluxio-accelerated writes, enable and tune the write cache first — see S3 Write Cache.
Single-node NVMe Baseline
Collected with:
Alluxio Worker Node: AWS
i3en.metal, 8 NVMe SSDs in RAID 0, Ubuntu 24.04Alluxio Client Node: AWS
c5n.metal, Ubuntu 24.04, FUSE 3.16.2+Single worker, single client, warm cache, single 100 GiB file, all instances in the same AWS availability zone.
Throughput (1M block size)
Sequential Read
2101 MiB/s
9519 MiB/s
8089 MiB/s
Random Read
202 MiB/s
6684 MiB/s
8276 MiB/s
IOPS (4k block size)
Sequential Read
55.9k
253k
192k
Random Read
2.3k
70.1k
162k
Throughput dropping at 128 threads (sequential read) indicates CPU saturation on the client node. This is expected behavior — the optimal concurrency depends on your hardware.
Troubleshooting
Most read-path issues that show up during a benchmark — UFS fallback, cache not warm, FUSE mount-option tuning, tail-latency spikes — are generic FUSE concerns rather than benchmark-specific. The authoritative troubleshooting lives in the POSIX API doc:
Throughput does not scale with numjobs
numjobsLikely cause: Network or CPU bottleneck on the client host.
How to diagnose: See POSIX API → Performance → Diagnosing Where You Are Bottlenecked for the symptom-to-layer mapping.
Cleanup
Remove the test data and free the cache entry after benchmarking. The glob below covers both the single-file layout (100gb) and the multi-file layout (file00..file19).
Advanced: Scale-out Benchmarking
This section covers what changes when you go from "one FUSE client, one worker" to "multiple workers, multiple clients". Complete the single-node flow above before continuing.
Two File-Access Patterns
Two complementary ways to drive concurrency, each representing a different real-world access pattern. Run both and take the higher peak as the cluster's headline throughput ceiling. Whichever pattern matches your production workload is the one to optimize against.
Shared-files pattern — many threads read the same set of files concurrently. Matches workloads where files are shared across clients or data locality is unknown (multi-tenant shared reads).
Distinct-files pattern — each thread (or small thread group) is pinned to its own file with a deep I/O queue. Matches sharded workloads — a canonical example is an AI training data pipeline where each GPU worker reads its own data shard.
Pattern
iodepth
File distribution
Peak numjobs
Shared-files
1
All threads read all files
Higher (32–256)
Distinct-files
32
Each file served by one thread group
Lower (4–16)
With iodepth=1 each thread issues one outstanding I/O at a time; concurrency scales by adding threads, so the shared-files pattern reaches peak at high numjobs. With iodepth=32 a single thread pinned to a file can already saturate a large fraction of network bandwidth; the distinct-files pattern reaches peak at low numjobs, and beyond that point compounding queue depth drives latency up sharply without adding throughput.
Shared-files command — one fio section, all threads share the same colon-separated file list:
Distinct-files command — use a multi-section fio job file so each file is owned by a distinct section. Generate the job file based on how numjobs compares to file count:
numjobs ≤ num_files→numjobssections, each withnumjobs=1and files assigned round-robin.numjobs > num_files→num_filessections, each withnumjobs=numjobs/num_filespinned to one file.
Example generator for the common case (nj ≤ num_files):
Scaling Across Multiple Clients
The typical scale-out progression: a single FUSE client first, then additional FUSE clients to push the cluster to saturation, then additional Alluxio workers once one worker's ceiling is reached. Move to multiple FUSE client nodes when one client can no longer drive the measured single-worker ceiling.
fio has a built-in distributed mode. Start fio --server on each FUSE client node, then dispatch jobs from a controller:
Results from all clients are aggregated into a single summary. Parsing note: the JSON from fio --client puts per-host results under client_stats[] — not jobs[] as in single-node mode. Aggregate with sum(bw) per client and a total_ios-weighted average for latency.
Three pitfalls with fio --server / --client:
pkill fiowill kill your SSH session too if you are SSH'd in as the same user. Clean up withpkill -f 'fio --server'or kill by PID.--daemonizemode hangs in practice. Usenohup ... &as shown above.Use private IPs, not public DNS in
host_list.txt— public-path routing can cap bandwidth below the VPC spine.
Scaling Across Multiple Workers
When adding Alluxio workers, expect near-linear throughput scaling at low-to-mid concurrency and progressively lower efficiency as numjobs grows. The notation xW/yC below means x Alluxio workers × y FUSE client nodes (e.g., 3W/3C = 3 workers + 3 clients):
3W/3C
~96% across all numjobs
~96% across all numjobs
10W/10C
99–121% at numjobs ≤ 16, drops to 48–80% at numjobs ≥ 32
~86–99% across most numjobs
The sharp efficiency drop for the shared-files pattern at 10W/10C high concurrency is FUSE file-level lock contention: too many threads on too few shared files. If you are running the shared-files pattern at 10+ workers, increase the file count to 100+ (from the 20 files used in this guide) to push the ceiling higher. The distinct-files pattern is less sensitive because each thread group owns a distinct file.
Long-running benchmarks. A full sweep over both patterns across 1W → 3W → 10W can take hours and will span multiple SSH timeouts. Wrap your driver script with nohup and trap '' HUP and run it in the background so it survives disconnects:
200 Gbps DRAM Scale-out Baseline
Collected with:
Worker / Client / Coordinator: AWS
r6in.32xlarge(128 vCPU, 1024 GiB RAM, 200 Gbps ENA), Ubuntu 22.04, Alluxio AI-3.8-15.1.2Page store: DRAM on independent tmpfs (300–500 GiB per worker)
Dataset: 20 × 10 GiB files (200 GiB total), S3-backed, pre-loaded via
alluxio job loadFIO: 256 KiB block size,
io_uringengine,direct=1, 60 s per step
Peak sequential read throughput:
1W/1C
14.0 / 111.8
14.4 / 115.3
3W/3C
37.6 / 300.4
41.5 / 331.6
10W/10C
89.7 / 717.3
131.3 / 1050.0
Shared-files peaks at numjobs=256 (1W/1C, 3W/3C) or numjobs=32 (10W/10C). Distinct-files peaks at numjobs=8 across all topologies.
Expected peak vs NIC bandwidth. A well-tuned FUSE client with a warm DRAM page store typically peaks at 55–65% of the per-node NIC — on a 200 Gbps NIC that is roughly 14 GiB/s per client, before adding more clients. If your single-client numbers are substantially below this, check the JVM heap / direct memory sizing and FUSE mount options (max_idle_threads, max_background) before assuming worker or UFS bottlenecks. The 10W/10C numbers in the table above were measured without a cluster placement group — modern VPC spine bandwidth within a single AZ is usually sufficient at that scale. Basic network-placement guidance applies to every run and is covered in Before You Start.
See Also
POSIX API via FUSE — FUSE mount setup and tuning (including mount options table)
fio documentation — full reference for every fio flag and ioengine
Last updated