S3 API Benchmarks
Scope
TL;DR
Three reference performance baselines on different hardware (AWS 4-node COSBench, OCI 6-node Warp, AWS 6-node httpbench).
Object size matters. For 1+ GiB objects (typical AI model shards), Alluxio's HTTP 307 redirect costs almost no throughput once keep-alive is warm (within ~2% in our tests). For small objects (<100 KiB), the handshake dominates and can halve throughput, so redirect mode (Pattern A) and proxy mode (Pattern B) diverge.
Key potential performance bottlenecks: network bandwidth, TCP connection reuse, HTTP redirect cost (small objects only), kernel tuning.
For how Alluxio's S3 API works (request flow, consistent hashing, redirects), see How It Works.
Choosing a Benchmark Tool
| Feature | COSBench | Warp | httpbench |
| --- | --- | --- | --- |
| Best for | Complex, multi-stage workloads | Quick bucket-wide single-operation test | Per-worker throughput in redirect-mode clusters |
| Setup | Controller + driver nodes | Single binary | ~50 lines of Go, build on client |
| Workload definition | XML config files | CLI flags | URL list on command line |
| Follows HTTP 307 | Yes (via SDK); needs alluxio.worker.s3.redirect.enabled=true | No; incompatible with redirect mode | Yes (Go default); works with either pattern |
| Multi-client coordination | Built-in driver model | --syncstart | ssh + timestamp coordination |
| Results UI | Web dashboard | Terminal output | Terminal output |
Rule of thumb:
Redirect enabled + per-worker isolation needed (e.g. CPU-profiling one worker, measuring a single NIC): use httpbench.
Redirect disabled + want a quick bucket-wide number: use Warp.
Mixed read/write, multiple drivers, long-running, or complex staged workloads: use COSBench.
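As a rough illustration of what "Follows HTTP 307" means for a client, the sketch below starts two throwaway local HTTP servers: one answers every GET with a 307 pointing at the other, which serves the payload. It is a toy model, not Alluxio's implementation; the handler names, the payload, and the object path are all hypothetical. Python's urllib follows a 307 on GET by default, which is the behaviour a redirect-capable benchmark client needs.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

PAYLOAD = b"shard-bytes"  # hypothetical stand-in for a cached object


class OwnerHandler(BaseHTTPRequestHandler):
    """Toy 'owning worker': serves the object bytes directly."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Length", str(len(PAYLOAD)))
        self.end_headers()
        self.wfile.write(PAYLOAD)

    def log_message(self, *args):  # keep the demo quiet
        pass


def make_redirect_handler(owner_port):
    """Toy 'entry worker': answers every GET with a 307 to the owner."""

    class RedirectHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(307)
            self.send_header("Location", f"http://127.0.0.1:{owner_port}{self.path}")
            self.send_header("Content-Length", "0")
            self.end_headers()

        def log_message(self, *args):
            pass

    return RedirectHandler


def start_server(handler):
    # Port 0 asks the OS for any free port.
    server = HTTPServer(("127.0.0.1", 0), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_via_redirect(path="/bucket/model.safetensors"):
    """GET from the entry server; return (final_url, body, owner_port)."""
    owner = start_server(OwnerHandler)
    entry = start_server(make_redirect_handler(owner.server_port))
    # An empty ProxyHandler keeps environment proxies out of the local demo;
    # the default redirect handler is still installed by build_opener.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(f"http://127.0.0.1:{entry.server_port}{path}") as resp:
            return resp.geturl(), resp.read(), owner.server_port
    finally:
        for server in (entry, owner):
            server.shutdown()
            server.server_close()


if __name__ == "__main__":
    final_url, body, _ = fetch_via_redirect()
    print("served by:", final_url, "bytes:", len(body))
```

The final URL reported by the client is the owner's address, not the entry point's. A client that does not follow the 307 would stop at the empty redirect response, which is why such a tool cannot be used against a redirect-mode cluster.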
Reference Performance Baselines
All throughput numbers below assume data is fully cached in Alluxio. If data is served from the underlying UFS, throughput will be significantly lower. Verify with bin/alluxio fs check-cached /path before testing.
4 Node COSBench on AWS
The following results were achieved using a 4-driver COSBench cluster testing a 4-worker Alluxio cluster.
| Component | Configuration |
| --- | --- |
| COSBench Controller | 1 × c5n.metal |
| COSBench Drivers | 4 × c5n.metal |
| Alluxio Coordinator | 1 node |
| Alluxio Workers | 4 × i3en.metal (8 NVMe SSDs each) |
| Load Balancer | AWS ELB across 4 workers |
Large file read throughput (1 GB files) — bandwidth-bound, scales with concurrency until network saturation:
| Concurrency | Throughput |
| --- | --- |
| 1 thread | 2.35 GB/s |
| 16 threads | 20.44 GB/s |
| 128 threads | 36.94 GB/s |
Small file read IOPS (100 KB files) — IOPS-bound, scales with concurrency until CPU saturation:
| Concurrency | Throughput | IOPS |
| --- | --- | --- |
| 1 thread | 50.26 MB/s | 502 op/s |
| 16 threads | 1.10 GB/s | 11,302 op/s |
| 128 threads | 4.69 GB/s | 46,757 op/s |
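For small objects the throughput and IOPS columns are two views of the same measurement: throughput is roughly IOPS times object size. The sketch below redoes that arithmetic for the figures above, assuming "100 KB" means 100,000 bytes (the source does not say whether decimal or binary units were used); the function name is illustrative.

```python
OBJECT_BYTES = 100_000  # assumption: "100 KB" = 100,000 bytes


def throughput_mb_per_sec(ops_per_sec, object_bytes=OBJECT_BYTES):
    """Data rate implied by an operation rate, in decimal MB/s."""
    return ops_per_sec * object_bytes / 1e6


# (threads, op/s) pairs copied from the small-file results above
MEASURED = [(1, 502), (16, 11_302), (128, 46_757)]

if __name__ == "__main__":
    for threads, ops in MEASURED:
        mb_s = throughput_mb_per_sec(ops)
        print(f"{threads:>3} threads: {ops:>6} op/s -> {mb_s:8.1f} MB/s")
```

Under that assumption the implied rates are about 50 MB/s, 1.13 GB/s, and 4.68 GB/s, which agree with the reported throughput to within a few percent. The residual gap is presumably unit or rounding differences in the tool's output.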
6 Node Warp on OCI
In Warp GET tests on OCI BM.DenseIO.E5.128 nodes (100 Gbps networking, 12 × NVMe in RAID 0), Alluxio achieved 11.2 GiB/s on a single node (0.3 ms avg latency, 0.4 ms P99) and 33.3 GiB/s on 6 nodes (0.6 ms avg, 0.9 ms P99). Note that Warp does not follow HTTP 307 redirects to non-AWS endpoints, so these numbers reflect Pattern B (load balancer + proxy mode). See Alluxio on OCI for full results.
6 Node httpbench on AWS
Run on 6 × c5n.18xlarge workers (72 vCPU, 192 GiB RAM, 100 Gbps NIC, 80 GiB tmpfs page store) + 6 × c5n.18xlarge clients, serving 82 safetensor files (~137 GB, 1–3 GiB per file) fully cached in Alluxio. Tool: httpbench.
Single worker, single client (1:1), 1–3 GiB objects:
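| Concurrency (C) | Throughput | Notes |
| --- | --- | --- |
| 1 | 0.62 GB/s (5.0 Gbps) | AWS ENA per-flow cap |
| 16 | 8.66 GB/s (69.3 Gbps) | 87% |
| 32 | 11.35 GB/s (90.8 Gbps) | 95% |
| 64 | 11.15 GB/s (89.2 Gbps) | saturated |
| 128 | 10.46 GB/s (83.7 Gbps) | connection-count overhead |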
A single worker's S3 API can deliver within ~5% of TCP line rate to a single client when reads are local.
6 clients × 6 workers paired aggregate (C=32 per client, 30s):
| Metric | Value |
| --- | --- |
| Per-pair average | 11.43 GB/s (91.4 Gbps) |
| Aggregate throughput | 68.55 GB/s (548 Gbps) |
| Worker CPU avg (72-vCPU, mpstat -P ALL) | 4.1% ≈ 3 cores avg |
| Worker CPU peak | 7.2–8.2% ≈ 5–6 cores peak |
Throughput scales near-linearly, from 11.4 GB/s per pair to 68.5 GB/s across 6 pairs. Worker CPU stays mostly idle during the aggregate run (about 4% on average): for large-object reads, Alluxio is NIC-bound, not CPU-bound.
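The two claims above (near-linear scaling, low CPU use) can be checked with a couple of lines of arithmetic. This is a minimal sketch using the reported figures; the function names are illustrative, and the single-pair reference of 11.35 GB/s is taken from the 1:1 run at C=32.

```python
def scaling_efficiency(aggregate_gbs, single_pair_gbs, pairs):
    """Aggregate throughput divided by ideal linear scaling (1.0 = perfect)."""
    return aggregate_gbs / (single_pair_gbs * pairs)


def busy_cores(cpu_percent, vcpus):
    """Convert a whole-machine CPU percentage into an equivalent core count."""
    return cpu_percent / 100 * vcpus


if __name__ == "__main__":
    eff = scaling_efficiency(aggregate_gbs=68.55, single_pair_gbs=11.35, pairs=6)
    print(f"scaling efficiency: {eff:.3f}")
    print(f"avg busy cores:  {busy_cores(4.1, 72):.2f}")
    print(f"peak busy cores: {busy_cores(7.2, 72):.2f} to {busy_cores(8.2, 72):.2f}")
```

An efficiency of about 1.0 means the six pairs did not contend with each other, and roughly 3 busy cores out of 72 leaves the CPU far from being the limit.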
Network Ceiling (iperf3 Baseline)
Raw TCP between a client and a worker on the same test bed, for reference:
| Parallel streams | Throughput | Notes |
| --- | --- | --- |
| 1 | 4.97 Gbps (0.62 GB/s) | AWS ENA single-flow cap |
| 8 | 37.9 Gbps | |
| 32 | 95.6 Gbps (11.95 GB/s) | ~100 Gbps NIC line rate |
Any per-client S3 API number above ~95 Gbps on this hardware is impossible regardless of cluster size — the NIC is the ceiling. Always establish this ceiling with iperf3 -c <worker> -P 32 before interpreting S3 API numbers.
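One way to apply this ceiling is to express each per-client S3 API result as a fraction of the iperf3 baseline and to treat anything above 1.0 as a measurement error. The helper below is a sketch with hypothetical names; the 95.6 Gbps default is the 32-stream iperf3 figure from this test bed and should be replaced with your own measurement.

```python
def gbs_to_gbps(gigabytes_per_sec):
    """Convert GB/s to Gbps (8 bits per byte)."""
    return gigabytes_per_sec * 8


def fraction_of_ceiling(measured_gbs, ceiling_gbps=95.6):
    """Per-client throughput as a fraction of the measured network ceiling.

    Raises ValueError if the result exceeds the ceiling, since that points
    to a units mix-up or an aggregate number being read as per-client.
    """
    fraction = gbs_to_gbps(measured_gbs) / ceiling_gbps
    if fraction > 1:
        raise ValueError(
            f"{measured_gbs} GB/s exceeds the {ceiling_gbps} Gbps ceiling; "
            "check units and whether this is an aggregate figure"
        )
    return fraction


if __name__ == "__main__":
    print(f"C=32 single pair: {fraction_of_ceiling(11.35):.1%} of ceiling")
```

For the 11.35 GB/s single-pair result this gives about 95% of the ceiling. The check only makes sense for per-client numbers; cluster-wide aggregates legitimately exceed a single NIC.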
307 Redirect Cost: Large vs Small Objects
The "Pattern B (proxy-mode) ≈ 50% of Pattern A (redirect)" guidance in the S3 API documentation applies to small-object workloads where the 307 handshake cost dominates each request. For large-object sequential reads (1+ GiB shards, the common case for AI model loading), the handshake happens once and amortises to near-zero — Pattern A and Pattern B throughput were within ~2% of each other in our tests (Pattern B 11.28 GB/s vs Pattern A 11.02 GB/s at C=64 for 1–3 GiB objects; Pattern B 10.80 GB/s vs Pattern A 10.89 GB/s at C=128).
Rule of thumb: if avg_object_bytes / NIC_bytes_per_sec greatly exceeds the 307 redirect RTT (typically ~1 ms in-AZ), redirect cost is noise. On a 100 Gbps NIC, this means the redirect is essentially free for 100 MB+ objects and negligible (<2%) for 1 GB+ objects; for <1 MB objects, the handshake dominates and Pattern B throughput collapses to roughly 50% of Pattern A.
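The rule of thumb can be turned into a quick estimate. The sketch below compares the time to move one object at NIC line rate with one redirect round trip. It assumes a 100 Gbps NIC (12.5 GB/s), the ~1 ms in-AZ RTT quoted above, and a single request with nothing overlapping it; the function name and the overhead formula are illustrative, not taken from Alluxio.

```python
NIC_BYTES_PER_SEC = 100e9 / 8  # assumption: 100 Gbps NIC = 12.5 GB/s
REDIRECT_RTT_SEC = 1e-3        # assumption: ~1 ms in-AZ round trip


def redirect_cost(object_bytes, nic_bytes_per_sec=NIC_BYTES_PER_SEC,
                  rtt_sec=REDIRECT_RTT_SEC):
    """Return (transfer_time / rtt, redirect share of total request time)."""
    transfer_sec = object_bytes / nic_bytes_per_sec
    ratio = transfer_sec / rtt_sec
    overhead = rtt_sec / (rtt_sec + transfer_sec)
    return ratio, overhead


if __name__ == "__main__":
    for label, size in [("100 KB", 100e3), ("1 GB", 1e9)]:
        ratio, overhead = redirect_cost(size)
        print(f"{label:>6}: transfer/RTT = {ratio:.3g}, "
              f"redirect share = {overhead:.1%}")
```

For a 1 GB object the transfer takes about 80 times longer than the redirect, so the redirect accounts for a little over 1% of the request; for a 100 KB object the redirect is almost the entire request. This is a pessimistic single-request model: a client running many requests concurrently overlaps redirects with other transfers, so measured overhead can be lower.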
Performance Tuning and Troubleshooting
For suggested Alluxio configuration parameters and Linux kernel tuning, see S3 API — Performance. For tool-specific issues, see each benchmark page's Troubleshooting section.
Cross-tool symptoms:
Small-object throughput ~50% lower than expected: for workloads with <100 KiB objects, redirects are likely disabled (alluxio.worker.s3.redirect.enabled=false, the default), so cross-worker reads are proxied through an intermediate worker. To get full throughput, use Pattern A: set alluxio.worker.s3.redirect.enabled=true with a redirect-capable client. See Deployment Patterns. For large objects (1+ GiB), Pattern A and Pattern B throughput are near-identical and redirect cost is negligible; see 307 Redirect Cost: Large vs Small Objects.
Throughput far below baselines: most likely data is not fully cached. Verify with bin/alluxio fs check-cached that files show as cached in Alluxio before testing.
Low throughput despite high concurrency: network bottleneck or unbalanced load balancer. Verify 100 Gbps connectivity, same-AZ deployment, and that the load balancer distributes requests evenly across all Alluxio workers. Establish the NIC ceiling with iperf3 first; see Network Ceiling (iperf3 Baseline).
No scaling with added concurrency: CPU or connection pool bottleneck. Check worker CPU utilization and ensure alluxio.worker.s3.connection.keep.alive.enabled is set to true.
High tail latency: TCP port exhaustion. Apply kernel tuning (tcp_tw_reuse, tcp_fin_timeout).
Throughput plateaus at low level: health check overhead. Disable alluxio.worker.s3.redirect.health.check.enabled for benchmarks.
Inconsistent or highly variable results across runs: data not fully cached, or noisy environment (cross-AZ traffic, shared network). Pre-load data and re-run in a dedicated, same-AZ setup.
See Also
COSBench Benchmarks — complex, multi-stage workloads
Warp Benchmarks — quick single-binary, redirect-disabled clusters
httpbench Benchmarks — per-worker, redirect-aware
S3 API Setup and Configuration — deployment patterns, endpoint setup, load balancer configuration, and client examples
S3 UFS Integration — multipart upload tuning, high concurrency settings, and S3 region configuration