S3 API Benchmarks

Scope

TL;DR

  • Three benchmark tools covered by dedicated pages: COSBench (complex mixed workloads), Warp (quick bucket-wide reads), and httpbench (a ~50-line Go tool for per-worker measurement on redirect-mode clusters).

  • Three reference performance baselines on different hardware (AWS 4-node COSBench, OCI 6-node Warp, AWS 6-node httpbench).

  • Object size matters. For 1+ GiB objects (typical AI model shards), Alluxio's HTTP 307 redirect costs essentially 0% throughput once keep-alive is warm. For small objects (<100 KiB), the handshake dominates and can halve throughput — the two patterns diverge.

  • Key potential performance bottlenecks: network bandwidth, TCP connection reuse, HTTP redirect cost (small objects only), kernel tuning.

For how Alluxio's S3 API works (request flow, consistent hashing, redirects), see How It Works.

Choosing a Benchmark Tool

|  | COSBench | Warp | httpbench |
|---|---|---|---|
| Best for | Complex, multi-stage workloads | Quick bucket-wide single-operation test | Per-worker throughput in redirect-mode clusters |
| Setup | Controller + driver nodes | Single binary | ~50 lines of Go, build on client |
| Workload definition | XML config files | CLI flags | URL list on command line |
| Follows HTTP 307 | Yes (via SDK); needs alluxio.worker.s3.redirect.enabled=true | No; incompatible with redirect mode | Yes (Go default); works with either pattern |
| Multi-client coordination | Built-in driver model | --syncstart | ssh + timestamp coordination |
| Results UI | Web dashboard | Terminal output | Terminal output |

Rule of thumb:

  • Redirect enabled + per-worker isolation needed (e.g. CPU-profiling one worker, measuring a single NIC): use httpbench.

  • Redirect disabled + want a quick bucket-wide number: use Warp.

  • Mixed read/write, multiple drivers, long-running, or complex staged workloads: use COSBench.

Reference Performance Baselines

All throughput numbers below assume data is fully cached in Alluxio. If data is served from the underlying UFS, throughput will be significantly lower. Verify with bin/alluxio fs check-cached /path before testing.

4 Node COSBench on AWS

The following results were achieved using a 4-driver COSBench cluster testing a 4-worker Alluxio cluster.

| Component | Configuration |
|---|---|
| COSBench Controller | 1 × c5n.metal |
| COSBench Drivers | 4 × c5n.metal |
| Alluxio Coordinator | 1 node |
| Alluxio Workers | 4 × i3en.metal (8 NVMe SSDs each) |
| Load Balancer | AWS ELB across 4 workers |

Large file read throughput (1 GB files) — bandwidth-bound, scales with concurrency until network saturation:

| Concurrency per Driver | Total Throughput |
|---|---|
| 1 thread | 2.35 GB/s |
| 16 threads | 20.44 GB/s |
| 128 threads | 36.94 GB/s |

Small file read IOPS (100 KB files) — IOPS-bound, scales with concurrency until CPU saturation:

| Concurrency per Driver | Total Throughput | Total Operations/sec |
|---|---|---|
| 1 thread | 50.26 MB/s | 502 op/s |
| 16 threads | 1.10 GB/s | 11,302 op/s |
| 128 threads | 4.69 GB/s | 46,757 op/s |

6 Node Warp on OCI

In Warp GET tests on OCI BM.DenseIO.E5.128 nodes (100 Gbps networking, 12 × NVMe in RAID 0), Alluxio achieved 11.2 GiB/s on a single node (0.3 ms avg latency, 0.4 ms P99) and 33.3 GiB/s on 6 nodes (0.6 ms avg, 0.9 ms P99). Warp does not follow HTTP 307 redirects to non-AWS endpoints, so these numbers reflect Pattern B (load balancer + proxy mode). See Alluxio on OCI for full results.

6 Node httpbench on AWS

Run on 6 × c5n.18xlarge workers (72 vCPU, 192 GiB RAM, 100 Gbps NIC, 80 GiB tmpfs page store) + 6 × c5n.18xlarge clients, serving 82 safetensor files (~137 GB, 1–3 GiB per file) fully cached in Alluxio. Tool: httpbench.

Single worker, single client (1:1), 1–3 GiB objects:

| Concurrency | Throughput | vs iperf3 ceiling |
|---|---|---|
| 1 | 0.62 GB/s (5.0 Gbps) | AWS ENA per-flow cap |
| 16 | 8.66 GB/s (69.3 Gbps) | 87% |
| 32 | 11.35 GB/s (90.8 Gbps) | 95% |
| 64 | 11.15 GB/s (89.2 Gbps) | saturated |
| 128 | 10.46 GB/s (83.7 Gbps) | connection-count overhead |

A single worker's S3 API can deliver within ~5% of TCP line rate to a single client when reads are local.

6 clients × 6 workers paired aggregate (C=32 per client, 30s):

| Metric | Value |
|---|---|
| Per-pair average | 11.43 GB/s (91.4 Gbps) |
| Aggregate throughput | 68.55 GB/s (548 Gbps) |
| Worker CPU avg (72-vCPU, mpstat -P ALL) | 4.1% ≈ 3 cores avg |
| Worker CPU peak | 7.2–8.2% ≈ 5–6 cores peak |

Throughput scales near-linearly, from 11.4 GB/s per pair to 68.5 GB/s across 6 pairs. Worker CPU stays nearly idle during the aggregate run (about 4% average on 72 vCPUs), so for large-object reads Alluxio is NIC-bound, not CPU-bound.
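The scaling and CPU figures above can be re-derived with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes the 1:1 result at C=32 (11.35 GB/s) is the right single-pair baseline for the 6-pair aggregate.

```python
# Sketch: re-derive scaling efficiency and busy-core equivalents from the
# numbers quoted in this section. Function names are illustrative only.

WORKER_VCPUS = 72  # c5n.18xlarge vCPU count, as listed for this test bed


def scaling_efficiency(aggregate_gb_s: float, single_pair_gb_s: float, pairs: int) -> float:
    """Aggregate throughput divided by (single-pair throughput x pair count)."""
    return aggregate_gb_s / (single_pair_gb_s * pairs)


def busy_cores(cpu_pct: float, vcpus: int = WORKER_VCPUS) -> float:
    """Translate an average CPU utilisation percentage into core equivalents."""
    return cpu_pct / 100.0 * vcpus


if __name__ == "__main__":
    # 68.55 GB/s aggregate vs 11.35 GB/s for one pair at C=32 (assumed baseline)
    eff = scaling_efficiency(68.55, 11.35, 6)
    print(f"scaling efficiency: {eff:.1%}")
    print(f"average busy cores: {busy_cores(4.1):.1f}")
    print(f"peak busy cores:    {busy_cores(7.2):.1f} to {busy_cores(8.2):.1f}")
```

An efficiency close to 100% together with only a few busy cores is what the NIC-bound conclusion rests on; a result well below 100% on other hardware would point to a shared bottleneck worth investigating before adding workers.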

Network Ceiling (iperf3 Baseline)

Raw TCP between a client and a worker on the same test bed, for reference:

| TCP streams | Throughput |
|---|---|
| 1 | 4.97 Gbps (0.62 GB/s); AWS ENA single-flow cap |
| 8 | 37.9 Gbps |
| 32 | 95.6 Gbps (11.95 GB/s); ~100 Gbps NIC line rate |

Any per-client S3 API number above ~95 Gbps on this hardware is impossible regardless of cluster size — the NIC is the ceiling. Always establish this ceiling with iperf3 -c <worker> -P 32 before interpreting S3 API numbers.
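To compare an S3 API result against the iperf3 ceiling, both numbers need to be in the same unit. The following sketch converts decimal GB/s to Gbps and expresses a measurement as a percentage of the ceiling. The helper names are hypothetical, and the 95.6 Gbps default is simply the 32-stream iperf3 result from this test bed; substitute your own measurement.

```python
# Sketch: convert S3 API throughput (decimal GB/s) to Gbps and compare it
# with the iperf3 ceiling measured on the same client/worker pair.

IPERF3_CEILING_GBPS = 95.6  # 32-stream iperf3 result quoted in this section


def gb_per_s_to_gbps(gb_per_s: float) -> float:
    """Decimal gigabytes per second to gigabits per second (multiply by 8)."""
    return gb_per_s * 8.0


def pct_of_ceiling(gb_per_s: float, ceiling_gbps: float = IPERF3_CEILING_GBPS) -> float:
    """Throughput as a percentage of the measured network ceiling."""
    return 100.0 * gb_per_s_to_gbps(gb_per_s) / ceiling_gbps


if __name__ == "__main__":
    # (concurrency, GB/s) pairs from the single-worker, single-client run
    for concurrency, gb_per_s in [(32, 11.35), (64, 11.15), (128, 10.46)]:
        print(f"C={concurrency}: {gb_per_s_to_gbps(gb_per_s):.1f} Gbps, "
              f"{pct_of_ceiling(gb_per_s):.0f}% of iperf3 ceiling")
```

If a per-client result comes out above 100% of the ceiling, the usual explanation is a unit mix-up (GiB/s vs GB/s) or a ceiling measured with too few parallel streams.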

307 Redirect Cost: Large vs Small Objects

The "Pattern B (proxy mode) ≈ 50% of Pattern A (redirect)" guidance in the S3 API documentation applies to small-object workloads, where the 307 handshake cost dominates each request. For large-object sequential reads (1+ GiB shards, the common case for AI model loading), the handshake happens once and amortises to near zero. In our tests with 1–3 GiB objects, Pattern A and Pattern B throughput were within about 2% of each other: 11.28 GB/s (Pattern B) vs 11.02 GB/s (Pattern A) at C=64, and 10.80 GB/s (Pattern B) vs 10.89 GB/s (Pattern A) at C=128.

Rule of thumb: if avg_object_bytes / NIC_bytes_per_sec greatly exceeds the 307 redirect RTT (typically ~1 ms in-AZ), redirect cost is noise. On a 100 Gbps NIC, this means the redirect is essentially free for 100 MB+ objects and negligible (<2%) for 1 GB+ objects; for <1 MB objects, the handshake dominates and Pattern B throughput collapses to roughly 50% of Pattern A.
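The rule of thumb can be evaluated directly. The sketch below computes how many redirect round trips fit into the time it takes to move one object at NIC line rate. It is a simplified model: it assumes a single request at full line rate, uses the ~1 ms in-AZ RTT quoted above as a default, and ignores concurrency and connection reuse. The function names are hypothetical.

```python
# Sketch of the rule of thumb: compare object transfer time at NIC line rate
# with the 307 redirect round-trip time. Names and defaults are illustrative.


def transfer_time_ms(object_bytes: float, nic_gbps: float) -> float:
    """Time to move one object at full NIC line rate, in milliseconds."""
    nic_bytes_per_sec = nic_gbps * 1e9 / 8
    return object_bytes / nic_bytes_per_sec * 1000.0


def transfer_to_rtt_ratio(object_bytes: float,
                          nic_gbps: float = 100.0,
                          redirect_rtt_ms: float = 1.0) -> float:
    """Ratio of transfer time to redirect RTT; much greater than 1 means the redirect is noise."""
    return transfer_time_ms(object_bytes, nic_gbps) / redirect_rtt_ms


if __name__ == "__main__":
    sizes = {"100 KiB": 100 * 1024, "1 MB": 1e6, "100 MB": 1e8, "1 GB": 1e9}
    for label, size in sizes.items():
        print(f"{label:>8}: transfer/RTT = {transfer_to_rtt_ratio(size):.3f}")
```

A ratio well above 1 means the object transfer dwarfs the redirect round trip; a ratio well below 1 means the round trip is the larger share of each request, which is the small-object regime described above.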

Performance Tuning and Troubleshooting

For suggested Alluxio configuration parameters and Linux kernel tuning, see S3 API — Performance. For tool-specific issues, see each benchmark page's Troubleshooting section.

Cross-tool symptoms:

  • Small-object throughput ~50% lower than expected — for workloads with <100 KiB objects, redirects are likely disabled (alluxio.worker.s3.redirect.enabled=false, the default), so cross-worker reads are proxied through an intermediate worker. To get full throughput, use Pattern A: set alluxio.worker.s3.redirect.enabled=true with a redirect-capable client. See Deployment Patterns. For large objects (1+ GiB), Pattern A and Pattern B throughput are near-identical — redirect cost is negligible, see 307 Redirect Cost: Large vs Small Objects.

  • Throughput far below baselines — most likely data is not fully cached. Verify with bin/alluxio fs check-cached that files show as cached in Alluxio before testing.

  • Low throughput despite high concurrency — network bottleneck or unbalanced load balancer. Verify 100 Gbps connectivity, same-AZ deployment, and that the load balancer correctly distributes requests evenly across all Alluxio workers. Establish the NIC ceiling with iperf3 first — see Network Ceiling (iperf3 Baseline).

  • No scaling with added concurrency — CPU or connection pool bottleneck. Check worker CPU utilization and ensure alluxio.worker.s3.connection.keep.alive.enabled is set to true.

  • High tail latency — TCP port exhaustion. Apply kernel tuning (tcp_tw_reuse, tcp_fin_timeout).

  • Throughput plateaus at low level — health check overhead. Disable alluxio.worker.s3.redirect.health.check.enabled for benchmarks.

  • Inconsistent or highly variable results across runs — data not fully cached, or noisy environment (cross-AZ traffic, shared network). Pre-load data and re-run in a dedicated, same-AZ setup.
