RDMA Networking

Alluxio supports several high-speed network technologies commonly deployed in AI and HPC clusters. This page covers configuration and performance guidance for each supported option.

| Technology | Status | Use Case |
| --- | --- | --- |
| IPoIB (IP over InfiniBand) | ✅ Supported | Standard TCP/IP over IB hardware, zero code changes |
| RoCE (RDMA over Converged Ethernet) | Planned | Low-latency RDMA over Ethernet fabric |
| Native IB API (Verbs) | Planned | Ultra-low latency RDMA over InfiniBand fabric |

IPoIB

Overview

InfiniBand (IB) is a high-bandwidth, low-latency interconnect commonly deployed in AI training clusters. Alluxio supports IP over InfiniBand (IPoIB), which runs the standard TCP/IP stack over IB hardware. Because Alluxio communicates over standard TCP/IP sockets, no code changes or special drivers are required — you only need to load the IPoIB kernel module and bind Alluxio services to the IB network interface.

Applies to: NICs configured with InfiniBand link layer (verified via ibstat | grep "Link layer"). If your ConnectX adapter is running in Ethernet link layer mode, it operates as a standard high-speed Ethernet NIC — Alluxio works with it natively with no IPoIB configuration needed.

IPoIB vs. Native RDMA

|  | IPoIB | Native RDMA (Verbs API) |
| --- | --- | --- |
| Protocol | TCP/IP over IB hardware | Bypass kernel, direct memory access |
| Alluxio support | ✅ Fully supported | Not supported in 3.8 |
| Configuration | Bind to IB network interface | Requires RDMA-aware application code |
| Typical throughput | 100–400 Gbps (hardware-dependent) | Lower latency, similar peak bandwidth |

Prerequisites

Hardware

  • Mellanox/NVIDIA ConnectX-6 or ConnectX-7 network adapter (or equivalent)

  • InfiniBand switch fabric

Software

Load the IPoIB kernel module (ib_ipoib) and verify that the IB drivers and interfaces are active.
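The checks above might look like the following sketch. It assumes the interface is named ib0 and that the module uses the standard Linux name ib_ipoib. The helper names module_loaded and check_ipoib are illustrative and are not part of Alluxio or MLNX_OFED.

```shell
# Hypothetical helper: succeeds if the named module appears in a
# /proc/modules-style listing (defaults to the live /proc/modules).
module_loaded() {
  awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

# Sketch: load IPoIB and confirm the IB stack is up. Run as root.
check_ipoib() {
  iface="${1:-ib0}"                      # interface name is an assumption
  modprobe ib_ipoib || return 1          # load the IPoIB kernel module
  module_loaded ib_ipoib || return 1     # confirm the module is resident
  ibstat | grep -E 'State|Link layer'    # port state and link-layer type
  ip addr show "$iface"                  # interface exists and has an address
}
```

Calling check_ipoib ib0 on a node runs all four steps. Nothing happens until the function is invoked.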

MTU Configuration

IPoIB operates in one of two transport modes that determine the maximum supported MTU:

| Mode | Max MTU | Typical environments |
| --- | --- | --- |
| Datagram (UD) | 2,044 bytes | Cloud-managed IB (Azure HPC, AWS EFA) |
| Connected (RC) | 65,520 bytes | On-premises InfiniBand fabrics |

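Check the current mode before setting the MTU by reading the interface's mode file, for example cat /sys/class/net/ib0/mode (assuming the interface is named ib0).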

If the mode is datagram (common on cloud IPoIB), the hardware limit is 2,044 bytes. Setting the MTU to 9000 fails with RTNETLINK answers: Invalid argument. This is expected behavior in datagram mode, not a misconfiguration, and Alluxio works correctly at MTU 2,044.

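If the mode is connected (typical on-premises), set the MTU to 9000, the value this guide recommends for throughput (connected mode itself allows up to 65,520 bytes), for example with ip link set ib0 mtu 9000.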
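The mode check and the MTU choice can be combined into one step. This is a minimal sketch, assuming the interface is named ib0 and using the two MTU values given on this page. The function names recommended_mtu and set_ipoib_mtu are hypothetical.

```shell
# Print the MTU this page recommends for a given IPoIB mode string.
recommended_mtu() {
  case "$1" in
    datagram)  echo 2044 ;;   # hardware limit in datagram (UD) mode
    connected) echo 9000 ;;   # value this page suggests for connected (RC) mode
    *)         return 1 ;;    # unknown mode: refuse to guess
  esac
}

# Read the mode from sysfs and apply the matching MTU. Run as root.
set_ipoib_mtu() {
  iface="${1:-ib0}"           # interface name is an assumption
  mode="$(cat "/sys/class/net/$iface/mode")" || return 1
  mtu="$(recommended_mtu "$mode")" || return 1
  ip link set "$iface" mtu "$mtu"
}
```

Running set_ipoib_mtu ib0 never requests 9000 on a datagram-mode interface, so it avoids the Invalid argument failure.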

To persist the MTU setting across reboots, add it to your network configuration (e.g., /etc/network/interfaces or a systemd-networkd unit file).

Binding Alluxio to the IB Interface

The hot data path in Alluxio runs between workers and FUSE / client nodes, which is where IB bandwidth matters. The coordinator handles metadata operations and background jobs and is not on the data-serving critical path, so it does not need to run on IB-equipped hardware.

For general NIC binding configuration, see Cluster Management. The steps below extend that guidance specifically for IPoIB deployments.
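On a bare-metal node, the binding comes down to a few lines in alluxio-site.properties. The sketch below only prints those lines. The property names are the ones referenced in the Troubleshooting section of this page, the interface name ib0 is an assumption, and the helper name ib_bind_properties is hypothetical. Confirm the exact keys against Cluster Management for your release.

```shell
# Print the bind-device properties for a role ("worker" or "client")
# and an interface name (defaults to ib0, which is an assumption).
ib_bind_properties() {
  role="$1"; iface="${2:-ib0}"
  case "$role" in
    worker)
      echo "alluxio.worker.rpc.bind.device=$iface"
      echo "alluxio.worker.data.bind.device=$iface"
      ;;
    client)
      echo "alluxio.user.network.data.bind.device=$iface"
      ;;
    *) return 1 ;;            # unknown role
  esac
}
```

The output can be appended to the node's alluxio-site.properties, after which the affected process must be restarted for the change to take effect.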

On Kubernetes, IPoIB can be exposed to pods through:

  • NVIDIA Network Operator: Automates MLNX_OFED driver deployment and SR-IOV device plugin configuration

  • Multus CNI: Attaches a secondary IB network interface to Alluxio pods

  • SR-IOV Device Plugin: Exposes IB Virtual Functions (VFs) as pod resources

Refer to the NVIDIA Network Operator documentation and the Multus CNI documentation for setup instructions. Once the IB interface is available inside the pod, apply the same alluxio-site.properties bind settings used for bare-metal deployments.

Reference Performance

The following results are from an example test environment using IPoIB with Alluxio running on bare metal.

Test Environment

| Parameter | Value |
| --- | --- |
| Network | 2 × 200 Gbps IPoIB (bonded), measured throughput: 360 Gbps |
| NIC | Mellanox ConnectX-7 (IB link layer, 200 Gbps) |
| Cache disk | RAID0, 2 × NVMe, read/write: ~12 GB/s |
| UFS | Object storage via 100 Gbps dedicated line |
| Deployment | Bare metal, FUSE and worker co-located |

Network Layer (iperf3)

| Configuration | Measured Throughput |
| --- | --- |
| Single IB port | 180 Gbps |
| Bonded (2 × 200 Gbps IPoIB) | 360 Gbps |

Alluxio Read Throughput (Hot Read, Large Files, 32 Concurrent)

| Configuration | Sequential Read |
| --- | --- |
| 1 FUSE + 1 worker, 1 × NVMe | 6.3 GB/s |
| 1 FUSE + 1 worker, RAID0 2 × NVMe | 12.5 GB/s |
| 3 FUSE + 3 workers, RAID0 2 × NVMe | 36.6 GB/s |

Observation: With 3 workers and RAID0 NVMe cache, Alluxio hot read throughput reaches the aggregate cache-disk bandwidth (36.6 GB/s measured vs. roughly 3 × 12 GB/s = 36 GB/s nominal RAID0 bandwidth), which indicates that the cache disks, not the IPoIB network, are the limiting factor at this scale.
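The read figures are in GB/s and the network baseline is in Gbps, so converting units makes the headroom easier to see. The sketch below uses only numbers from the tables above. The helper name gbytes_to_gbits is illustrative.

```shell
# Convert GB/s (bytes) to Gbps (bits): multiply by 8.
gbytes_to_gbits() {
  awk -v x="$1" 'BEGIN { printf "%.1f\n", x * 8 }'
}

gbytes_to_gbits 12.5   # one node, RAID0 2 x NVMe  -> 100.0
gbytes_to_gbits 36.6   # three nodes combined      -> 292.8
```

A single node reading at 12.5 GB/s moves about 100 Gbps, which is below the 180 Gbps measured on one IB port with iperf3.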

Troubleshooting

Worker not binding to IB interface

  • Run ip addr show ib0 to confirm the interface has an IP address assigned.

  • Verify that alluxio.worker.rpc.bind.device matches the exact interface name (case-sensitive).

  • Check alluxio-worker.log for bind errors.

Workers serve data over Ethernet instead of IB

  • Verify that alluxio.worker.data.bind.device=ib0 is set on each worker node and that the worker process was restarted after the change.

  • Verify that alluxio.user.network.data.bind.device=ib0 is set on the FUSE / client node.

ip link set ib0 mtu 9000 fails with RTNETLINK answers: Invalid argument

  • Your IPoIB interface is in datagram mode, which caps MTU at 2,044 bytes. This is common on cloud-managed InfiniBand (Azure HPC, AWS EFA). Alluxio works correctly at the default MTU — no action needed. See MTU Configuration.

Low throughput despite IPoIB

  • Run iperf3 -c <other node ib0 IP> between nodes to establish a network-layer baseline.

  • Check cat /sys/class/net/ib0/mode — datagram mode (MTU 2,044) will limit peak throughput compared to connected mode (MTU 9,000).

  • Confirm all services are communicating over the IB interface: ss -tnp | grep <ib0 IP>.
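The last check can be scripted so that the IB address does not need to be typed by hand. This is a rough sketch, assuming the interface is named ib0 and has an IPv4 address. The helper names first_ipv4 and ib_connections are hypothetical.

```shell
# Extract the first IPv4 address from `ip -o -4 addr show dev <iface>`
# output supplied on stdin, dropping the /prefix suffix.
first_ipv4() {
  awk '$3 == "inet" { sub(/\/.*/, "", $4); print $4; exit }'
}

# List TCP connections whose local or peer address is the interface's IP.
ib_connections() {
  iface="${1:-ib0}"           # interface name is an assumption
  addr="$(ip -o -4 addr show dev "$iface" | first_ipv4)"
  [ -n "$addr" ] || return 1  # no IPv4 address on the interface
  ss -tn | grep -F " $addr:"
}
```

If ib_connections ib0 prints nothing while a read workload is running, traffic is probably flowing over another interface.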

IB interface missing after reboot

  • MTU and bonding settings may not have been persisted. Add them to the system network configuration.

  • Verify MLNX_OFED drivers load on boot: lsmod | grep ib_core
