# RDMA Networking

Alluxio supports several high-speed network technologies commonly deployed in AI and HPC clusters. This page covers configuration and performance guidance for each supported option.

| Technology                          | Status      | Use Case                                            |
| ----------------------------------- | ----------- | --------------------------------------------------- |
| IPoIB (IP over InfiniBand)          | ✅ Supported | Standard TCP/IP over IB hardware, zero code changes |
| RoCE (RDMA over Converged Ethernet) | Planned     | Low-latency RDMA over Ethernet fabric               |
| Native IB API (Verbs)               | Planned     | Ultra-low latency RDMA over InfiniBand fabric       |

## IPoIB

### Overview

InfiniBand (IB) is a high-bandwidth, low-latency interconnect commonly deployed in AI training clusters. Alluxio supports **IP over InfiniBand (IPoIB)**, which runs the standard TCP/IP stack over IB hardware. Because Alluxio communicates over standard TCP/IP sockets, no code changes or special drivers are required — you only need to load the IPoIB kernel module and bind Alluxio services to the IB network interface.

> **Applies to**: NICs configured with **InfiniBand link layer** (verified via `ibstat | grep "Link layer"`). If your ConnectX adapter is running in Ethernet link layer mode, it operates as a standard high-speed Ethernet NIC — Alluxio works with it natively with no IPoIB configuration needed.

#### IPoIB vs. Native RDMA

|                    | IPoIB                             | Native RDMA (Verbs API)               |
| ------------------ | --------------------------------- | ------------------------------------- |
| Protocol           | TCP/IP over IB hardware           | Kernel-bypass direct memory access    |
| Alluxio support    | ✅ Fully supported                 | Not supported in 3.8                  |
| Configuration      | Bind to IB network interface      | Requires RDMA-aware application code  |
| Performance        | 100–400 Gbps (hardware-dependent) | Lower latency, similar peak bandwidth |

### Prerequisites

#### Hardware

* Mellanox/NVIDIA ConnectX-6 or ConnectX-7 network adapter (or equivalent)
* InfiniBand switch fabric

#### Software

Load the IPoIB kernel module and verify that the IB drivers and interfaces are active:

```shell
# Load the IPoIB kernel module
modprobe ib_ipoib

# Verify OFED drivers are loaded and link layer is InfiniBand
ibstat
# Expected: adapter state "Active", link layer: InfiniBand

# List IB network interfaces
ip addr show | grep -E "^[0-9]+: ib"
# Expected: one or more ib* interfaces (e.g., ib0, ibs22)

# Confirm the IB interface has an IP address
ip addr show ib0
# Expected: inet <IP>/prefix scope global ib0

# Verify InfiniBand device is accessible
ibv_devinfo
# Expected: hca_id, port_state: PORT_ACTIVE
```
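Before running the checks above, it can help to confirm the required tools and kernel module are present at all. The sketch below assumes the tools come from an MLNX\_OFED or `rdma-core` install; it only reports what is missing and makes no changes:

```shell
# Preflight sketch: report missing IB tooling before running the checks above.
# Tool names assume an MLNX_OFED or rdma-core installation.
for tool in ibstat ibv_devinfo ip; do
  command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool (install rdma-core / OFED)"
done
lsmod | grep -q '^ib_ipoib' || echo "ib_ipoib module not loaded (run: modprobe ib_ipoib)"
```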

#### MTU Configuration

IPoIB operates in one of two transport modes that determine the maximum supported MTU:

| Mode           | Max MTU      | Typical environments                  |
| -------------- | ------------ | ------------------------------------- |
| Datagram (UD)  | 2,044 bytes  | Cloud-managed IB (Azure HPC, AWS EFA) |
| Connected (RC) | 65,520 bytes | On-premises InfiniBand fabrics        |

Check the current mode before setting MTU:

```shell
cat /sys/class/net/ib0/mode
ip link show ib0 | grep mtu
```

If mode is `datagram` (common on cloud IPoIB), the hardware limit is 2,044 bytes. Setting MTU to 9000 will fail with `RTNETLINK answers: Invalid argument` — this is expected, not an error. Alluxio works correctly at MTU 2,044.

If mode is `connected` (typical on-premises), set MTU to 9000 for maximum throughput:

```shell
ip link set ib0 mtu 9000

# Verify
ip link show ib0 | grep mtu
# Expected: mtu 9000
```

To persist the MTU setting across reboots, add it to your network configuration (e.g., `/etc/network/interfaces` or a systemd-networkd unit file).
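For systemd-networkd, a minimal unit sketch looks like the following. The file name, interface name, and address are placeholders for your environment:

```ini
# /etc/systemd/network/10-ib0.network  (example; adjust Name and Address)
[Match]
Name=ib0

[Link]
MTUBytes=9000

[Network]
Address=10.0.0.11/24
```

Apply it with `systemctl restart systemd-networkd` (or `networkctl reload` on newer systemd versions) and re-check `ip link show ib0`.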

### Binding Alluxio to the IB Interface

The hot data path in Alluxio runs between **workers** and **FUSE / client nodes** — this is where IB bandwidth matters. The coordinator handles control-plane work (metadata operations, background jobs) and is not on the data-serving critical path, so it does not need to run on IB-equipped hardware.

For general NIC binding configuration, see [Cluster Management](https://documentation.alluxio.io/ee-ai-en/administration/managing-alluxio). The steps below extend that guidance specifically for IPoIB deployments.

#### Worker Configuration

Add the following to `alluxio-site.properties` on each worker node. Replace `ib0` with your actual IB interface name (check with `ip addr show`):

```properties
# Bind all worker services to the IB network interface
alluxio.worker.rpc.bind.device=ib0
alluxio.worker.data.bind.device=ib0
alluxio.worker.web.bind.device=ib0
alluxio.worker.rest.bind.device=ib0
```

Verify after starting the worker:

```shell
# Confirm worker RPC port is listening on the IB interface IP
ss -tlnp | grep 29999
# Expected: the listening address matches the IP of ib0
```
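To automate this comparison, a small helper can extract the interface's IPv4 address and grep for it directly. `iface_ip` is an illustrative function (not an Alluxio tool) and assumes iproute2 is installed:

```shell
# Hypothetical helper: print the first IPv4 address assigned to an interface.
iface_ip() {
  ip -4 -o addr show dev "$1" 2>/dev/null | awk '{print $4}' | cut -d/ -f1 | head -n1
}

# Usage on a worker node (ib0 assumed):
#   ss -tlnp | grep "$(iface_ip ib0):29999"
```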

#### FUSE / Client Configuration

For nodes running Alluxio FUSE or direct client access, bind the data channel to the IB interface:

```properties
# Bind the client data channel to the IB network interface
alluxio.user.network.data.bind.device=ib0
```

For FUSE mount options and prerequisites (including `allow_other` configuration), see [POSIX API (FUSE)](https://documentation.alluxio.io/ee-ai-en/data-access/fuse-based-posix-api).

#### Coordinator Configuration

The coordinator does not need to run on IB-equipped hardware. Set `alluxio.coordinator.hostname` to the coordinator node's reachable IP address (typically its Ethernet interface):

```properties
alluxio.coordinator.hostname=<coordinator IP>
```

#### Verify End-to-End Connectivity

After starting all services, confirm that worker–client data traffic flows over the IB interface:

```shell
# Check active connections on the IB interface IP
ss -tnp | grep <IB interface IP>
# Expected: ESTABLISHED connections between workers and FUSE clients

# Confirm IB traffic during reads (watch rx_bytes increment on ib0)
cat /sys/class/net/ib0/statistics/rx_bytes
```
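To turn the raw counter into a rate, a sketch like the one below samples `rx_bytes` twice and prints the average. The `sample_rx` helper and the interval are illustrative, not part of Alluxio:

```shell
# Illustrative helper: sample a NIC's rx_bytes counter over an interval
# and print the average receive rate in MB/s.
sample_rx() {
  local dev=$1 interval=${2:-5}
  local f=/sys/class/net/$dev/statistics/rx_bytes
  local a b
  a=$(cat "$f"); sleep "$interval"; b=$(cat "$f")
  echo "$(( (b - a) / interval / 1024 / 1024 )) MB/s"
}

# Run while a read benchmark is in flight, e.g.: sample_rx ib0 5
```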

### Kubernetes Deployment

For Kubernetes clusters with IB hardware, IPoIB can be exposed to pods through:

* **NVIDIA Network Operator**: Automates MLNX\_OFED driver deployment and SR-IOV device plugin configuration
* **Multus CNI**: Attaches a secondary IB network interface to Alluxio pods
* **SR-IOV Device Plugin**: Exposes IB Virtual Functions (VFs) as pod resources

Refer to [NVIDIA Network Operator documentation](https://docs.nvidia.com/networking/display/cokan10/network+operator) and [Multus CNI](https://github.com/k8snetworkplumbingwg/multus-cni) for setup instructions. Once the IB interface is available inside the pod, apply the same `alluxio-site.properties` settings described above.
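As an illustration of the Multus route, a `NetworkAttachmentDefinition` might look like the following sketch. It assumes the `ipoib` CNI plugin and the `whereabouts` IPAM plugin are installed; the attachment name, master interface, and address range are all placeholders:

```yaml
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: ipoib-network
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "ipoib",
    "master": "ib0",
    "ipam": {
      "type": "whereabouts",
      "range": "192.168.100.0/24"
    }
  }'
```

Alluxio pods then request the attachment with the `k8s.v1.cni.cncf.io/networks: ipoib-network` annotation, and the secondary interface (e.g., `net1`) becomes the bind-device target inside the pod.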

### Reference Performance

The following results are from an example test environment using IPoIB with Alluxio running on bare metal.

#### Test Environment

| Parameter  | Value                                                      |
| ---------- | ---------------------------------------------------------- |
| Network    | 2 × 200 Gbps IPoIB (bonded), measured throughput: 360 Gbps |
| NIC        | Mellanox ConnectX-7 (IB link layer, 200 Gbps)              |
| Cache disk | RAID0, 2 × NVMe, read/write: \~12 GB/s                     |
| UFS        | Object storage via 100 Gbps dedicated line                 |
| Deployment | Bare metal, FUSE and worker co-located                     |

#### Network Layer (iperf3)

| Configuration               | Measured Throughput |
| --------------------------- | ------------------- |
| Single IB port              | 180 Gbps            |
| Bonded (2 × 200 Gbps IPoIB) | 360 Gbps            |

#### Alluxio Read Throughput (Hot Read, Large Files, 32 Concurrent)

| Configuration                      | Sequential Read |
| ---------------------------------- | --------------- |
| 1 FUSE + 1 worker, 1 × NVMe        | 6.3 GB/s        |
| 1 FUSE + 1 worker, RAID0 2 × NVMe  | 12.5 GB/s       |
| 3 FUSE + 3 workers, RAID0 2 × NVMe | **36.6 GB/s**   |

> **Observation**: With 3 workers and RAID0 NVMe cache, Alluxio hot read throughput (36.6 GB/s) reaches the aggregate disk bandwidth ceiling (3 nodes × \~12 GB/s RAID0 ≈ 36 GB/s), confirming that the IPoIB network is not the bottleneck at this scale.

### Troubleshooting

**Worker not binding to IB interface**

* Run `ip addr show ib0` to confirm the interface has an IP address assigned.
* Verify that `alluxio.worker.rpc.bind.device` matches the exact interface name (case-sensitive).
* Check `alluxio-worker.log` for `bind` errors.

**Workers serve data over Ethernet instead of IB**

* Verify that `alluxio.worker.data.bind.device=ib0` is set on each worker node and that the worker process was restarted after the change.
* Verify that `alluxio.user.network.data.bind.device=ib0` is set on the FUSE / client node.

**`ip link set ib0 mtu 9000` fails with `RTNETLINK answers: Invalid argument`**

* Your IPoIB interface is in datagram mode, which caps MTU at 2,044 bytes. This is common on cloud-managed InfiniBand (Azure HPC, AWS EFA). Alluxio works correctly at the default MTU — no action needed. See [MTU Configuration](#mtu-configuration).

**Low throughput despite IPoIB**

* Run `iperf3 -c <other node ib0 IP>` between nodes to establish a network-layer baseline.
* Check `cat /sys/class/net/ib0/mode` — datagram mode (MTU 2,044) will limit peak throughput compared to connected mode (MTU 9,000).
* Confirm all services are communicating over the IB interface: `ss -tnp | grep <ib0 IP>`.

**IB interface missing after reboot**

* MTU and bonding settings may not have been persisted. Add them to the system network configuration.
* Verify MLNX\_OFED drivers load on boot: `lsmod | grep ib_core`
