# RDMA Networking

Alluxio supports several high-speed network technologies commonly deployed in AI and HPC clusters. This page covers configuration and performance guidance for each supported option.

| Technology                          | Status      | Use Case                                            |
| ----------------------------------- | ----------- | --------------------------------------------------- |
| IPoIB (IP over InfiniBand)          | ✅ Supported | Standard TCP/IP over IB hardware, zero code changes |
| RoCE (RDMA over Converged Ethernet) | Planned     | Low-latency RDMA over Ethernet fabric               |
| Native IB API (Verbs)               | Planned     | Ultra-low latency RDMA over InfiniBand fabric       |

## IPoIB

### Overview

InfiniBand (IB) is a high-bandwidth, low-latency interconnect commonly deployed in AI training clusters. Alluxio supports **IP over InfiniBand (IPoIB)**, which runs the standard TCP/IP stack over IB hardware. Because Alluxio communicates over standard TCP/IP sockets, no code changes or special drivers are required — you only need to load the IPoIB kernel module and bind Alluxio services to the IB network interface.

> **Applies to**: NICs configured with **InfiniBand link layer** (verified via `ibstat | grep "Link layer"`). If your ConnectX adapter is running in Ethernet link layer mode, it operates as a standard high-speed Ethernet NIC — Alluxio works with it natively with no IPoIB configuration needed.
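
For example, the link-layer check and its two possible outcomes (the annotated output lines are illustrative):

```shell
# Check which link layer the adapter is running
ibstat | grep "Link layer"
# "Link layer: InfiniBand" -> follow the IPoIB steps on this page
# "Link layer: Ethernet"   -> the NIC behaves as a standard Ethernet NIC; no IPoIB setup is needed
```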

#### IPoIB vs. Native RDMA

|                    | IPoIB                             | Native RDMA (Verbs API)               |
| ------------------ | --------------------------------- | ------------------------------------- |
| Protocol           | TCP/IP over IB hardware           | Bypass kernel, direct memory access   |
| Alluxio support    | ✅ Fully supported                 | Not supported in 3.8                  |
| Configuration      | Bind to IB network interface      | Requires RDMA-aware application code  |
| Typical throughput | 100–400 Gbps (hardware-dependent) | Lower latency, similar peak bandwidth |

### Prerequisites

#### Hardware

* Mellanox/NVIDIA ConnectX-6 or ConnectX-7 network adapter (or equivalent)
* InfiniBand switch fabric

#### Software

Load the IPoIB kernel module and verify that the IB drivers and interfaces are active:

```shell
# Load the IPoIB kernel module
modprobe ib_ipoib

# Verify OFED drivers are loaded and link layer is InfiniBand
ibstat
# Expected: adapter state "Active", link layer: InfiniBand

# List IB network interfaces
ip addr show | grep -E "^[0-9]+: ib"
# Expected: one or more ib* interfaces (e.g., ib0, ibs22)

# Confirm the IB interface has an IP address
ip addr show ib0
# Expected: inet <IP>/prefix scope global ib0

# Verify InfiniBand device is accessible
ibv_devinfo
# Expected: hca_id, port_state: PORT_ACTIVE
```

#### MTU Configuration

IPoIB operates in one of two transport modes that determine the maximum supported MTU:

| Mode           | Max MTU      | Typical environments                  |
| -------------- | ------------ | ------------------------------------- |
| Datagram (UD)  | 2,044 bytes  | Cloud-managed IB (Azure HPC, AWS EFA) |
| Connected (RC) | 65,520 bytes | On-premises InfiniBand fabrics        |

Check the current mode before setting MTU:

```shell
cat /sys/class/net/ib0/mode
ip link show ib0 | grep mtu
```

If mode is `datagram` (common on cloud IPoIB), the hardware limit is 2,044 bytes. Setting MTU to 9000 will fail with `RTNETLINK answers: Invalid argument` — this is expected, not an error. Alluxio works correctly at MTU 2,044.

If mode is `connected` (typical on-premises), set MTU to 9000 for maximum throughput:

```shell
ip link set ib0 mtu 9000

# Verify
ip link show ib0 | grep mtu
# Expected: mtu 9000
```

To persist the MTU setting across reboots, add it to your network configuration (e.g., `/etc/network/interfaces` or a systemd-networkd unit file).
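
A minimal sketch for systemd-networkd, assuming that service already manages `ib0` (the unit file name is illustrative; check `/etc/systemd/network/` or `networkctl status ib0` for the actual unit):

```shell
# Append the MTU setting to the .network unit that matches ib0 (file name is illustrative)
sudo tee -a /etc/systemd/network/10-ib0.network <<'EOF'

[Link]
MTUBytes=9000
EOF

# Apply the updated unit
sudo systemctl restart systemd-networkd
```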

### Binding Alluxio to the IB Interface

The hot data path in Alluxio runs between **workers** and **FUSE / client nodes** — this is where IB bandwidth matters. The coordinator handles background tasks (metadata operations, background jobs) and is not on the data-serving critical path, so it does not need to run on IB-equipped hardware.

For general NIC binding configuration, see [Cluster Management](/ee-ai-en/administration/managing-alluxio.md). The steps below extend that guidance specifically for IPoIB deployments.

{% tabs %}
{% tab title="Kubernetes (Operator)" %}
IPoIB can be exposed to pods through:

* **NVIDIA Network Operator**: Automates MLNX\_OFED driver deployment and SR-IOV device plugin configuration
* **Multus CNI**: Attaches a secondary IB network interface to Alluxio pods
* **SR-IOV Device Plugin**: Exposes IB Virtual Functions (VFs) as pod resources

Refer to [NVIDIA Network Operator documentation](https://docs.nvidia.com/networking/display/cokan10/network+operator) and [Multus CNI](https://github.com/k8snetworkplumbingwg/multus-cni) for setup instructions. Once the IB interface is available inside the pod, apply the same `alluxio-site.properties` settings from the Bare-Metal tab.
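
Whichever mechanism you use, confirm that the secondary IB interface is actually visible inside the Alluxio pods before applying the bind settings. The pod name below is a placeholder:

```shell
# Verify the IB interface is attached inside an Alluxio worker pod
kubectl exec -it <alluxio-worker-pod> -- ip addr show
# Expected: an additional interface (commonly net1 with Multus, or ib0 with an SR-IOV VF)
# holding an IP address on the IB subnet
```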
{% endtab %}

{% tab title="Docker / Bare-Metal" %}
**Worker Configuration**

Add the following to `alluxio-site.properties` on each worker node. Replace `ib0` with your actual IB interface name (check with `ip addr show`):

```properties
# Bind all worker services to the IB network interface
alluxio.worker.rpc.bind.device=ib0
alluxio.worker.data.bind.device=ib0
alluxio.worker.web.bind.device=ib0
alluxio.worker.rest.bind.device=ib0
```

Verify after starting the worker:

```shell
# Confirm worker RPC port is listening on the IB interface IP
ss -tlnp | grep 29999
# Expected: the listening address matches the IP of ib0
```

**FUSE / Client Configuration**

For nodes running Alluxio FUSE or direct client access, bind the data channel to the IB interface:

```properties
# Bind the client data channel to the IB network interface
alluxio.user.network.data.bind.device=ib0
```

For FUSE mount options and prerequisites (including `allow_other` configuration), see [POSIX API (FUSE)](/ee-ai-en/data-access/fuse-based-posix-api.md).

**Coordinator Configuration**

The coordinator does not need to run on IB-equipped hardware. Set `alluxio.coordinator.hostname` to the coordinator node's reachable IP address (typically its Ethernet interface):

```properties
alluxio.coordinator.hostname=<coordinator IP>
```

**Verify End-to-End Connectivity**

After starting all services, confirm that worker–client data traffic flows over the IB interface:

```shell
# Check active connections on the IB interface IP
ss -tnp | grep <IB interface IP>
# Expected: ESTABLISHED connections between workers and FUSE clients

# Confirm IB traffic during reads (watch rx_bytes increment on ib0)
cat /sys/class/net/ib0/statistics/rx_bytes
```
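
As a more concrete check on a client node that reads from a remote worker, compare the `ib0` receive counter before and after reading a cached file through the FUSE mount (the mount path and file name are placeholders):

```shell
# Measure how much data arrived on ib0 during a 1 GiB read through the FUSE mount
before=$(cat /sys/class/net/ib0/statistics/rx_bytes)
dd if=/mnt/alluxio/<cached file> of=/dev/null bs=1M count=1024
after=$(cat /sys/class/net/ib0/statistics/rx_bytes)
echo "ib0 received $(( (after - before) / 1024 / 1024 )) MiB during the read"
```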

{% endtab %}
{% endtabs %}

### Reference Performance

The following results are from an example test environment using IPoIB with Alluxio running on bare metal.

#### Test Environment

| Parameter  | Value                                                      |
| ---------- | ---------------------------------------------------------- |
| Network    | 2 × 200 Gbps IPoIB (bonded), measured throughput: 360 Gbps |
| NIC        | Mellanox ConnectX-7 (IB link layer, 200 Gbps)              |
| Cache disk | RAID0, 2 × NVMe, read/write: \~12 GB/s                     |
| UFS        | Object storage via 100 Gbps dedicated line                 |
| Deployment | Bare metal, FUSE and worker co-located                     |

#### Network Layer (iperf3)

| Configuration               | Measured Throughput |
| --------------------------- | ------------------- |
| Single IB port              | 180 Gbps            |
| Bonded (2 × 200 Gbps IPoIB) | 360 Gbps            |
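
These baselines can be reproduced with a point-to-point `iperf3` run between the `ib0` addresses of two nodes; the parallel-stream count and duration below are illustrative:

```shell
# On the receiving node
iperf3 -s

# On the sending node (use the other node's ib0 IP; -P streams and -t duration are illustrative)
iperf3 -c <other node ib0 IP> -P 8 -t 30
```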

#### Alluxio Read Throughput (Hot Read, Large Files, 32 Concurrent)

| Configuration                      | Sequential Read |
| ---------------------------------- | --------------- |
| 1 FUSE + 1 worker, 1 × NVMe        | 6.3 GB/s        |
| 1 FUSE + 1 worker, RAID0 2 × NVMe  | 12.5 GB/s       |
| 3 FUSE + 3 workers, RAID0 2 × NVMe | **36.6 GB/s**   |

> **Observation**: With 3 workers and RAID0 NVMe cache, Alluxio hot read throughput approaches the raw disk bandwidth ceiling (36.6 GB/s measured vs. the \~36 GB/s theoretical RAID0 maximum of 3 nodes at \~12 GB/s each), confirming that the IPoIB network is not the bottleneck at this scale.
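
A comparable hot-read load can be generated against the FUSE mount with a tool such as `fio`; the sketch below assumes the target file is already cached in Alluxio, and the file path, block size, and runtime are illustrative:

```shell
# 32 concurrent sequential readers of an already-cached file through the FUSE mount
# (file path, block size, and runtime are illustrative)
fio --name=alluxio-hot-read --filename=/mnt/alluxio/<cached file> \
    --rw=read --bs=1m --numjobs=32 --time_based --runtime=60 \
    --ioengine=psync --group_reporting
```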

### Troubleshooting

**Worker not binding to IB interface**

* Run `ip addr show ib0` to confirm the interface has an IP address assigned.
* Verify that `alluxio.worker.rpc.bind.device` matches the exact interface name (case-sensitive).
* Check `alluxio-worker.log` for `bind` errors.

**Workers serve data over Ethernet instead of IB**

* Verify that `alluxio.worker.data.bind.device=ib0` is set on each worker node and that the worker process was restarted after the change.
* Verify that `alluxio.user.network.data.bind.device=ib0` is set on the FUSE / client node.

**`ip link set ib0 mtu 9000` fails with `RTNETLINK answers: Invalid argument`**

* Your IPoIB interface is in datagram mode, which caps MTU at 2,044 bytes. This is common on cloud-managed InfiniBand (Azure HPC, AWS EFA). Alluxio works correctly at the default MTU — no action needed. See [MTU Configuration](#mtu-configuration).

**Low throughput despite IPoIB**

* Run `iperf3 -c <other node ib0 IP>` between nodes to establish a network-layer baseline.
* Check `cat /sys/class/net/ib0/mode` — datagram mode (MTU 2,044) will limit peak throughput compared to connected mode (MTU 9,000).
* Confirm all services are communicating over the IB interface: `ss -tnp | grep <ib0 IP>`.

**IB interface missing after reboot**

* MTU and bonding settings may not have been persisted. Add them to the system network configuration.
* Verify MLNX\_OFED drivers load on boot: `lsmod | grep ib_core`; a sketch for persisting the `ib_ipoib` module load follows this list.
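
A minimal sketch for persisting the IPoIB module load on a systemd-based distribution (the file name is illustrative):

```shell
# Load ib_ipoib automatically at boot (file name is illustrative)
echo ib_ipoib | sudo tee /etc/modules-load.d/ipoib.conf
```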

