# RDMA Networking

Alluxio supports several high-speed network technologies commonly deployed in AI and HPC clusters. This page covers configuration and performance guidance for each supported option.

| Technology                          | Status        | Use Case                                                        |
| ----------------------------------- | ------------- | --------------------------------------------------------------- |
| Native RDMA over InfiniBand or RoCE | ✅ Supported   | Zero-copy data transfer, bypasses kernel network stack entirely |
| IPoIB (IP over InfiniBand)          | ✅ Supported   | Standard TCP/IP over IB hardware                                |
| AWS EFA                             | Not supported |                                                                 |
| iWARP                               | Not supported |                                                                 |

## Native RDMA

### Overview

Native RDMA bypasses the kernel network stack entirely, enabling zero-copy data transfer between Alluxio workers and clients. Compared to [IPoIB](#ipoib) (which runs TCP/IP over InfiniBand hardware), native RDMA delivers significantly lower latency and higher throughput for cached data reads.

Native RDMA is supported only on InfiniBand and RoCE. AWS EFA and iWARP are not supported.

Native RDMA is best suited for:

* Clusters where workers and clients are connected via InfiniBand or RoCE
* Latency-sensitive workloads (model loading, inference serving)
* Maximizing Alluxio cache read throughput

### Prerequisites

#### Hardware

* InfiniBand or RoCE-capable NICs (e.g., Mellanox/NVIDIA ConnectX-4 or later)
* InfiniBand fabric or RoCE-enabled Ethernet switches
* RDMA NICs installed on **all worker nodes** and **all client/FUSE nodes** in the data path
* AWS EFA and iWARP adapters are not supported for native RDMA

#### Software

* Linux OS with RDMA kernel modules loaded (kernel 5.0+ recommended)
* MLNX\_OFED ≥ 25.10 (or DOCA-OFED) or upstream `rdma-core` ≥ 35
* OpenUCX 1.20 or higher installed on all nodes
* JUCX (Java bindings for UCX) — bundled with Alluxio

#### Environment Verification

Before enabling Alluxio native RDMA, verify that the RDMA environment is correctly configured.

**Verify RDMA devices and port status**

```shell
ibv_devinfo
```

Expected output (key fields):

```console
hca_id: mlx5_0
     port:  1
        state:         PORT_ACTIVE (4)
        link_layer:    InfiniBand
        rate:          200 Gb/sec
```

Verify:

* `state` is `PORT_ACTIVE`
* `link_layer` is `InfiniBand` (native IB) or `Ethernet` (IB NIC in Ethernet mode using RoCE)

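If `ibv_devinfo` is not found, install the `rdma-core` user-space tools (the `ibverbs-utils` package on Ubuntu/Debian, `libibverbs-utils` on RHEL/CentOS) or the MLNX\_OFED drivers.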
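The two checks above can be scripted so that a provisioning job fails fast when a port is down. The following is a minimal sketch, assuming `ibv_devinfo` prints `hca_id:`, `port:`, `state:`, and `link_layer:` lines in the order shown in the sample output; the function name `check_rdma_ports` is hypothetical.

```shell
# Reads `ibv_devinfo` output on stdin and prints one line per port:
#   <device>:<port> <state> <link_layer>
# Exits non-zero if no port is found or if any port is not PORT_ACTIVE.
check_rdma_ports() {
  awk '
    $1 == "hca_id:"     { dev = $2 }
    $1 == "port:"       { port = $2 }
    $1 == "state:"      { state = $2 }
    $1 == "link_layer:" {
      printf "%s:%s %s %s\n", dev, port, state, $2
      total++
      if (state != "PORT_ACTIVE") bad++
    }
    END { exit (total == 0 || bad > 0) ? 1 : 0 }
  '
}

# Usage on a node with the RDMA tools installed:
#   ibv_devinfo | check_rdma_ports || echo "RDMA ports are not ready"
```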

**Install and verify UCX**

Alluxio uses OpenUCX for native RDMA transport. Install UCX 1.20 or later on all worker and client nodes.

Via package manager (Ubuntu/Debian):

```shell
sudo apt-get install -y libucx-dev ucx-utils
```

Via package manager (RHEL/CentOS):

```shell
sudo yum install -y ucx ucx-devel
```

Or build from source for the latest version:

```shell
# Install build prerequisites
sudo apt-get install -y libibverbs-dev librdmacm-dev libnuma-dev
```

```shell
wget https://github.com/openucx/ucx/releases/download/v1.20.0/ucx-1.20.0.tar.gz
tar xzf ucx-1.20.0.tar.gz
cd ucx-1.20.0
./contrib/configure-release --prefix=/usr/local --with-verbs --with-rdmacm
make -j$(nproc)
sudo make install
sudo ldconfig
```

Configure flags explained:

| Flag            | Purpose                                                                |
| --------------- | ---------------------------------------------------------------------- |
| `--with-verbs`  | Enable InfiniBand Verbs transport (required for RDMA data transfer)    |
| `--with-rdmacm` | Enable RDMA Connection Manager (required for connection establishment) |

Verify installation:

```shell
ucx_info -v
```

Expected output: UCX version 1.20 or later.
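To gate a deployment script on this version requirement, the comparison can be done with `sort -V` from GNU coreutils. This is a minimal sketch: the helper name `version_ge` is hypothetical, and the extraction in the usage comment assumes that the first dotted number printed by `ucx_info -v` is the library version, which may vary between builds.

```shell
# Succeeds when version $1 is greater than or equal to version $2.
# Relies on GNU `sort -V` for version-aware ordering.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage:
#   ucx_ver=$(ucx_info -v | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1)
#   version_ge "$ucx_ver" 1.20 || echo "UCX $ucx_ver is older than 1.20"
```

A plain string comparison would rank `1.9` above `1.20`, which is why a version-aware sort is used here.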

Verify that RDMA transports are available:

```shell
ucx_info -d | grep -E "rc_verbs|rc_mlx5|rdmacm"
```

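Expected output: one or more transport devices listed. If none appear on a node where `ibv_devinfo` reports an active port, UCX was most likely built without RDMA support because the RDMA development libraries were not found at configure time. Reinstall `libibverbs-dev` and `librdmacm-dev`, then rebuild UCX.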

For more installation options, see the [OpenUCX documentation](https://openucx.readthedocs.io/en/master/running.html#ucx-build-and-install).

**Verify bRPC environment (recommended)**

bRPC provides RDMA-accelerated metadata RPCs and is recommended for best performance in RDMA environments. Verify the bRPC native library is available:

```shell
ls /usr/lib/libbrpc* /usr/local/lib/libbrpc* 2>/dev/null
```

Expected output: one or more `libbrpc*.so` files. If not found, install the bRPC native library separately. Native RDMA data transfer (UCX) works without bRPC, but metadata RPCs will fall back to gRPC over TCP.

### Quick Start

This section provides the minimum steps to enable native RDMA. See [Configuration](#configuration) for advanced tuning.

Add the following to `alluxio-site.properties` on **all worker and client nodes**:

```properties
# Enable RDMA data transfer (UCX) for both workers and clients
alluxio.network.rdma.data.enabled=true

# Enable RDMA-accelerated metadata RPCs (bRPC) — recommended
alluxio.network.rdma.metadata.enabled=true
```

Restart all worker and client processes. Verify the worker RDMA service is running:

```shell
grep -i "RdmaServer" /opt/alluxio/logs/worker.log | tail -5
```

> **Important**: When `alluxio.network.rdma.data.enabled=true`, clients use RDMA exclusively for data reads — there is no automatic fallback to TCP. If a worker's RDMA port is unreachable, the read operation fails (or retries on other replica workers if available).
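Because there is no TCP fallback, it can be worth confirming the effective settings on every node before restarting. The sketch below reads a key from a properties file; `prop_value` is a hypothetical helper, and the configuration path in the usage comment is an assumption about the install layout.

```shell
# Prints the value of property $2 from properties file $1.
# Comment lines are skipped; if a key is repeated, the last value wins.
prop_value() {
  awk -F= -v key="$2" '
    /^[ \t]*#/ { next }
    {
      k = $1
      gsub(/^[ \t]+|[ \t]+$/, "", k)
      if (k == key) {
        v = substr($0, index($0, "=") + 1)
        gsub(/^[ \t]+|[ \t]+$/, "", v)
      }
    }
    END { print v }
  ' "$1"
}

# Usage:
#   conf=/opt/alluxio/conf/alluxio-site.properties
#   for k in alluxio.network.rdma.data.enabled alluxio.network.rdma.metadata.enabled; do
#     echo "$k=$(prop_value "$conf" "$k")"
#   done
```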

### Configuration

#### Data Acceleration

| Property                                | Default   | Description                                                                                                                                                                                                                                                                                                                                                                       |
| --------------------------------------- | --------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `alluxio.network.rdma.data.enabled`     | `false`   | Enables RDMA data transfer on workers and clients                                                                                                                                                                                                                                                                                                                                 |
| `alluxio.network.rdma.data.bind.device` | (unset)   | RDMA Verbs device name(s) for the **data plane** (e.g., `mlx5_0:1`). Device names are obtained via `ibv_devinfo` (format: `device_name:port_number`). Multiple devices can be comma-separated (e.g., `mlx5_0:1,mlx5_1:1`). If a single device is specified, Alluxio auto-resolves the RDMA listener address from that NIC. If unset, UCX auto-selects all available RDMA devices. |
| `alluxio.network.rdma.data.bind.host`   | `0.0.0.0` | Bind host/IP for the worker RDMA data listener. This is mainly used when bind address auto-resolution from `alluxio.network.rdma.data.bind.device` is not desired or when multiple RDMA devices are configured.                                                                                                                                                                   |
| `alluxio.network.rdma.data.hostname`    | (unset)   | Hostname or IP that clients use to connect to the worker RDMA data service. If unset, the worker's general hostname is used.                                                                                                                                                                                                                                                      |
| `alluxio.network.rdma.data.port`        | `59999`   | Port for the worker RDMA data service.                                                                                                                                                                                                                                                                                                                                            |

> In most single-NIC deployments, the defaults work without extra address configuration. For multi-NIC environments, set `alluxio.network.rdma.data.hostname` to the client-reachable IP, use `alluxio.network.rdma.data.bind.device` to pin the RDMA NIC, and override `alluxio.network.rdma.data.bind.host` if you do not want the bind address inferred from a single RDMA device.
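The `device_name:port_number` value for `alluxio.network.rdma.data.bind.device` can be derived from `ibv_devinfo` output instead of being typed by hand. A minimal sketch, assuming the `hca_id:`, `port:`, and `state:` line layout shown under Environment Verification; `active_bind_devices` is a hypothetical helper.

```shell
# Reads `ibv_devinfo` output on stdin and prints a comma-separated list of
# active ports in device_name:port_number form, e.g. mlx5_0:1,mlx5_1:1.
active_bind_devices() {
  awk '
    $1 == "hca_id:" { dev = $2 }
    $1 == "port:"   { port = $2 }
    $1 == "state:" && $2 == "PORT_ACTIVE" {
      sep = (list == "") ? "" : ","
      list = list sep dev ":" port
    }
    END { print list }
  '
}

# Usage:
#   echo "alluxio.network.rdma.data.bind.device=$(ibv_devinfo | active_bind_devices)"
```

If the result contains more than one device, the listener address is no longer inferred from a single NIC, so review `alluxio.network.rdma.data.bind.host` and `alluxio.network.rdma.data.hostname` as described in the note above.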

#### Buffer Pool

Native RDMA data transfer uses a registered memory buffer pool. Tune these settings when workloads need higher RDMA concurrency, larger prefetch capacity, or lower first-read latency.

| Property                                             | Default | Description                                                                                                                                                             |
| ---------------------------------------------------- | ------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `alluxio.network.rdma.data.buffer.pool.initial.size` | `256MB` | Initial size budget of the RDMA registered buffer pool at startup.                                                                                                      |
| `alluxio.network.rdma.data.buffer.pool.prewarm.size` | `256MB` | Amount of RDMA registered buffer pool memory to pre-warm in the 4MB chunk bucket at startup. This must not exceed `alluxio.network.rdma.data.buffer.pool.initial.size`. |
| `alluxio.network.rdma.data.buffer.pool.max.size`     | `8GB`   | Maximum size of the RDMA registered buffer pool. The pool does not expand beyond this limit, and RDMA prefetching stops when the pool is full.                          |
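The ordering constraint between these three sizes can be checked before a restart. The sketch below assumes binary units (1 MB = 1024 × 1024 bytes) and additionally assumes that the initial size should not exceed the maximum size, which the table implies but does not state; `to_bytes` and `pool_sizes_ok` are hypothetical helpers.

```shell
# Converts a size such as 256MB or 8GB to bytes (binary units assumed).
to_bytes() {
  num=${1%[KMGT]B}
  case "$1" in
    *KB) echo $((num * 1024)) ;;
    *MB) echo $((num * 1024 * 1024)) ;;
    *GB) echo $((num * 1024 * 1024 * 1024)) ;;
    *TB) echo $((num * 1024 * 1024 * 1024 * 1024)) ;;
    *)   echo "$1" ;;   # no recognized suffix: treat as a plain byte count
  esac
}

# Succeeds when prewarm <= initial <= max.
# Arguments: prewarm size, initial size, max size.
pool_sizes_ok() {
  prewarm=$(to_bytes "$1")
  initial=$(to_bytes "$2")
  max=$(to_bytes "$3")
  [ "$prewarm" -le "$initial" ] && [ "$initial" -le "$max" ]
}

# Usage with the defaults from the table:
#   pool_sizes_ok 256MB 256MB 8GB && echo "buffer pool sizes are consistent"
```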

#### RDMA Prefetcher

The RDMA prefetcher reuses the legacy async prefetcher configuration keys. These properties have the same sequential-read semantics as the standard client-side async prefetcher, but RDMA uses them to control RDMA chunk prefetching instead of TCP-based async prefetch reads.

For the general async prefetch behavior, see [File Reading](/ee-ai-en/performance/file-reading.md#tune-client-side-prefetching-for-sequential-reads).

The following properties are the recommended RDMA prefetcher tunables. Other inherited async prefetcher properties are not recommended for customization unless advised by Alluxio support.

| Property                                                                      | Default | RDMA behavior                                                                                                                                                                                 |
| ----------------------------------------------------------------------------- | ------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `alluxio.user.position.reader.streaming.async.prefetch.max.part.number`       | `8`     | Maximum number of RDMA chunks to prefetch ahead for each stream. This is shared with the legacy async prefetcher.                                                                             |
| `alluxio.user.position.reader.streaming.async.prefetch.file.length.threshold` | `0`     | If a file length is less than or equal to this value, the RDMA prefetcher attempts to prefetch the whole file immediately. Values less than or equal to `0` disable this small-file behavior. |

Recommended tuning:

* For large-file reads, if client and worker memory resources are sufficient, set `alluxio.user.position.reader.streaming.async.prefetch.max.part.number=16` and raise `alluxio.network.rdma.data.buffer.pool.max.size` accordingly (for example, to `16GB`).
* For small-file reads through FUSE, enable `alluxio.fuse.open.read.status.cache.enabled=true` to reduce repeated read-only open metadata RPCs, and set `alluxio.user.position.reader.streaming.async.prefetch.file.length.threshold` to the largest file size you want RDMA to prefetch immediately.
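To see why the pool limit is raised together with the prefetch depth, it can help to estimate how much registered memory concurrent streams might hold. The sketch below assumes that each prefetched chunk occupies one 4 MB buffer (the chunk bucket size mentioned under Buffer Pool) and that every stream fills its prefetch window. Both are simplifying assumptions rather than documented behavior, and `prefetch_demand_mb` is a hypothetical helper.

```shell
# Estimated prefetch memory in MB:
#   concurrent streams * chunks prefetched per stream * chunk size in MB
# The chunk size defaults to 4 (an assumption, see the lead-in).
prefetch_demand_mb() {
  echo $(( $1 * $2 * ${3:-4} ))
}

prefetch_demand_mb 64 8     # 64 streams at the default depth of 8: prints 2048
prefetch_demand_mb 64 16    # the same streams at depth 16: prints 4096
```

Under these assumptions, doubling the prefetch depth doubles the worst-case demand on the pool, so a limit that was comfortable at depth 8 may need to grow at depth 16.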

#### Metadata Acceleration

| Property                                  | Default | Description                                                                                                                                                  |
| ----------------------------------------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `alluxio.network.rdma.metadata.enabled`   | `false` | Enables the experimental worker RDMA metadata path. When disabled, workers do not start the metadata transport server and clients use the existing RPC path. |
| `alluxio.network.rdma.metadata.port`      | `29995` | Port for the worker RDMA metadata service.                                                                                                                   |
| `alluxio.network.rdma.metadata.device`    | (unset) | RDMA device used by the metadata service. When unset, the metadata transport chooses its default active device.                                              |
| `alluxio.network.rdma.metadata.ib.port`   | `1`     | InfiniBand port used by the metadata service.                                                                                                                |
| `alluxio.network.rdma.metadata.gid.index` | `-1`    | GID index used by the metadata service. Negative means the metadata transport chooses its default GID.                                                       |

> The worker RDMA metadata service uses the worker host/bind host for endpoint addressing. `alluxio.network.rdma.metadata.port` controls the service port, while `alluxio.network.rdma.metadata.device`, `alluxio.network.rdma.metadata.ib.port`, and `alluxio.network.rdma.metadata.gid.index` control which RDMA device, port, and GID the native metadata transport uses.

#### Diagnostics

| Property                                            | Default | Description                                                                         |
| --------------------------------------------------- | ------- | ----------------------------------------------------------------------------------- |
| `alluxio.network.rdma.data.trace.enabled`           | `false` | Enable detailed per-operation RDMA data transport trace statistics in logs.         |
| `alluxio.network.rdma.data.trace.print.interval.ms` | `20s`   | Interval for printing RDMA data transport trace statistics when tracing is enabled. |

### Benchmark

Alluxio provides an RDMA benchmark CLI command to verify connectivity and measure raw performance before running production workloads.

**Command Help**

```shell
alluxio exec rdmaBenchmark --help
```

Available flags:

| Flag            | Default         | Description                                                                        |
| --------------- | --------------- | ---------------------------------------------------------------------------------- |
| `--mode`        | (auto-inferred) | `server` or `client`; defaults to `client` if `--remote-ip` is set, else `server`  |
| `--local-ip`    | (auto-detected) | Local IP address                                                                   |
| `--local-port`  | `20600`         | Local port                                                                         |
| `--delay-us`    | `0`             | Simulated server-side processing delay in microseconds                             |
| `--remote-ip`   | -               | Remote IP address (client mode only, required)                                     |
| `--remote-port` | `20600`         | Remote port (client mode only)                                                     |
| `--case`        | `all`           | Test case: `basic`, `latency`, `stress`, `stress-write`, `stress-read`, `all`      |
| `--duration`    | `10`            | Benchmark duration in seconds                                                      |
| `--numjobs`     | `-1`            | Number of concurrent threads (-1 for auto-adjustment based on payload size)        |
| `--iodepth`     | `-1`            | Max in-flight operations per thread (-1 for auto-adjustment based on payload size) |

**Start Server**

Start the benchmark server on one node:

```shell
cd /opt/alluxio
./bin/alluxio exec rdmaBenchmark
```

Example output:

```console
[INFO] RDMA Server running on 10.0.1.10:20600 with Delay 0 us. Press Ctrl+C to exit.
```

**Run Latency Test**

Start the client on another RDMA-connected node and run a latency test:

```shell
cd /opt/alluxio
./bin/alluxio exec rdmaBenchmark --remote-ip 10.0.1.10 --case latency --duration 10
```

Example output:

```console
===============================================================
               RDMA ZERO-COPY BENCHMARK SUITE
===============================================================
[INFO] Test started at: 2026-04-14 08:16:02
[INFO] Connecting to 10.0.1.10:20600...

[INFO] Test Case: latency
[INFO] Duration: 10 seconds

--- LATENCY TESTS (Sequential, Zero-Copy) ---
[LATENCY - WRITE] Payload: 4096    Bytes | Iterations: 5000
  -> Min:     1.2 us | Avg:     1.8 us | P50:     1.7 us | P99:     3.5 us | P99.9:    4.8 us | Max:     6.2 us
[LATENCY - READ ] Payload: 4096    Bytes | Iterations: 5000
  -> Min:     0.8 us | Avg:     1.3 us | P50:     1.2 us | P99:     2.8 us | P99.9:    3.5 us | Max:     4.5 us

[LATENCY - WRITE] Payload: 65536   Bytes | Iterations: 5000
  -> Min:     1.8 us | Avg:     2.5 us | P50:     2.3 us | P99:     5.2 us | P99.9:    7.5 us | Max:     9.8 us
[LATENCY - READ ] Payload: 65536   Bytes | Iterations: 5000
  -> Min:     1.5 us | Avg:     2.2 us | P50:     2.0 us | P99:     4.8 us | P99.9:    6.5 us | Max:     8.2 us

[LATENCY - WRITE] Payload: 1048576 Bytes | Iterations: 5000
  -> Min:     4.5 us | Avg:     7.2 us | P50:     6.8 us | P99:    18.5 us | P99.9:   24.2 us | Max:    28.5 us
[LATENCY - READ ] Payload: 1048576 Bytes | Iterations: 5000
  -> Min:     4.8 us | Avg:     7.8 us | P50:     7.5 us | P99:    20.5 us | P99.9:   26.8 us | Max:    32.5 us

===============================================================
                   ALL TESTS COMPLETED
===============================================================
```
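When the latency test is run repeatedly (for example, after a driver upgrade), the P99 values can be pulled out of the report for comparison. A minimal sketch, assuming the output keeps the line layout shown in the example above; `latency_p99` is a hypothetical helper.

```shell
# Reads benchmark output on stdin and prints one line per latency result:
#   <operation> <payload bytes> <P99 in microseconds>
latency_p99() {
  awk '
    /^\[LATENCY/ {
      op = $3
      sub(/\]/, "", op)   # "WRITE]" becomes "WRITE"; "READ" is unchanged
      for (i = 1; i < NF; i++) if ($i == "Payload:") payload = $(i + 1)
    }
    /P99:/ {
      for (i = 1; i < NF; i++) if ($i == "P99:") print op, payload, $(i + 1)
    }
  '
}

# Usage:
#   ./bin/alluxio exec rdmaBenchmark --remote-ip 10.0.1.10 --case latency | latency_p99
```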

**Run Stress Read Test**

Run a stress read test with custom concurrency parameters:

```shell
cd /opt/alluxio
./bin/alluxio exec rdmaBenchmark --remote-ip 10.0.1.10 --case stress-read \
  --duration 10 --numjobs 8 --iodepth 32
```

Example output:

```console
===============================================================
               RDMA ZERO-COPY BENCHMARK SUITE
===============================================================
[INFO] Test started at: 2026-04-14 08:19:13
[INFO] Connecting to 10.0.1.10:20600...

[INFO] Test Case: stress-read
[INFO] Duration: 10 seconds

--- BANDWIDTH STRESS TESTS (Multi-Threaded Concurrent, Zero-Copy) ---
[BANDWIDTH - READ ] Payload: 4096    Bytes | Duration: 10 sec
  -> Config: threads=8, inFlightPerThread=32
  -> [Live]  1280000 ops/s |   42.00 Gbps
  -> [Live]  1360000 ops/s |   44.59 Gbps
  -> [Live]  1420000 ops/s |   46.55 Gbps
  -> [Live]  1385000 ops/s |   45.41 Gbps
  -> [Live]  1445000 ops/s |   47.37 Gbps
  -> [Live]  1398000 ops/s |   45.84 Gbps
  -> [Live]  1472000 ops/s |   48.25 Gbps
  -> [Live]  1508000 ops/s |   49.43 Gbps
  -> [Live]  1536000 ops/s |   50.33 Gbps

[BANDWIDTH - READ ] Payload: 65536   Bytes | Duration: 10 sec
  -> Config: threads=8, inFlightPerThread=32
  -> [Live]   285440 ops/s |  149.68 Gbps
  -> [Live]   295680 ops/s |  155.04 Gbps
  -> [Live]   298240 ops/s |  156.39 Gbps
  -> [Live]   301312 ops/s |  158.00 Gbps
  -> [Live]   304128 ops/s |  159.47 Gbps
  -> [Live]   306560 ops/s |  160.75 Gbps
  -> [Live]   308480 ops/s |  161.75 Gbps
  -> [Live]   310144 ops/s |  162.63 Gbps
  -> [Live]   312064 ops/s |  163.64 Gbps
  -> [Live]   313600 ops/s |  164.45 Gbps

[BANDWIDTH - READ ] Payload: 1048576 Bytes | Duration: 10 sec
  -> Config: threads=8, inFlightPerThread=32
  -> [Live]    18880 ops/s |  158.46 Gbps
  -> [Live]    19264 ops/s |  161.68 Gbps
  -> [Live]    19456 ops/s |  163.29 Gbps
  -> [Live]    19584 ops/s |  164.37 Gbps
  -> [Live]    19648 ops/s |  164.91 Gbps
  -> [Live]    19712 ops/s |  165.45 Gbps
  -> [Live]    19776 ops/s |  165.98 Gbps
  -> [Live]    19840 ops/s |  166.52 Gbps
  -> [Live]    19904 ops/s |  167.05 Gbps

===============================================================
                   ALL TESTS COMPLETED
===============================================================
```
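The `[Live]` samples can be reduced to one peak figure per payload size, which is easier to compare against the link rate. A minimal sketch, assuming the output keeps the line layout shown in the example above; `peak_gbps` is a hypothetical helper.

```shell
# Reads benchmark output on stdin and prints, for each bandwidth test,
#   <operation> <payload bytes> <highest [Live] Gbps sample>
peak_gbps() {
  awk '
    /^\[BANDWIDTH/ {
      op = $3
      sub(/\]/, "", op)
      for (i = 1; i < NF; i++) if ($i == "Payload:") payload = $(i + 1)
      key = op " " payload
      order[++n] = key
    }
    /\[Live\]/ {
      gbps = $(NF - 1) + 0   # second-to-last field is the Gbps value
      if (gbps > peak[key]) peak[key] = gbps
    }
    END { for (j = 1; j <= n; j++) printf "%s %.2f\n", order[j], peak[order[j]] }
  '
}

# Usage:
#   ./bin/alluxio exec rdmaBenchmark --remote-ip 10.0.1.10 --case stress-read | peak_gbps
```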

### Troubleshooting

**No RDMA devices found**

```shell
ls /sys/class/infiniband/
```

If the directory is empty:

* Verify the NIC is physically installed: `lspci | grep -i mellanox`
* Verify RDMA kernel modules are loaded: `lsmod | grep ib_core`
* Install or reinstall the MLNX\_OFED driver

**RDMA port not in ACTIVE state**

* Check physical cable connections
* Verify switch configuration
* Run `ibstat` for detailed adapter and port status

**UCX initialization failure**

* Verify UCX is installed: `ucx_info -v`
* Check if `UCX_NET_DEVICES` is set to the correct device (e.g., `mlx5_0:1`)
* Ensure `LD_LIBRARY_PATH` includes the UCX library path

**RDMA operation timeouts**

* Increase `alluxio.network.rdma.data.operation.timeout.ms` (e.g., `60s`)
* Check network health with `ibstat` and verify that the port state is `Active`
* Verify the worker RDMA port is reachable from client nodes

**Buffer pool exhaustion**

* If frequent allocation failures occur, increase `alluxio.network.rdma.data.buffer.pool.max.size`
* If first-read latency spikes are observed, increase `alluxio.network.rdma.data.buffer.pool.initial.size`

**RDMA port busy after restart**

* The RDMA connection manager port may remain busy for up to 10 seconds after process exit. The server retries automatically with 1-second backoff.
* If the port remains stuck, verify no other process is using it:

  ```shell
  ss -tlnp | grep 59999
  ```

**Client RDMA reads fail**

* RDMA does not fall back to TCP. Ensure all target workers have `alluxio.network.rdma.data.enabled=true` and the RDMA port is reachable.
* Check client logs for RDMA connection errors.

## IPoIB

### Overview

InfiniBand (IB) is a high-bandwidth, low-latency interconnect commonly deployed in AI training clusters. Alluxio supports **IP over InfiniBand (IPoIB)**, which runs the standard TCP/IP stack over IB hardware. Because Alluxio communicates over standard TCP/IP sockets, no code changes or special drivers are required — you only need to load the IPoIB kernel module and bind Alluxio services to the IB network interface.

> **Applies to**: NICs configured with **InfiniBand link layer** (verified via `ibstat | grep "Link layer"`). If your ConnectX adapter is running in Ethernet link layer mode, it operates as a standard high-speed Ethernet NIC — Alluxio works with it natively with no IPoIB configuration needed.

#### IPoIB vs. Native RDMA

|                    | IPoIB                             | Native RDMA                                   |
| ------------------ | --------------------------------- | --------------------------------------------- |
| Protocol           | TCP/IP over IB hardware           | Bypass kernel, zero-copy direct memory access |
| Alluxio support    | ✅ Fully supported                 | ✅ Fully supported                             |
| Configuration      | Bind to IB network interface      | See [Configuration](#configuration)           |
| Typical throughput | 100–400 Gbps (hardware-dependent) | Lower latency, similar peak bandwidth         |

### Prerequisites

#### Hardware

* Mellanox/NVIDIA ConnectX-4 or later network adapter
* InfiniBand switch fabric

#### Software

Load the IPoIB kernel module and verify that the IB drivers and interfaces are active:

```shell
# Load the IPoIB kernel module
modprobe ib_ipoib

# Verify OFED drivers are loaded and link layer is InfiniBand
ibstat
# Expected: adapter state "Active", link layer: InfiniBand

# List IB network interfaces
ip addr show | grep -E "^[0-9]+: ib"
# Expected: one or more ib* interfaces (e.g., ib0, ibs22)

# Confirm the IB interface has an IP address
ip addr show ib0
# Expected: inet <IP>/prefix scope global ib0

# Verify InfiniBand device is accessible
ibv_devinfo
# Expected: hca_id listed, state: PORT_ACTIVE
```

#### MTU Configuration

IPoIB operates in one of two transport modes that determine the maximum supported MTU:

| Mode           | Max MTU      | Typical environments                  |
| -------------- | ------------ | ------------------------------------- |
| Datagram (UD)  | 2,044 bytes  | Cloud-managed IB (e.g., Azure HPC)    |
| Connected (RC) | 65,520 bytes | On-premises InfiniBand fabrics        |

Check the current mode before setting MTU:

```shell
cat /sys/class/net/ib0/mode
ip link show ib0 | grep mtu
```

If mode is `datagram` (common on cloud IPoIB), the hardware limit is 2,044 bytes. Setting MTU to 9000 will fail with `RTNETLINK answers: Invalid argument` — this is expected, not an error. Alluxio works correctly at MTU 2,044.

If mode is `connected` (typical on-premises), set MTU to 9000 for maximum throughput:

```shell
ip link set ib0 mtu 9000

# Verify
ip link show ib0 | grep mtu
# Expected: mtu 9000
```

To persist the MTU setting across reboots, add it to your network configuration (e.g., `/etc/network/interfaces` or a systemd-networkd unit file).
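The mode check and the MTU choice described above can be combined in a small helper for provisioning scripts. A minimal sketch using the values from this section; `ipoib_target_mtu` is a hypothetical name.

```shell
# Prints the MTU this page suggests for an IPoIB mode string
# (the content of /sys/class/net/<interface>/mode).
ipoib_target_mtu() {
  case "$1" in
    connected) echo 9000 ;;
    datagram)  echo 2044 ;;
    *)         echo "unknown IPoIB mode: $1" >&2; return 1 ;;
  esac
}

# Usage (requires root; ib0 is the example interface name used on this page):
#   mtu=$(ipoib_target_mtu "$(cat /sys/class/net/ib0/mode)") && ip link set ib0 mtu "$mtu"
```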

### Binding Alluxio to the IB Interface

The hot data path in Alluxio runs between **workers** and **FUSE / client nodes**, so this is where IB bandwidth matters. The coordinator handles metadata operations and background jobs and is not on the data-serving critical path, so it does not need to run on IB-equipped hardware.

For general NIC binding configuration, see [Cluster Management](/ee-ai-en/administration/managing-alluxio.md). The steps below extend that guidance specifically for IPoIB deployments.

{% tabs %}
{% tab title="Kubernetes (Operator)" %}
IPoIB can be exposed to pods through:

* **NVIDIA Network Operator**: Automates MLNX\_OFED driver deployment and SR-IOV device plugin configuration
* **Multus CNI**: Attaches a secondary IB network interface to Alluxio pods
* **SR-IOV Device Plugin**: Exposes IB Virtual Functions (VFs) as pod resources

Refer to [NVIDIA Network Operator documentation](https://docs.nvidia.com/networking/display/cokan10/network+operator) and [Multus CNI](https://github.com/k8snetworkplumbingwg/multus-cni) for setup instructions. Once the IB interface is available inside the pod, apply the same `alluxio-site.properties` settings from the Bare-Metal tab.
{% endtab %}

{% tab title="Docker / Bare-Metal" %}
**Worker Configuration**

Add the following to `alluxio-site.properties` on each worker node. Replace `ib0` with your actual IB interface name (check with `ip addr show`):

```properties
# Bind all worker services to the IB network interface
alluxio.worker.rpc.bind.device=ib0
alluxio.worker.data.bind.device=ib0
alluxio.worker.web.bind.device=ib0
alluxio.worker.rest.bind.device=ib0
```

Verify after starting the worker:

```shell
# Confirm worker RPC port is listening on the IB interface IP
ss -tlnp | grep 29999
# Expected: the listening address matches the IP of ib0
```

**FUSE / Client Configuration**

For nodes running Alluxio FUSE or direct client access, bind the data channel to the IB interface:

```properties
# Bind the client data channel to the IB network interface
alluxio.user.network.data.bind.device=ib0
```

For FUSE mount options and prerequisites (including `allow_other` configuration), see [POSIX API (FUSE)](/ee-ai-en/data-access/fuse-based-posix-api.md).

**Coordinator Configuration**

The coordinator does not need to run on IB-equipped hardware. Set `alluxio.coordinator.hostname` to the coordinator node's reachable IP address (typically its Ethernet interface):

```properties
alluxio.coordinator.hostname=<coordinator IP>
```

**Verify End-to-End Connectivity**

After starting all services, confirm that worker–client data traffic flows over the IB interface:

```shell
# Check active connections on the IB interface IP
ss -tnp | grep <IB interface IP>
# Expected: ESTABLISHED connections between workers and FUSE clients

# Confirm IB traffic during reads (watch rx_bytes increment on ib0)
cat /sys/class/net/ib0/statistics/rx_bytes
```
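Rather than watching the counter by eye, two samples of `rx_bytes` can be turned into a throughput figure. A minimal sketch; `counter_gbps` is a hypothetical helper and `ib0` is the example interface name.

```shell
# Converts two byte-counter samples taken $3 seconds apart into Gbps.
# Arguments: first sample, second sample, interval in seconds.
counter_gbps() {
  awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN { printf "%.2f\n", (b - a) * 8 / s / 1e9 }'
}

# Usage: sample ib0 ten seconds apart while a read workload is running.
#   before=$(cat /sys/class/net/ib0/statistics/rx_bytes); sleep 10
#   after=$(cat /sys/class/net/ib0/statistics/rx_bytes)
#   counter_gbps "$before" "$after" 10
```

A result near zero during an active read suggests the traffic is using a different interface.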

{% endtab %}
{% endtabs %}

### Reference Performance

The following results are from an example test environment using IPoIB with Alluxio running on bare metal.

#### Test Environment

| Parameter  | Value                                                      |
| ---------- | ---------------------------------------------------------- |
| Network    | 2 × 200 Gbps IPoIB (bonded), measured throughput: 360 Gbps |
| NIC        | Mellanox ConnectX-7 (IB link layer, 200 Gbps)              |
| Cache disk | RAID0, 2 × NVMe, read/write: \~12 GB/s                     |
| UFS        | Object storage via 100 Gbps dedicated line                 |
| Deployment | Bare metal, FUSE and worker co-located                     |

#### Network Layer (iperf3)

| Configuration               | Measured Throughput |
| --------------------------- | ------------------- |
| Single IB port              | 180 Gbps            |
| Bonded (2 × 200 Gbps IPoIB) | 360 Gbps            |

#### Alluxio Read Throughput (Hot Read, Large Files, 32 Concurrent)

| Configuration                      | Sequential Read |
| ---------------------------------- | --------------- |
| 1 FUSE + 1 worker, 1 × NVMe        | 6.3 GB/s        |
| 1 FUSE + 1 worker, RAID0 2 × NVMe  | 12.5 GB/s       |
| 3 FUSE + 3 workers, RAID0 2 × NVMe | **36.6 GB/s**   |

> **Observation**: With 3 workers and RAID0 NVMe cache, Alluxio hot read throughput reaches the aggregate cache-disk bandwidth (36.6 GB/s measured vs. about 3 × 12 GB/s = 36 GB/s across the three RAID0 arrays), which indicates that the IPoIB network is not the bottleneck at this scale.
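The read results are reported in GB/s while the network figures are in Gbps, so comparing them requires converting bytes to bits. A small sketch with a hypothetical helper, `gbytes_to_gbps`:

```shell
# Converts GB/s (bytes per second) to Gbps (bits per second): multiply by 8.
gbytes_to_gbps() {
  awk -v g="$1" 'BEGIN { printf "%.1f\n", g * 8 }'
}

gbytes_to_gbps 12.5   # one worker on RAID0: prints 100.0
gbytes_to_gbps 36.6   # three workers combined: prints 292.8
```

At roughly 100 Gbps per worker, the single-node result sits well below the 180 Gbps measured for one IB port, which is consistent with the disks rather than the network setting the limit.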

### Troubleshooting

**Worker not binding to IB interface**

* Run `ip addr show ib0` to confirm the interface has an IP address assigned.
* Verify that `alluxio.worker.rpc.bind.device` matches the exact interface name (case-sensitive).
* Check the worker log (`logs/worker.log` under the Alluxio installation directory, e.g., `/opt/alluxio/logs/worker.log`) for `bind` errors.

**Workers serve data over Ethernet instead of IB**

* Verify that `alluxio.worker.data.bind.device=ib0` is set on each worker node and that the worker process was restarted after the change.
* Verify that `alluxio.user.network.data.bind.device=ib0` is set on the FUSE / client node.

**`ip link set ib0 mtu 9000` fails with `RTNETLINK answers: Invalid argument`**

* Your IPoIB interface is in datagram mode, which caps MTU at 2,044 bytes. This is common on cloud-managed InfiniBand (e.g., Azure HPC). Alluxio works correctly at the default MTU, so no action is needed. See [MTU Configuration](#mtu-configuration).

**Low throughput despite IPoIB**

* Run `iperf3 -c <other node ib0 IP>` between nodes to establish a network-layer baseline.
* Check `cat /sys/class/net/ib0/mode` — datagram mode (MTU 2,044) will limit peak throughput compared to connected mode (MTU 9,000).
* Confirm all services are communicating over the IB interface: `ss -tnp | grep <ib0 IP>`.

**IB interface missing after reboot**

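* Confirm that the IPoIB module is loaded after boot: `lsmod | grep ib_ipoib`. A module loaded manually with `modprobe` is not reloaded automatically; if it is missing, add `ib_ipoib` to the boot-time module configuration (e.g., a file under `/etc/modules-load.d/`).
* Verify MLNX\_OFED drivers load on boot: `lsmod | grep ib_core`
* If the interface is present but misconfigured, the IP address, MTU, or bonding settings may not have been persisted. Add them to the system network configuration.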

