RDMA Support
RDMA
Before diving in, here is some essential background on RDMA:
- RDMA: A network communication technology that bypasses the operating system kernel. Its core lies in directly accessing remote memory through the network card, avoiding the data copying and context switching overhead of the traditional TCP/IP protocol stack.
- NVIDIA GPUDirect1: Connects GPU memory directly to the network card's DMA engine. When the GPU needs to communicate with a remote node, data can be sent directly over InfiniBand or RoCE NICs without being staged through host memory.
- Network Virtualization: Macvlan and SR-IOV are two common network virtualization solutions. Macvlan lets containers create virtual interfaces that appear as independent devices on the physical network, while SR-IOV uses hardware virtualization to split a single physical function (PF) of a NIC into multiple virtual functions (VFs), each of which can be assigned directly to a Pod.
- Technical Path: Currently, RDMA mainly has two implementation methods: InfiniBand and RoCE2. InfiniBand natively supports the RDMA protocol, requiring a dedicated switch and subnet manager to build an independent network, which is costly; whereas RoCEv2 is based on traditional Ethernet infrastructure, using flow control mechanisms such as PFC and ECN to ensure lossless transmission, and is widely used by internet companies.
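As a quick practical check of which path a given node uses, ibstat reports the link layer per port: "InfiniBand" for native IB, "Ethernet" for RoCE. A minimal sketch (run against captured text here so it works anywhere; on a real node you would pipe `ibstat` into it):

```shell
# Count ports per link layer in ibstat-style output.
link_layers() { grep -c "Link layer: $1"; }

sample='Link layer: InfiniBand
Link layer: InfiniBand'

printf '%s\n' "$sample" | link_layers InfiniBand
# -> 2
```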
Our laboratory adopts the InfiniBand solution. Therefore, we first check the IB information of the relevant devices:
1. Testing InfiniBand-related information on a single node
First, we conduct tests on the host machine. Before moving to the cloud, the IB on these machines was functional:
$ ibdev2netdev
mlx5_0 port 1 ==> ibxxxxxx0 (Up)
mlx5_1 port 1 ==> ibxxxxxx1 (Up)
$ ibstat
CA 'mlx5_0'
Port 1:
Link layer: InfiniBand
CA 'mlx5_1'
Port 1:
Link layer: InfiniBand
- Up: Indicates that the InfiniBand port is active and has established a link to the network.
- Down: Indicates that the InfiniBand port is inactive or has no network link.
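To spot problem ports across many devices, filtering ibdev2netdev for anything not Up is handy; a small sketch against sample text (the interface names are placeholders):

```shell
# Print only ports that are not Up.
down_ports() { grep -v '(Up)'; }

sample='mlx5_0 port 1 ==> ib0 (Up)
mlx5_1 port 1 ==> ib1 (Down)'

printf '%s\n' "$sample" | down_ports
# -> mlx5_1 port 1 ==> ib1 (Down)
```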
2. Using Ansible to batch check node network cards
Group definition:
[ib-v100]
xx.xx.xx.[xx:xx]
[ib-a100]
xx.xx.xx.[xx:xx]
Writing a batch query script:
---
- name: Run ibdev2netdev on InfiniBand hosts
hosts: ib-v100,ib-a100
gather_facts: no
tasks:
- name: Execute ibdev2netdev command
ansible.builtin.command: ibdev2netdev
register: ibdev_output
changed_when: false
- name: Display ibdev2netdev output
ansible.builtin.debug:
var: ibdev_output.stdout_lines
Since the full output is long, I won't paste it all here. From the ibdev2netdev output, we can see that the two types of nodes in the cluster have different InfiniBand configurations:
V100 Nodes
mlx5_0 port 1 ==> ibxxxxxx0 (Up)
mlx5_1 port 1 ==> ibxxxxxx1 (Up)
Each of these nodes has a dual-port IB network card, with each port capable of up to 100 Gb/s, connected to two 36-port IB switches; the two switches are interconnected by four 100 Gb/s links.
- Each node has two independent InfiniBand ports (mlx5_0 and mlx5_1)
- Both ports are in the Up state.
A100 Nodes
mlx5_0 port 1 ==> ibxxxx0 (Down/Up)
mlx5_1 port 1 ==> ibxxxxx0 (Up/Down)
mlx5_bond_0 port 1 ==> bond0 (Up)
Each of these machines has two 200 Gb/s IB cards connected to an IB switch. However, not all of the cards are in service: on each node, only one IB card is actually cabled to the switch.
mlx5_bond_0 is an Ethernet bond interface; it shows up in the listing because it is also backed by Mellanox hardware.
Subsequently, when installing the RDMA device plugin in Kubernetes, we need the network interface information.
Installing Nvidia Network Operator
[!quote] Network Operator Deployment on Vanilla Kubernetes Cluster
Currently, the most recommended way to integrate RDMA into Kubernetes is through the Nvidia Network Operator. Following the official documentation, we first install the Operator itself with Helm; the concrete RDMA access method is then configured by deploying an additional CR.
First, add the Helm repository:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
Then, following the documentation, download values.yaml locally. The main things to check are whether NFD should be disabled and whether the images should be replaced with registry addresses reachable from your network.
Since our cluster has already deployed the Nvidia GPU Operator, we choose to turn off the NFD option.
[!warning] Since several parameters should be provided when creating custom resources during operator deployment, it is recommended to use a configuration file. While it is possible to override the parameters via CLI, we recommend to avoid the use of CLI arguments in favor of a configuration file.
helm show values nvidia/network-operator --version v25.1.0 > values.yaml
Then install the latest version (v25.1.0) of the Nvidia Network Operator:
helm upgrade --install network-operator nvidia/network-operator \
-n nvidia-network-operator \
--create-namespace \
--version v25.1.0 \
-f ./values.yaml \
--wait
After installation, the nvidia-network-operator namespace contains the Operator's Pod. At this point, RDMA is not yet configured; a concrete policy still has to be applied on top.
$ kubectl get pods -l app.kubernetes.io/name=network-operator
NAME READY STATUS RESTARTS AGE
network-operator-xxxxxxxx-xxxxx 1/1 Running 1 (22h ago) 26h
Setting NicClusterPolicy
For a beginner, the documentation here is rather opaque:
The Deployment Examples chapter lists nearly 20 deployment methods, which immediately raises questions:
- What are the performance differences among these deployment methods?
- How to choose a deployment method that suits your needs?
- After deployment, how to make the Pod access RDMA and other high-performance networks?
- What are the minimum requirements for running RDMA testing in containers?
- How to test RDMA networks in containers?
- What are the common errors and their solutions?
The documentation does not answer these questions, so my exploration was also very difficult. First, I will quickly summarize my current understanding of these questions and reference materials:
- Performance Differences: IPoIB (IP over InfiniBand) vs. RDMA performance; in addition, the Shared Device Plugin can reach full bandwidth when only one Pod requests the resource, though the multi-Pod case has not been tested yet.
- Deployment Method: We currently use the RDMA Shared Device Plugin, which runs normally on the V100 nodes. It is unclear whether this method can use bonded NICs, so we may switch to Host Network mode later.
- Resource Application: After installation, the nodes expose RDMA-related resources; in some cases, the secondary network to attach must also be declared in Pod annotations (e.g., via Multus or Macvlan).
- Minimum Requirements: Verify if the image supports RDMA--Machine Learning Platform - Volcano Engine
- How to Test: Prepare a cluster for running RDMA workloads and GPU-Direct RDMA workloads.
- Errors and Solutions: See the end of this article
1. Attempt to configure RDMA Shared Device Plugin
[!quote] Network Operator Deployment with Multiple Resources in RDMA Shared Device Plugin
Since my single cluster contains two different IB networks (V100 and A100), I use the Multiple Resources configuration described in the documentation, specifying the V100 and A100 ports and exposing the resources rdma/rdma_v100 and rdma/rdma_a100.
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
name: nic-cluster-policy
spec:
ofedDriver:
image: doca-driver
repository: nvcr.io/nvidia/mellanox
version: 25.01-0.6.0.0-0
forcePrecompiled: false
imagePullSecrets: []
terminationGracePeriodSeconds: 300
startupProbe:
initialDelaySeconds: 10
periodSeconds: 20
livenessProbe:
initialDelaySeconds: 30
periodSeconds: 30
readinessProbe:
initialDelaySeconds: 10
periodSeconds: 30
upgradePolicy:
autoUpgrade: true
maxParallelUpgrades: 1
safeLoad: false
drain:
enable: true
force: true
podSelector: ""
timeoutSeconds: 300
deleteEmptyDir: true
rdmaSharedDevicePlugin:
# [map[ifNames:[ens1f0 ens1f1] name:rdma_shared_device_a] map[ifNames:[ens2f0 ens2f1] name:rdma_shared_device_b]]
repository: ghcr.io/mellanox
image: k8s-rdma-shared-dev-plugin
version: v1.5.2
imagePullSecrets: []
# The config below directly propagates to k8s-rdma-shared-device-plugin configuration.
# Replace 'devices' with your (RDMA capable) netdevice name.
config: |
{
"configList": [
{
"resourceName": "rdma_v100",
"rdmaHcaMax": 63,
"selectors": {
"ifNames": ["ibxxxxxx0","ibxxxxxx1"],
"linkTypes": ["infiniband"]
}
},
{
"resourceName": "rdma_a100",
"rdmaHcaMax": 63,
"selectors": {
"ifNames": ["ibxxxx0","ibxxxxx0"],
"linkTypes": ["infiniband"]
}
}
]
}
After applying the policy, notice that the DaemonSets are started. Thanks to NFD, they are not scheduled onto nodes without IB cards (PCI vendor 15b3).
$ kg daemonset
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
mofed-ubuntu22.04-xxxxxxxxx-ds 36 36 36 36 36 feature.node.kubernetes.io/kernel-version.full=5.15.0-134-generic,feature.node.kubernetes.io/pci-15b3.present=true,feature.node.kubernetes.io/system-os_release.ID=ubuntu,feature.node.kubernetes.io/system-os_release.VERSION_ID=22.04 24h
rdma-shared-dp-ds 36 36 36 36 36 feature.node.kubernetes.io/pci-15b3.present=true,network.nvidia.com/operator.mofed.wait=false
The Nvidia Network Operator installation includes the OFED driver and the device plugin. The former runs privileged and affects the IB driver on the host; during my testing, this triggered a flood of errors on one A100 node's IB card, the error logs filled the system disk, and service was interrupted for several hours.
After all Pods are Running, verify whether new resources have been added to the nodes:
$ kubectl get nodes -o json | jq -r '.items[] | {
name: .metadata.name,
"rdma/rdma_v100": .status.capacity["rdma/rdma_v100"]
} | select(.["rdma/rdma_v100"] != null)'
# Omit the same results
{
"name": "xxx-v100-xx",
"rdma/rdma_v100": "63"
}
{
"name": "xxx-a100-xx",
"rdma/rdma_a100": "63"
}
At this point, installation via the RDMA Shared Device Plugin is complete. Some products on ByteDance's Volcano Engine appear to use this method.
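One pitfall worth flagging: the plugin's `config` field is JSON embedded inside a YAML string, so the YAML parser will not catch a typo in it. A quick validity check before applying is cheap (the inline JSON below is a trimmed stand-in for the real config; `jq -e '.configList | length'` works equally well):

```shell
# Validate the embedded device-plugin config as JSON and count resource entries.
config='{"configList":[{"resourceName":"rdma_v100","rdmaHcaMax":63},{"resourceName":"rdma_a100","rdmaHcaMax":63}]}'

echo "$config" | python3 -c 'import json, sys; print(len(json.load(sys.stdin)["configList"]))'
# -> 2
```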
2. Attempt to configure GPUDirect Workloads (unsuccessful)
[!quote] Network Operator Deployment for GPUDirect Workloads
This section is mainly a record of the failed attempts during the process. If you are more interested in how to verify the RDMA Shared Device Plugin later, you can directly jump to the next section.
While configuring the RDMA Shared Device Plugin (Method 1 for short), I ran into some other issues that led me to mistakenly believe Method 1 was a dead end. In the discussion area of the k8s-rdma-shared-dev-plugin project, someone also said the following3 (there were counterexamples further down, but I hadn't gotten it working at the time and assumed the advice was outdated):
[!quote] Adrian Chiris
We should improve the projects README.
the general way to use it with k8s is utilizing secondary network CNI such as macvlan or ipoib (or any CNI essentially can create virtual interfaces on top of existing RDMA capable parent netdev)
we should update instructions and examples.
So I read the documentation again and found a section called [GPUDirect Workloads] (inner monologue: so the other installation methods are not for GPU workloads?)
Compared to Method 1, this method requires installing the DOCA driver, the SR-IOV Device Plugin, and a secondary network stack: Multus CNI, container networking plugins, and an IPAM plugin, where Multus is a secondary-network CNI for Kubernetes4.
[!quote]
- Multus is a CNI (container network interface) plugin that allows multiple network interfaces to be attached to a Kubernetes Pod, enabling more flexible networking. It composes well with other CNI plugins such as Flannel, Calico, and Macvlan. When a Pod needs to connect to several different networks, Multus provides it with one interface per network.
- Whereabouts is an IP address management (IPAM) tool that automatically assigns IP addresses to Pods while avoiding conflicts. Traditionally, one might manually carve out a distinct IP range per host to prevent conflicts; Whereabouts automates this assignment, ensuring each Pod gets a unique address even in large clusters.
During deployment, first install the Nic Cluster Policy:
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
name: nic-cluster-policy
spec:
ofedDriver:
image: doca-driver
repository: nvcr.io/nvidia/mellanox
version: 25.01-0.6.0.0-0
forcePrecompiled: false
imagePullSecrets: []
terminationGracePeriodSeconds: 300
startupProbe:
initialDelaySeconds: 10
periodSeconds: 20
livenessProbe:
initialDelaySeconds: 30
periodSeconds: 30
readinessProbe:
initialDelaySeconds: 10
periodSeconds: 30
upgradePolicy:
autoUpgrade: true
maxParallelUpgrades: 1
safeLoad: false
drain:
enable: true
force: true
podSelector: ""
timeoutSeconds: 300
deleteEmptyDir: true
sriovDevicePlugin:
image: sriov-network-device-plugin
repository: ghcr.io/k8snetworkplumbingwg
version: v3.9.0
imagePullSecrets: []
config: |
{
"resourceList": [
{
"resourcePrefix": "nvidia.com",
"resourceName": "hostdev",
"selectors": {
"vendors": ["15b3"],
"devices": [],
"drivers": [],
"pfNames": [],
"pciAddresses": [],
"rootDevices": [],
"linkTypes": [],
"isRdma": true
}
}
]
}
secondaryNetwork:
cniPlugins:
image: plugins
repository: ghcr.io/k8snetworkplumbingwg
version: v1.5.0
imagePullSecrets: []
multus:
image: multus-cni
repository: ghcr.io/k8snetworkplumbingwg
version: v4.1.0
imagePullSecrets: []
ipamPlugin:
image: whereabouts
repository: ghcr.io/k8snetworkplumbingwg
version: v0.7.0
imagePullSecrets: []
Next, we must give Whereabouts a pool of assignable IPs that does not overlap with the addresses already in use on the current Layer 2 network (somewhat similar to what MetalLB requires). I therefore scanned for and picked an unused small IP segment.
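For sizing the pool: a /27 spans 2^(32-27) = 32 addresses, so after the one excluded /32 roughly 31 remain assignable. A one-line sanity check:

```shell
# Number of addresses covered by a given prefix length (here /27).
prefix=27
echo $((1 << (32 - prefix)))
# -> 32
```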
apiVersion: mellanox.com/v1alpha1
kind: HostDeviceNetwork
metadata:
name: hostdevice-net
spec:
networkNamespace: "crater-workspace" # Namespace where workloads are located
resourceName: "hostdev"
ipam: |
{
"type": "whereabouts",
"datastore": "kubernetes",
"kubernetes": {
"kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
},
"range": "192.168.x.152/27",
"exclude": ["192.168.x.151/32"],
"log_file": "/var/log/whereabouts.log",
"log_level": "info"
}
After successful installation, the nodes expose resources of type nvidia.com/hostdev:
$ kubectl get nodes -o json | jq -r '.items[] | {
name: .metadata.name,
"nvidia.com/hostdev": .status.capacity["nvidia.com/hostdev"]
} | select(.["nvidia.com/hostdev"] != null)'
# Omit the same results
{
"name": "xxx-v100-xx",
"nvidia.com/hostdev": "2"
}
{
"name": "xxx-a100-xx",
"nvidia.com/hostdev": "4"
}
To use this special network, we also need to add annotations when submitting the Pod:
apiVersion: v1
kind: Pod
metadata:
name: testpod1
namespace: crater-workspace # The namespace specified earlier
annotations:
k8s.v1.cni.cncf.io/networks: hostdevice-net
spec:
containers:
- name: appcntr1
image: <image>
imagePullPolicy: IfNotPresent
securityContext:
capabilities:
add: ["IPC_LOCK"] # This is required
command:
- sh
- -c
- sleep inf # The official documentation writes it this way, so how should I test?
resources:
requests:
nvidia.com/hostdev: "1"
nvidia.com/gpu: "1"
limits:
nvidia.com/hostdev: "1"
nvidia.com/gpu: "1"
After entering the Pod and running ifconfig, we find that a new interface named net1 has appeared. But what next? The Network Operator repository does provide test manifests5, but their commands are also just sleep inf.
My guess is that NCCL needs to be told which interface to use, among other things. Since the RDMA Shared Device Plugin approach later worked, I did not explore this path further; raising the question upstream might be a good next step.
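For the record, my untested guess at the NCCL configuration looked roughly like this; all three values are assumptions (net1 being the Multus-attached interface, and both HCAs being visible in the container), not something I verified:

```shell
# Hypothetical NCCL settings for the secondary network; values are assumptions.
export NCCL_SOCKET_IFNAME=net1    # assumed: the Multus-attached interface in the Pod
export NCCL_IB_HCA=mlx5_0,mlx5_1  # assumed: the HCAs visible inside the container
export NCCL_DEBUG=INFO            # verbose logs to confirm which transport NCCL picks
```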
To clean up stale resources, you can start kubectl proxy in one terminal:
$ kubectl proxy
Starting to serve on 127.0.0.1:8001
Then, in another terminal, run the cleanup script (note that / in the resource name must be escaped as ~1):
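Before the script itself, a note on the ~1: JSON Patch paths are JSON Pointers (RFC 6901), where / inside a key is written ~1 and ~ is written ~0 (escaping ~ first). A quick sketch of the escaping:

```shell
# Escape a key for use in a JSON Pointer path: '~' -> '~0' first, then '/' -> '~1'.
escape_pointer() { printf '%s' "$1" | sed -e 's/~/~0/g' -e 's|/|~1|g'; }

escape_pointer 'nvidia.com/hostdev'; echo
# -> nvidia.com~1hostdev
```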
#!/bin/bash
# Check if at least one node name is provided
if [ "$#" -lt 1 ]; then
echo "Usage: $0 <node-name> [<node-name>...]"
exit 1
fi
# Prepare the JSON patch data
PATCH_DATA=$(cat <<EOF
[
{"op": "remove", "path": "/status/capacity/nvidia.com~1hostdev"}
]
EOF
)
# Iterate over each node name provided as an argument
for NODE_NAME in "$@"
do
# Execute the PATCH request
curl --header "Content-Type: application/json-patch+json" \
--request PATCH \
--data "$PATCH_DATA" \
http://127.0.0.1:8001/api/v1/nodes/$NODE_NAME/status
echo "Patch request sent for node $NODE_NAME"
done
Pass in the node names to clean up:
chmod +x ./patch_node_gpu.sh
./patch_node_gpu.sh node1 node2
Verifying RDMA Installation
In this section, we will introduce how to continue verifying the RDMA installation based on the RDMA Shared Device Plugin method.
1. Preparing an RDMA-Supporting Image
[!quote] Verify if the image supports RDMA--Machine Learning Platform - Volcano Engine
A simple Dockerfile suitable for the V100 machine may look like this:
FROM xxx/envd:py3.12-ubuntu22.04-8978
USER root
# Install APT packages
RUN apt-get update && apt-get install -y \
infiniband-diags perftest ibverbs-providers libibumad3 \
libibverbs1 libnl-3-200 libnl-route-3-200 librdmacm1 && \
rm -rf /var/lib/apt/lists/*
# No Python dependencies specified
Here, my base image already includes common debugging tools, Python, and the CUDA environment; the Dockerfile mainly uses APT to add the InfiniBand-related libraries.
After installing these libraries, if we start a Pod without requesting RDMA resources, ibstat still prints normally, but attempting a write test reports that no InfiniBand or RoCE device is present.
2. Verification Method on a Single Machine
First, we need to start a Pod that requests RDMA resources:
apiVersion: v1
kind: Pod
metadata:
name: rdma-test-pod-1
spec:
containers:
- image: <image>
name: rdma-test-ctr
securityContext:
capabilities:
add: [ "IPC_LOCK" ]
resources:
limits:
nvidia.com/v100: "4"
rdma/rdma_v100: "1"
requests:
nvidia.com/v100: "4"
rdma/rdma_v100: "1"
command:
- sh
- -c
- |
sleep infinity
For regular GPU resources, we have renamed them by model (hence nvidia.com/v100); related information can be found in previous articles.
After the container starts successfully, enter the container:
- Enter the following command:
ib_write_bw -d mlx5_1 &
Sample output:
$ ib_write_bw -d mlx5_1 &
[1] 2457716
root@xxx-01:~#
************************************
* Waiting for client to connect... *
************************************
- Enter the following command on the same machine:
ib_write_bw -d mlx5_1 127.0.0.1 --report_gbits
Sample output:
$ ib_write_bw -d mlx5_1 127.0.0.1 --report_gbits
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_1
---------------------------------------------------------------------------------------
Number of qps : 1 Transport type : IB
RDMA_Write BW Test
Connection type : RC Using SRQ : OFF
Dual-port : OFF Device : mlx5_1
PCIe relax order: ON
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
ibv_wr* API : ON
TX depth : 128
CQ Moderation : 1
CQ Moderation : 1
Mtu : 4096[B]
Mtu : 4096[B]
Link type : IB
Link type : IB
Max inline data : 0[B]
Max inline data : 0[B]
rdma_cm QPs : OFF
rdma_cm QPs : OFF
Data ex. method : Ethernet
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------
local address: LID 0xXX QPN 0xXXXX PSN 0xXXXXXX RKey 0xXXXXXX VAddr 0xXXXXXXXXXXXX
local address: LID 0xXX QPN 0xXXXX PSN 0xXXXXXX RKey 0xXXXXXX VAddr 0xXXXXXXXXXXXX
remote address: LID 0xXX QPN 0xXXXX PSN 0xXXXXXX RKey 0xXXXXXX VAddr 0xXXXXXXXXXXXX
remote address: LID 0xXX QPN 0xXXXX PSN 0xXXXXXX RKey 0xXXXXXX VAddr 0xXXXXXXXXXXXX
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
Conflicting CPU frequency values detected: 1000.000000 != 3013.932000. CPU Frequency is not max.
65536 5000 94.72 94.71 0.180640
---------------------------------------------------------------------------------------
65536 5000 94.72 94.71 0.180640
---------------------------------------------------------------------------------------
[1]+ Done ib_write_bw -d mlx5_1
For V100 machines, the bandwidth values (BW peak, BW average) should be close to 100 Gb/s; for A100 machines, close to 200 Gb/s. If they are, the configuration is correct. If there is no output or an error occurs, go back to the configuration section for your machine model and check for missing items.
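To make the pass/fail judgment mechanical, one can compare the client's reported average against the link rate; the 90% threshold below is my own rule of thumb, not an official criterion:

```shell
# PASS if the measured bandwidth ($1, Gb/s) reaches 90% of the link rate ($2, Gb/s).
check_bw() {
  awk -v bw="$1" -v rate="$2" 'BEGIN { print ((bw >= 0.9 * rate) ? "PASS" : "FAIL") }'
}

check_bw 94.71 100   # V100 link (100 Gb/s)
check_bw 94.71 200   # the same figure on a 200 Gb/s A100 link
```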
3. Verification Method on Multiple Machines
Similar to the previous section, create two Pods, record the Kubernetes-internal IP address of one of them, and then run:
# server cmd
ib_write_bw -a -F --report_gbits -q 2
# client cmd
ib_write_bw -a -F --report_gbits -q 2 <server-pod-default-network-IP>
If the bandwidth here is likewise close to 100 Gb/s, inter-node communication is working.
4. vLLM Multi-Machine Distributed Inference Practice
Finally, we tested vLLM multi-node distributed inference of the DeepSeek R1 Distill Qwen 32B model via a Volcano Job. The model is mounted through a PVC, and the image is built with envd. Since vLLM installs its own CUDA 12.4 runtime, the base image does not need to include CUDA.
# syntax=v1
def build():
base(image="ubuntu:22.04",dev=True)
install.python(version="3.12")
install.apt_packages([
"openssh-server", "build-essential", "iputils-ping", "net-tools", "htop",
"infiniband-diags", "perftest", "ibverbs-providers", "libibumad3",
"libibverbs1", "libnl-3-200", "libnl-route-3-200", "librdmacm1"
])
config.pip_index(url = "https://pypi.tuna.tsinghua.edu.cn/simple")
install.python_packages(name = ["vllm"])
config.jupyter()
Afterwards, we started the Volcano Job:
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
name: vllm-rdma-test
namespace: crater-workspace
spec:
maxRetry: 3
minAvailable: 2
plugins:
pytorch:
- --master=master
- --worker=worker
- --port=23456
svc: []
policies:
- action: RestartJob
event: PodEvicted
queue: default
schedulerName: volcano
tasks:
- maxRetry: 3
minAvailable: 1
name: master
policies:
- action: CompleteJob
event: TaskCompleted
- action: TerminateJob
event: PodFailed
replicas: 1
template:
spec:
containers:
- command:
- sh
- -c
- |-
ray start --head --port=6667 --disable-usage-stats;
NCCL_DEBUG=TRACE python3 -m vllm.entrypoints.openai.api_server \
--model=/models/DeepSeek-R1-Distill-Qwen-32B \
--max-model-len 32768 \
--tensor-parallel-size 4 \
--pipeline-parallel-size 2 \
--gpu-memory-utilization 0.90 \
--max-num-seqs 128 \
--trust-remote-code \
--disable-custom-all-reduce \
--port 6666 \
--dtype=half;
image: xxx/envd-vllm:0.8.3-cu12.4-rdma-v1
name: master
resources:
limits:
nvidia.com/v100: "4"
rdma/rdma_v100: "1"
requests:
nvidia.com/v100: "4"
rdma/rdma_v100: "1"
securityContext:
capabilities:
add:
- IPC_LOCK
runAsGroup: 0
runAsUser: 0
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /dev/shm
name: crater-cache
- mountPath: /models/DeepSeek-R1-Distill-Qwen-32B
name: crater-ro-storage
readOnly: true
subPath: LLM/deepseek/DeepSeek-R1-Distill-Qwen-32B
workingDir: /models
restartPolicy: Never
volumes:
- emptyDir:
medium: Memory
name: crater-cache
- name: crater-ro-storage
persistentVolumeClaim:
claimName: crater-ro-storage
- maxRetry: 3
minAvailable: 1
name: worker
replicas: 1
template:
spec:
containers:
- command:
- sh
- -c
- |-
ray start --address="$MASTER_ADDR:6667";
sleep infinity;
image: xxx/envd-vllm:0.8.3-cu12.4-rdma-v1
name: worker
resources:
limits:
nvidia.com/v100: "4"
rdma/rdma_v100: "1"
requests:
nvidia.com/v100: "4"
rdma/rdma_v100: "1"
securityContext:
capabilities:
add:
- IPC_LOCK
runAsGroup: 0
runAsUser: 0
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /dev/shm
name: crater-cache
- mountPath: /models/DeepSeek-R1-Distill-Qwen-32B
name: crater-ro-storage
readOnly: true
subPath: LLM/deepseek/DeepSeek-R1-Distill-Qwen-32B
workingDir: /models
restartPolicy: OnFailure
volumes:
- emptyDir:
medium: Memory
name: crater-cache
- name: crater-ro-storage
persistentVolumeClaim:
claimName: crater-ro-storage
ttlSecondsAfterFinished: 259200
Following the vLLM documentation on distributed inference6, we enabled NCCL_DEBUG=TRACE; the logs show that NCCL used IB rather than Socket connections.
Meanwhile, Kubernetes network monitoring showed no inter-node traffic on the Pod network during inference, confirming that the traffic went over InfiniBand and that our deployment was successful.
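To pull the relevant evidence out of a long vLLM log, grepping for NCCL's transport lines is enough; the sample line below is illustrative of the NCCL_DEBUG output format, not copied from our run:

```shell
# Keep only the NCCL network-transport lines from a log stream.
nccl_net_lines() { grep -E 'NCCL INFO NET/(IB|Socket)'; }

printf '%s\n' \
  'worker:42:42 [0] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB' \
  'worker:42:42 [0] NCCL INFO Bootstrap : Using eth0:10.244.0.5' |
  nccl_net_lines
# -> worker:42:42 [0] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB
```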
Lifting the Memory Lock Limit
0. Problem Description
The errors below are all caused by the memory lock (memlock) limit.
The following errors occur during connectivity testing:
[host1] $ ib_read_bw -q 30
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
RDMA_Read BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 30 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
CQ Moderation : 1
Mtu : 4096[B]
Link type : IB
Outstand reads : 16
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
ethernet_read_keys: Couldn't read remote address
Unable to read to socket/rdma_cm
Failed to exchange data between server and clients
[host2] $ ib_read_bw -q 30 10.244.46.50
Couldn't allocate MR
failed to create mr
Failed to create MR
Couldn't create IB resources
When running RDMA bandwidth tests, small transfers (e.g., 1024 bytes) work, but transfers of 1M and above fail with messages like the above; see Issue #339.
Checking with ulimit -l directly gives 64 (KB), the system's default memory lock limit.
At this point, running ulimit -l unlimited inside the container or editing /etc/security/limits.conf does not change the limit.
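Interpreting ulimit -l: the number is in kilobytes, or the string "unlimited". A small helper for readability (the 64 KB case is exactly the default that breaks large memory registrations):

```shell
# Classify a `ulimit -l` value (kilobytes, or "unlimited").
memlock_status() {
  case "$1" in
    unlimited) echo "ok: memory locking is unrestricted" ;;
    *)         echo "restricted: at most ${1} KB may be locked" ;;
  esac
}

memlock_status unlimited   # -> ok: memory locking is unrestricted
memlock_status 64          # -> restricted: at most 64 KB may be locked
```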
1. Problem Analysis
The core of RDMA is letting the hardware (the HCA on the NIC) bypass the CPU and access remote memory directly. In normal operation, the OS may page out or migrate memory at any time to optimize usage, changing the physical address behind a virtual address. If the kernel pages out or moves a page while the NIC is mid-transfer, the transfer fails or the system errors out.
So, to ensure the address is absolutely stable, RDMA must perform "Memory Registration" (MR) before transmission. This action "locks" the specified virtual memory pages into physical memory at the kernel level, prohibiting the kernel from moving or swapping them to disk.
Crater already grants jobs CAP_IPC_LOCK so they can lock memory, but it never raised the cap on how much memory may be locked. And because the container is not given CAP_SYS_RESOURCE, the limit cannot be raised from inside the container either.
Currently, Kubernetes officially doesn't provide a ulimit setting in the Pod Spec, see Issue #3595. Therefore, it's necessary to handle it from the container runtime perspective.
Additionally, it's worth looking forward to the fact that the Kubernetes community has officially started a discussion on native support for Pod-level ulimit configuration in the v1.36 cycle, see KEP-5758.
The dockerd runtime provides the corresponding configuration item default-ulimits, which can easily configure this limit at the node level, see configuration file description and resource limit guide.
However, containerd is currently used in the cluster, and it doesn't provide a configuration item similar to dockerd, see configuration file description. Meanwhile, the developer explicitly refused to provide daemon-level configuration in Issue #3150. Therefore, another way is needed to solve this problem.
Additionally, after experimenting: directly modifying /etc/security/limits.conf on the node, or only setting LimitMEMLOCK=infinity for the container runtime daemon, also fails to lift the limit inside the Pod. The suspected root cause is that the container executor runc strictly follows the OCI spec "blueprint" and issues a setrlimit system call for the container at startup; this forcibly overrides (usually lowers) the container's limit to the blueprint's default (e.g., 64 KB), ignoring the daemon's own limits.
2. Solution
The core of this solution is customizing the container's base OCI runtime spec. In containerd, the JSON file referenced by the base_runtime_spec parameter is the container's base spec template, and it is the final authority on container resource boundaries such as the memlock limit (see the OCI runtime configuration specification). The underlying OCI runtime (e.g., runc) initializes the container process's resource limits via the setrlimit system call strictly according to this blueprint.
Exporting and Modifying the Existing Configuration
Since containerd does not merge this parameter with the built-in defaults (Merge) but replaces them wholesale (Replace), we must first export a template containing the system's full current definitions and then adjust it. Run the following on the node:
ctr oci spec > /etc/containerd/rdma-spec.json
vim /etc/containerd/rdma-spec.json
Modify the following part of the configuration file (append RLIMIT_MEMLOCK to the rlimits array):
"rlimits": [
{
"type": "RLIMIT_NOFILE",
"hard": 1024,
"soft": 1024
},
{
"type": "RLIMIT_MEMLOCK",
"hard": 18446744073709551615,
"soft": 18446744073709551615
}
],Referencing the Modified Configuration
Modify the containerd configuration.
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
# Modify to point to the absolute path of the above JSON template
base_runtime_spec = "/etc/containerd/rdma-spec.json"
cni_conf_dir = ""
cni_max_conf_num = 0
container_annotations = []
pod_annotations = []
privileged_without_host_devices = false
privileged_without_host_devices_all_devices_allowed = false
runtime_engine = ""
runtime_path = ""
runtime_root = ""
runtime_type = "io.containerd.runc.v2"
sandbox_mode = "podsandbox"
snapshotter = ""
Adapt this to your environment; the goal is for the corresponding Pods to actually pick up the new configuration.
Restarting the service after modifying the container runtime configuration does not affect Pods that are already running. You must delete the old Pods so that Kubernetes reschedules them and containerd initializes the new containers from the new OCI blueprint; only then does the configuration truly take effect.
Lifting containerd's Own Limit (Optional)
EDITOR=vim systemctl edit containerd
Add the following content to the daemon configuration.
[Service]
LimitMEMLOCK=infinity
Then restart the service with systemctl restart containerd.
Experiments show that modifying the OCI blueprint alone is sufficient (containerd itself holds the CAP_SYS_RESOURCE capability), but lifting the daemon's own limit as well makes the daemon more robust and is recommended as a best practice in production environments.
Testing
By now the memory lock limit inside the Pod should be lifted. Running the RDMA bandwidth test again, we see the following output on the master and worker respectively.
Master:
root@pyt-liuyizhou-260206-2a3fe-master-0:/# ib_write_bw -s 1M -d mlx5_0 -F
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 128
CQ Moderation : 1
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x2c QPN 0x0087 PSN 0xcc6a4c RKey 0x1fcc bd VAddr 0x037f67d3b5030
remote address: LID 0x2c QPN 0x0088 PSN 0x6dc93a RKey 0x1fcf be VAddr 0x037f026db4030
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
1048576 5000 11133.60 11129.09 0.011123
---------------------------------------------------------------------------------------
Worker:
root@pyt-liuyizhou-260206-2a3fe-worker-0:/# ib_write_bw -s 1M -d mlx5_0 -F 10.244.44.139
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 128
CQ Moderation : 1
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x2c QPN 0x0088 PSN 0x6dc93a RKey 0x1fcf be VAddr 0x037f026db4030
remote address: LID 0x2c QPN 0x0087 PSN 0xcc6a4c RKey 0x1fcc bd VAddr 0x037f67d3b5030
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
1048576 5000 11133.60 11129.09 0.011123
---------------------------------------------------------------------------------------
root@pyt-liuyizhou-260206-2a3fe-worker-0:/# ulimit -l
unlimited
As can be seen, the memory lock limit inside the Pod has been lifted, and high-bandwidth RDMA communication works normally.
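As a sanity check, perftest's MB/sec figures can be converted to line-rate units (perftest counts MB as 2^20 bytes); a quick Python sketch using the average reported above:

```python
MB = 2**20  # perftest reports bandwidth in units of 2^20 bytes per second

def mbps_to_gbps(mb_per_sec: float) -> float:
    """Convert a perftest MB/sec figure to Gb/sec (10^9 bits per second)."""
    return mb_per_sec * MB * 8 / 1e9

avg = 11129.09  # BW average[MB/sec] from the test output above
print(f"{avg} MB/sec ≈ {mbps_to_gbps(avg):.1f} Gb/sec")  # ≈ 93.4 Gb/sec
```

About 93.4 Gb/s is consistent with a link running at 100 Gb/s; compare against the Rate that ibstat reports in your own environment.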
Note that if the node kernel or related components are updated, it may be necessary to lift the memory lock limit again.
Making RDMA Tooling Pods Tolerate Crater Taints
0. Problem Description
After the node kernel upgrade, it was observed that RDMA jobs could not be scheduled to the node, but non-RDMA jobs could.
The reason is that the node did not report RDMA resources, even though the relevant network cards and drivers were working normally.
1. Problem Analysis
Resource Discovery Flow and Analysis of Blocking Points
In normal circumstances, the NVIDIA Network Operator automatically manages the node's RDMA resources through the following flow. This flow is re-triggered when the kernel is updated:
- Status Monitoring and Change Locking (Network Operator): The Operator monitors the host's kernel version in real time. Once a kernel change is detected, it immediately sets the node's network.nvidia.com/operator.mofed.wait label to true. This action "locks" the resource discovery path to prevent allocating incorrect resources before driver compatibility is confirmed.
- Driver Environment Verification and Reporting (MOFED Driver Pod): The Operator attempts to schedule and run the mofed-driver Pod on each node. This Pod checks, installs, or updates the host's Mellanox driver. Only after it runs successfully on the node and returns a readiness signal does the Operator set the mofed.wait label back to false.
  - Core Blocking Point: If the node carries custom business taints (e.g., crater.raids.io/account=exclusive) or is in maintenance mode (Cordoned, with the node.kubernetes.io/unschedulable taint), and the NicClusterPolicy has no corresponding Tolerations configured, the driver Pod is rejected by the scheduler. The flow gets stuck at this step, the node label stays permanently at true, and resource discovery remains locked.
- Plugin Activation and Resource Exposure (RDMA Shared Device Plugin Pod): Once mofed.wait becomes false, the previously gated RDMA Shared Device Plugin Pod runs its detection logic, identifies the physical network cards, and reports resources such as rdma/rdma_v100 to the Kubelet.
Driver Self-healing Mechanism (DKMS)
In this case, although the K8s control plane's resource discovery flow was blocked due to scheduling permission issues, the host's physical data plane was actually normal. This is because:
- DKMS (Dynamic Kernel Module Support): Host-level DKMS is configured. During the first boot after the kernel upgrade, the system automatically completed the reconstruction and loading of kernel modules in the background.
- Perception Gap: a mismatch arose between perception and reality: the physical driver was ready, but the Operator incorrectly believed it was invalid because the verification Pod could not be scheduled onto the node.
2. Solution
The key to solving this problem is to supplement the necessary toleration configurations for the NicClusterPolicy.
Configuration Location and Verification
Note that different versions of the NVIDIA Network Operator have different field support for NicClusterPolicy. Before attempting to modify, be sure to verify the configuration structure.
For example, in the v1alpha1 version, tolerations are usually configured globally under spec, rather than inside sub-components like ofedDriver or rdmaSharedDevicePlugin. You can confirm this with the following commands:
# Check if sub-components support tolerations (if not, it will prompt 'unknown field')
kubectl explain nicclusterpolicy.spec.ofedDriver
# Check if the spec level supports global tolerations
kubectl explain nicclusterpolicy.spec
Implementing a Precise Patch
After determining the path, it is recommended to use kubectl patch for "surgical" modification, avoiding the introduction of system metadata or status field conflicts from directly applying an exported YAML file.
Note: Patch operations are persistent modifications applied directly to the cluster instance. If your cluster is deployed via Helm and has resource auto-sync enabled (e.g., ArgoCD), a direct Patch may later be overwritten by the higher-level tool. In that case, also update the Helm values.yaml accordingly.
Before formal execution, be sure to add --dry-run=server for a server-side dry-run test to verify the correctness of the field path and syntax:
kubectl patch nicclusterpolicy nic-cluster-policy --type='merge' -p '{
"spec": {
"tolerations": [
{
"key": "crater.raids.io/account",
"operator": "Exists",
"effect": "NoSchedule"
},
{
"key": "node.kubernetes.io/unschedulable",
"operator": "Exists",
"effect": "NoSchedule"
}
]
}
}' --dry-run=server
If the dry-run output shows patched and there are no errors, remove --dry-run=server to apply the change for real.
Verification Steps
- Observe Pod Status: Run kubectl get pods -n nvidia-network-operator -w to confirm that the mofed-driver and rdma-shared-dp Pods start successfully on the node.
- Check Node Status: Run kubectl get node [node-name] -L network.nvidia.com/operator.mofed.wait to confirm that the label changes back to false.
- Confirm Resource Reporting: Run kubectl describe node [node-name]. You should see rdma/rdma_v100 again in the Allocatable list.
Problem Records
1. Segmentation fault error when starting vLLM
From the logs, the IB device has been successfully recognized, but a segmentation fault occurred.
[device-name]-master-0:528:528 [0] NCCL INFO Channel 00/02 : 0 1 2 3 4 5 6 7
[device-name]-master-0:528:528 [0] NCCL INFO Channel 01/02 : 0 1 2 3 4 5 6 7
[device-name]-master-0:528:528 [0] NCCL INFO Trees [0] 1/4/-1->0->-1 [1] 1/-1/-1->0->4
self.device_communicator = device_comm_cls(
[device-name]-master-0:528:528 [0] NCCL INFO P2P Chunksize set to 131072
[device-name]-master-0:528:528 [0] NCCL INFO Channel 00/0 : 7[3] -> 0[0] [receive] via NET/IB/0
[device-name]-master-0:528:528 [0] NCCL INFO Channel 01/0 : 7[3] -> 0[0] [receive] via NET/IB/0
[device-name]-master-0:528:528 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC
[device-name]-master-0:528:528 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/IPC
[device-name]-master-0:528:5379 [0] misc/socket.cc:50 NCCL WARN socketProgress: Connection closed by remote peer [device-name]-worker-0.[hostname].svc.cluster.local<35396>
[device-name]-master-0:528:5379 [0] NCCL INFO misc/socket.cc:752 -> 6
[device-name]-master-0:528:5379 [0] NCCL INFO transport/net_ib.cc:1207 -> 6
[device-name]-master-0:528:5379 [0] NCCL INFO transport/net.cc:837 -> 6
[device-name]-master-0:528:528 [0] NCCL INFO transport/net.cc:405 -> 6
[device-name]-master-0:528:528 [0] NCCL INFO transport.cc:183 -> 6
^^^^^^^^^^^^^^^^
[device-name]-master-0:528:528 [0] NCCL INFO init.cc:1263 -> 6
[device-name]-master-0:528:528 [0] NCCL INFO init.cc:1548 -> 6
[device-name]-master-0:528:528 [0] NCCL INFO init.cc:1799 -> 6
File "/opt/conda/envs/envd/lib/python3.12/site-packages/vllm/distributed/device_communicators/cuda_communicator.py", line 39, in __init__
[device-name]-master-0:528:5379 [0] misc/socket.cc:50 NCCL WARN socketProgress: Connection closed by remote peer [device-name]-worker-0.[hostname].svc.cluster.local<52144>
self.pynccl_comm = PyNcclCommunicator(
^^^^^^^^^^^^^^^^^^^
[device-name]-master-0:528:5379 [0] NCCL INFO misc/socket.cc:752 -> 6
^^^^^^^^^^^^^^^^^^^^^^^^^^^
[device-name]-master-0:528:5379 [0] NCCL INFO transport/net_ib.cc:1207 -> 6
File "/opt/conda/envs/envd/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl.py", line 99, in __init__
self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
File "/opt/conda/envs/envd/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 277, in ncclCommInitRank
self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
File "/opt/conda/envs/envd/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 256, in NCCL_CHECK
raise RuntimeError(f"NCCL error: {error_str}")
RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
[device-name]-master-0:528:5379 [0] NCCL INFO transport/net.cc:837 -> 6
[device-name]-master-0:528:528 [0] NCCL INFO init.cc:1837 -> 6
*** SIGSEGV received at time=1745072123 on cpu 70 ***
PC: @ 0x7ff94e269506 (unknown) ncclProxyServiceUDS()
@ 0x7ffa0c242520 3384 (unknown)
@ ... and at least 1 more frames
[2025-04-19 14:15:23,982 E 528 5383] logging.cc:484: *** SIGSEGV received at time=1745072123 on cpu 70 ***
[2025-04-19 14:15:23,982 E 528 5383] logging.cc:484: PC: @ 0x7ff94e269506 (unknown) ncclProxyServiceUDS()
[2025-04-19 14:15:23,983 E 528 5383] logging.cc:484: @ 0x7ffa0c242520 3384 (unknown)
[2025-04-19 14:15:23,983 E 528 5383] logging.cc:484: @ ... and at least 1 more frames
Fatal Python error: Segmentation fault
Remember that we mentioned earlier the need to add IPC_LOCK to the Pod's security context? If it is missing, the above problem occurs.
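For reference, a minimal sketch of the relevant part of the Pod spec (the container name and image are illustrative):

```yaml
# Pod spec fragment: grant IPC_LOCK so the process may pin (lock) memory,
# which the RDMA stack needs for registered memory regions.
containers:
  - name: rdma-workload            # illustrative name
    image: your-vllm-image:latest  # illustrative image
    securityContext:
      capabilities:
        add:
          - IPC_LOCK
```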
2. Multi-machine inference failure on the A100 machines
First, run single-machine verification on the A100 machines. When using a network card whose port is Up, everything looks fine:
$ ib_write_bw -d mlx5_1 &
[1] 1501
************************************
* Waiting for client to connect... *
************************************
$ ib_write_bw -d mlx5_1 127.0.0.1 --report_gbits
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_1
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
CQ Moderation : 1
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_1
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 128
CQ Moderation : 1
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
Conflicting CPU frequency values detected: 863.109000 != 2300.000000. CPU Frequency is not max.
65536 5000 183.66 183.61 0.350214
---------------------------------------------------------------------------------------
65536 5000 183.66 183.61 0.350214
---------------------------------------------------------------------------------------
$ ib_write_bw -d mlx5_0 &
[1] 1618
Port number 1 state is Down
Couldn't set the link layer
Couldn't get context for the device
$ ib_write_bw -d mlx5_0 127.0.0.1 --report_gbits
Port number 1 state is Down
Couldn't set the link layer
Couldn't get context for the device
$ ibstat
CA 'mlx5_0'
CA type: MT4123
Port 1:
State: Down
Physical state: Disabled
Rate: 10
Base lid: 65535
LMC: 0
SM lid: 0
Link layer: InfiniBand
CA 'mlx5_1'
CA type: MT4123
Port 1:
State: Active
Physical state: LinkUp
Rate: 200
Base lid: 3
LMC: 0
SM lid: 1
Link layer: InfiniBand
CA 'mlx5_bond_0'
CA type: MT4117
Port 1:
State: Active
Physical state: LinkUp
Rate: 10
Link layer: Ethernet
$ ib_write_bw -d mlx5_bond_0 &
IB device mlx5_bond_0 not found
Unable to find the Infiniband/RoCE device
But running vLLM produced an error, and it later turned out that this problem is related to vLLM V1 and Ray, not to IB. I happened to have vLLM multi-machine distributed inference code at hand, so I used it for testing; in hindsight, running something like NCCL Tests would have been better to avoid extraneous interference.
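Incidentally, the single-port number above is itself a useful health signal: 183.61 Gb/s against the 200 Gb/s rate that ibstat reports for mlx5_1 is roughly 92% of line rate, so the Up port was clearly fine. A quick check:

```python
measured_gbps = 183.61   # BW average from ib_write_bw --report_gbits above
link_rate_gbps = 200.0   # "Rate: 200" reported by ibstat for mlx5_1

# Fraction of the nominal link rate actually achieved by the benchmark.
efficiency = measured_gbps / link_rate_gbps * 100
print(f"link utilization ≈ {efficiency:.1f}%")  # ≈ 91.8%
```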
Summary
The above records the process of getting RDMA working in our on-premises Kubernetes cluster. The scarcity of relevant documentation and the breadth of the problem domains involved are the main obstacles to learning in this area; a fairly solid systems foundation is indeed required.