RDMA Support
RDMA
Before diving in, here is some essential background on RDMA:
- RDMA: A network communication technology that bypasses the operating system kernel. Its core lies in directly accessing remote memory through the network card, avoiding the data copying and context switching overhead of the traditional TCP/IP protocol stack.
- NVIDIA GPUDirect1: Connects GPU memory directly to the network card's DMA engine. When the GPU needs to communicate with a remote node, data can be sent directly over InfiniBand or RoCE NICs without being staged through host memory.
- Network Virtualization: Macvlan and SR-IOV are two common network virtualization solutions. Macvlan lets containers create virtual interfaces that appear as independent devices on the physical network, while SR-IOV uses hardware virtualization to split a single physical function (PF) of a NIC into multiple virtual functions (VFs), each of which can be assigned directly to a Pod.
- Technical Path: Currently, RDMA mainly has two implementation methods: InfiniBand and RoCE2. InfiniBand natively supports the RDMA protocol, requiring a dedicated switch and subnet manager to build an independent network, which is costly; whereas RoCEv2 is based on traditional Ethernet infrastructure, using flow control mechanisms such as PFC and ECN to ensure lossless transmission, and is widely used by internet companies.
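As a quick practical check of which path a given node uses, ibstat reports the link layer per port: "InfiniBand" for native IB, "Ethernet" for RoCE. A minimal sketch (run against captured text here so it works anywhere; on a real node you would pipe `ibstat` into it):

```shell
# Count ports per link layer in ibstat-style output.
link_layers() { grep -c "Link layer: $1"; }

sample='Link layer: InfiniBand
Link layer: InfiniBand'

printf '%s\n' "$sample" | link_layers InfiniBand
# -> 2
```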
Our laboratory adopts the InfiniBand solution. Therefore, we first check the IB information of the relevant devices:
1. Testing InfiniBand-related information on a single node
First, we conduct tests on the host machine. Before moving to the cloud, the IB on these machines was functional:
$ ibdev2netdev
mlx5_0 port 1 ==> ibxxxxxx0 (Up)
mlx5_1 port 1 ==> ibxxxxxx1 (Up)
$ ibstat
CA 'mlx5_0'
Port 1:
Link layer: InfiniBand
CA 'mlx5_1'
Port 1:
Link layer: InfiniBand
- Up: Indicates that the InfiniBand port is active and has established a link to the network.
- Down: Indicates that the InfiniBand port is inactive or has no network link.
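To spot problem ports across many devices, filtering ibdev2netdev for anything not Up is handy; a small sketch against sample text (the interface names are placeholders):

```shell
# Print only ports that are not Up.
down_ports() { grep -v '(Up)'; }

sample='mlx5_0 port 1 ==> ib0 (Up)
mlx5_1 port 1 ==> ib1 (Down)'

printf '%s\n' "$sample" | down_ports
# -> mlx5_1 port 1 ==> ib1 (Down)
```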
2. Using Ansible to batch check node network cards
Group definition:
[ib-v100]
xx.xx.xx.[xx:xx]
[ib-a100]
xx.xx.xx.[xx:xx]
Writing a batch query script:
---
- name: Run ibdev2netdev on InfiniBand hosts
hosts: ib-v100,ib-a100
gather_facts: no
tasks:
- name: Execute ibdev2netdev command
ansible.builtin.command: ibdev2netdev
register: ibdev_output
changed_when: false
- name: Display ibdev2netdev output
ansible.builtin.debug:
var: ibdev_output.stdout_lines
Since the full output is long, I won't paste it all here. From the ibdev2netdev output, we can see that the two types of nodes in the cluster have different InfiniBand configurations:
V100 Nodes
mlx5_0 port 1 ==> ibxxxxxx0 (Up)
mlx5_1 port 1 ==> ibxxxxxx1 (Up)
Each of these nodes has a dual-port IB network card, with each port capable of up to 100 Gb/s, connected to two 36-port IB switches; the two switches are interconnected by four 100 Gb/s links.
- Each node has two independent InfiniBand ports (mlx5_0 and mlx5_1)
- Both ports are in the Up state.
A100 Nodes
mlx5_0 port 1 ==> ibxxxx0 (Down/Up)
mlx5_1 port 1 ==> ibxxxxx0 (Up/Down)
mlx5_bond_0 port 1 ==> bond0 (Up)
Each of these machines has two 200 Gb/s IB cards connected to an IB switch. However, not all of the cards are in service: on each node, only one IB card is actually cabled to the switch.
mlx5_bond_0 is an Ethernet bond interface; it shows up in the listing because it is also backed by Mellanox hardware.
Subsequently, when installing the RDMA device plugin in Kubernetes, we need the network interface information.
Installing Nvidia Network Operator
[!quote] Network Operator Deployment on Vanilla Kubernetes Cluster
Currently, the most recommended way to integrate RDMA into Kubernetes is through the Nvidia Network Operator. Following the official documentation, we first install the Operator itself with Helm; the concrete RDMA access method is then configured by deploying an additional CR.
First, add the Helm repository:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
Then, following the documentation, download values.yaml locally. The main things to check are whether NFD should be disabled and whether the images should be replaced with registry addresses reachable from your network.
Since our cluster has already deployed the Nvidia GPU Operator, we choose to turn off the NFD option.
[!warning] Since several parameters should be provided when creating custom resources during operator deployment, it is recommended to use a configuration file. While it is possible to override the parameters via CLI, we recommend to avoid the use of CLI arguments in favor of a configuration file.
helm show values nvidia/network-operator --version v25.1.0 > values.yaml
Then install the latest version (v25.1.0) of the Nvidia Network Operator:
helm upgrade --install network-operator nvidia/network-operator \
-n nvidia-network-operator \
--create-namespace \
--version v25.1.0 \
-f ./values.yaml \
--wait
After installation, the nvidia-network-operator namespace contains the Operator's Pod. At this point, RDMA is not yet configured; a concrete policy still has to be applied on top.
$ kubectl get pods -l app.kubernetes.io/name=network-operator
NAME READY STATUS RESTARTS AGE
network-operator-xxxxxxxx-xxxxx 1/1 Running 1 (22h ago) 26h
Setting NicClusterPolicy
For a beginner, the documentation here is rather opaque:
The Deployment Examples chapter lists nearly 20 deployment methods, which immediately raises questions:
- What are the performance differences among these deployment methods?
- How to choose a deployment method that suits your needs?
- After deployment, how to make the Pod access RDMA and other high-performance networks?
- What are the minimum requirements for running RDMA testing in containers?
- How to test RDMA networks in containers?
- What are the common errors and their solutions?
The documentation does not answer these questions, so my exploration was also very difficult. First, I will quickly summarize my current understanding of these questions and reference materials:
- Performance Differences: IPoIB (IP over InfiniBand) vs. RDMA performance; in addition, the Shared Device Plugin can reach full bandwidth when only one Pod requests the resource, though the multi-Pod case has not been tested yet.
- Deployment Method: We currently use the RDMA Shared Device Plugin, which runs normally on the V100 nodes. It is unclear whether this method can use bonded NICs, so we may switch to Host Network mode later.
- Resource Application: After installation, the nodes expose RDMA-related resources; in some cases, the secondary network to attach must also be declared in Pod annotations (e.g., via Multus or Macvlan).
- Minimum Requirements: Verify if the image supports RDMA--Machine Learning Platform - Volcano Engine
- How to Test: Prepare a cluster for running RDMA workloads and GPU-Direct RDMA workloads.
- Errors and Solutions: See the end of this article
1. Attempt to configure RDMA Shared Device Plugin
[!quote] Network Operator Deployment with Multiple Resources in RDMA Shared Device Plugin
Since my single cluster contains two different IB networks (V100 and A100), I use the Multiple Resources configuration described in the documentation, specifying the V100 and A100 ports and exposing the resources rdma/rdma_v100 and rdma/rdma_a100.
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
name: nic-cluster-policy
spec:
ofedDriver:
image: doca-driver
repository: nvcr.io/nvidia/mellanox
version: 25.01-0.6.0.0-0
forcePrecompiled: false
imagePullSecrets: []
terminationGracePeriodSeconds: 300
startupProbe:
initialDelaySeconds: 10
periodSeconds: 20
livenessProbe:
initialDelaySeconds: 30
periodSeconds: 30
readinessProbe:
initialDelaySeconds: 10
periodSeconds: 30
upgradePolicy:
autoUpgrade: true
maxParallelUpgrades: 1
safeLoad: false
drain:
enable: true
force: true
podSelector: ""
timeoutSeconds: 300
deleteEmptyDir: true
rdmaSharedDevicePlugin:
# [map[ifNames:[ens1f0 ens1f1] name:rdma_shared_device_a] map[ifNames:[ens2f0 ens2f1] name:rdma_shared_device_b]]
repository: ghcr.io/mellanox
image: k8s-rdma-shared-dev-plugin
version: v1.5.2
imagePullSecrets: []
# The config below directly propagates to k8s-rdma-shared-device-plugin configuration.
# Replace 'devices' with your (RDMA capable) netdevice name.
config: |
{
"configList": [
{
"resourceName": "rdma_v100",
"rdmaHcaMax": 63,
"selectors": {
"ifNames": ["ibxxxxxx0","ibxxxxxx1"],
"linkTypes": ["infiniband"]
}
},
{
"resourceName": "rdma_a100",
"rdmaHcaMax": 63,
"selectors": {
"ifNames": ["ibxxxx0","ibxxxxx0"],
"linkTypes": ["infiniband"]
}
}
]
}
After applying the policy, notice that the DaemonSets are started. Thanks to NFD, they are not scheduled onto nodes without IB cards (PCI vendor 15b3).
$ kg daemonset
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
mofed-ubuntu22.04-xxxxxxxxx-ds 36 36 36 36 36 feature.node.kubernetes.io/kernel-version.full=5.15.0-134-generic,feature.node.kubernetes.io/pci-15b3.present=true,feature.node.kubernetes.io/system-os_release.ID=ubuntu,feature.node.kubernetes.io/system-os_release.VERSION_ID=22.04 24h
rdma-shared-dp-ds 36 36 36 36 36 feature.node.kubernetes.io/pci-15b3.present=true,network.nvidia.com/operator.mofed.wait=false
The Nvidia Network Operator installation includes the OFED driver and the device plugin. The former runs privileged and affects the IB driver on the host; during my testing, this triggered a flood of errors on one A100 node's IB card, the error logs filled the system disk, and service was interrupted for several hours.
After all Pods are Running, verify whether new resources have been added to the nodes:
$ kubectl get nodes -o json | jq -r '.items[] | {
name: .metadata.name,
"rdma/rdma_v100": .status.capacity["rdma/rdma_v100"]
} | select(.["rdma/rdma_v100"] != null)'
# Omit the same results
{
"name": "xxx-v100-xx",
"rdma/rdma_v100": "63"
}
{
"name": "xxx-a100-xx",
"rdma/rdma_a100": "63"
}
At this point, installation via the RDMA Shared Device Plugin is complete. Some products on ByteDance's Volcano Engine appear to use this method.
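One pitfall worth flagging: the plugin's `config` field is JSON embedded inside a YAML string, so the YAML parser will not catch a typo in it. A quick validity check before applying is cheap (the inline JSON below is a trimmed stand-in for the real config; `jq -e '.configList | length'` works equally well):

```shell
# Validate the embedded device-plugin config as JSON and count resource entries.
config='{"configList":[{"resourceName":"rdma_v100","rdmaHcaMax":63},{"resourceName":"rdma_a100","rdmaHcaMax":63}]}'

echo "$config" | python3 -c 'import json, sys; print(len(json.load(sys.stdin)["configList"]))'
# -> 2
```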
2. Attempt to configure GPUDirect Workloads (unsuccessful)
[!quote] Network Operator Deployment for GPUDirect Workloads
This section is mainly a record of the failed attempts during the process. If you are more interested in how to verify the RDMA Shared Device Plugin later, you can directly jump to the next section.
While configuring the RDMA Shared Device Plugin (Method 1 for short), I ran into some other issues that led me to mistakenly believe Method 1 was a dead end. In the discussion area of the k8s-rdma-shared-dev-plugin project, someone also said the following3 (there were counterexamples further down, but I hadn't gotten it working at the time and assumed the advice was outdated):
[!quote] Adrian Chiris
We should improve the projects README.
the general way to use it with k8s is utilizing secondary network CNI such as macvlan or ipoib (or any CNI essentially can create virtual interfaces on top of existing RDMA capable parent netdev)
we should update instructions and examples.
So I read the documentation again and found a section called [GPUDirect Workloads] (inner monologue: so the other installation methods are not for GPU workloads?)
Compared to Method 1, this method requires installing the DOCA driver, the SR-IOV Device Plugin, and a secondary network stack: Multus CNI, container networking plugins, and an IPAM plugin, where Multus is a secondary-network CNI for Kubernetes4.
[!quote]
- Multus is a CNI (container network interface) plugin that allows multiple network interfaces to be attached to a Kubernetes Pod, enabling more flexible networking. It composes well with other CNI plugins such as Flannel, Calico, and Macvlan. When a Pod needs to connect to several different networks, Multus provides it with one interface per network.
- Whereabouts is an IP address management (IPAM) tool that automatically assigns IP addresses to Pods while avoiding conflicts. Traditionally, one might manually carve out a distinct IP range per host to prevent conflicts; Whereabouts automates this assignment, ensuring each Pod gets a unique address even in large clusters.
During deployment, first install the Nic Cluster Policy:
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
name: nic-cluster-policy
spec:
ofedDriver:
image: doca-driver
repository: nvcr.io/nvidia/mellanox
version: 25.01-0.6.0.0-0
forcePrecompiled: false
imagePullSecrets: []
terminationGracePeriodSeconds: 300
startupProbe:
initialDelaySeconds: 10
periodSeconds: 20
livenessProbe:
initialDelaySeconds: 30
periodSeconds: 30
readinessProbe:
initialDelaySeconds: 10
periodSeconds: 30
upgradePolicy:
autoUpgrade: true
maxParallelUpgrades: 1
safeLoad: false
drain:
enable: true
force: true
podSelector: ""
timeoutSeconds: 300
deleteEmptyDir: true
sriovDevicePlugin:
image: sriov-network-device-plugin
repository: ghcr.io/k8snetworkplumbingwg
version: v3.9.0
imagePullSecrets: []
config: |
{
"resourceList": [
{
"resourcePrefix": "nvidia.com",
"resourceName": "hostdev",
"selectors": {
"vendors": ["15b3"],
"devices": [],
"drivers": [],
"pfNames": [],
"pciAddresses": [],
"rootDevices": [],
"linkTypes": [],
"isRdma": true
}
}
]
}
secondaryNetwork:
cniPlugins:
image: plugins
repository: ghcr.io/k8snetworkplumbingwg
version: v1.5.0
imagePullSecrets: []
multus:
image: multus-cni
repository: ghcr.io/k8snetworkplumbingwg
version: v4.1.0
imagePullSecrets: []
ipamPlugin:
image: whereabouts
repository: ghcr.io/k8snetworkplumbingwg
version: v0.7.0
imagePullSecrets: []
Next, we must give Whereabouts a pool of assignable IPs that does not overlap with the addresses already in use on the current Layer 2 network (somewhat similar to what MetalLB requires). I therefore scanned for and picked an unused small IP segment.
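For sizing the pool: a /27 spans 2^(32-27) = 32 addresses, so after the one excluded /32 roughly 31 remain assignable. A one-line sanity check:

```shell
# Number of addresses covered by a given prefix length (here /27).
prefix=27
echo $((1 << (32 - prefix)))
# -> 32
```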
apiVersion: mellanox.com/v1alpha1
kind: HostDeviceNetwork
metadata:
name: hostdevice-net
spec:
networkNamespace: "crater-workspace" # Namespace where workloads are located
resourceName: "hostdev"
ipam: |
{
"type": "whereabouts",
"datastore": "kubernetes",
"kubernetes": {
"kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
},
"range": "192.168.x.152/27",
"exclude": ["192.168.x.151/32"],
"log_file": "/var/log/whereabouts.log",
"log_level": "info"
}
After successful installation, the nodes expose resources of type nvidia.com/hostdev:
$ kubectl get nodes -o json | jq -r '.items[] | {
name: .metadata.name,
"nvidia.com/hostdev": .status.capacity["nvidia.com/hostdev"]
} | select(.["nvidia.com/hostdev"] != null)'
# Omit the same results
{
"name": "xxx-v100-xx",
"nvidia.com/hostdev": "2"
}
{
"name": "xxx-a100-xx",
"nvidia.com/hostdev": "4"
}
To use this special network, we also need to add annotations when submitting the Pod:
apiVersion: v1
kind: Pod
metadata:
name: testpod1
namespace: crater-workspace # The namespace specified earlier
annotations:
k8s.v1.cni.cncf.io/networks: hostdevice-net
spec:
containers:
- name: appcntr1
image: <image>
imagePullPolicy: IfNotPresent
securityContext:
capabilities:
add: ["IPC_LOCK"] # This is required
command:
- sh
- -c
- sleep inf # The official documentation writes it this way, so how should I test?
resources:
requests:
nvidia.com/hostdev: "1"
nvidia.com/gpu: "1"
limits:
nvidia.com/hostdev: "1"
nvidia.com/gpu: "1"
After entering the Pod and running ifconfig, we find that a new interface named net1 has appeared. But what next? The Network Operator repository does provide test manifests5, but their commands are also just sleep inf.
My guess is that NCCL needs to be told which interface to use, among other things. Since the RDMA Shared Device Plugin approach later worked, I did not explore this path further; raising the question upstream might be a good next step.
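For the record, my untested guess at the NCCL configuration looked roughly like this; all three values are assumptions (net1 being the Multus-attached interface, and both HCAs being visible in the container), not something I verified:

```shell
# Hypothetical NCCL settings for the secondary network; values are assumptions.
export NCCL_SOCKET_IFNAME=net1    # assumed: the Multus-attached interface in the Pod
export NCCL_IB_HCA=mlx5_0,mlx5_1  # assumed: the HCAs visible inside the container
export NCCL_DEBUG=INFO            # verbose logs to confirm which transport NCCL picks
```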
To clean up stale resources, you can start kubectl proxy in one terminal:
$ kubectl proxy
Starting to serve on 127.0.0.1:8001
Then, in another terminal, run the cleanup script (note that / in the resource name must be escaped as ~1):
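Before the script itself, a note on the ~1: JSON Patch paths are JSON Pointers (RFC 6901), where / inside a key is written ~1 and ~ is written ~0 (escaping ~ first). A quick sketch of the escaping:

```shell
# Escape a key for use in a JSON Pointer path: '~' -> '~0' first, then '/' -> '~1'.
escape_pointer() { printf '%s' "$1" | sed -e 's/~/~0/g' -e 's|/|~1|g'; }

escape_pointer 'nvidia.com/hostdev'; echo
# -> nvidia.com~1hostdev
```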
#!/bin/bash
# Check if at least one node name is provided
if [ "$#" -lt 1 ]; then
echo "Usage: $0 <node-name> [<node-name>...]"
exit 1
fi
# Prepare the JSON patch data
PATCH_DATA=$(cat <<EOF
[
{"op": "remove", "path": "/status/capacity/nvidia.com~1hostdev"}
]
EOF
)
# Iterate over each node name provided as an argument
for NODE_NAME in "$@"
do
# Execute the PATCH request
curl --header "Content-Type: application/json-patch+json" \
--request PATCH \
--data "$PATCH_DATA" \
http://127.0.0.1:8001/api/v1/nodes/$NODE_NAME/status
echo "Patch request sent for node $NODE_NAME"
done
Pass in the node names to clean up:
chmod +x ./patch_node_gpu.sh
./patch_node_gpu.sh node1 node2
Verifying RDMA Installation
In this section, we will introduce how to continue verifying the RDMA installation based on the RDMA Shared Device Plugin method.
1. Preparing an RDMA-Supporting Image
[!quote] Verify if the image supports RDMA--Machine Learning Platform - Volcano Engine
A simple Dockerfile suitable for the V100 machine may look like this:
FROM xxx/envd:py3.12-ubuntu22.04-8978
USER root
# Install APT packages
RUN apt-get update && apt-get install -y \
infiniband-diags perftest ibverbs-providers libibumad3 \
libibverbs1 libnl-3-200 libnl-route-3-200 librdmacm1 && \
rm -rf /var/lib/apt/lists/*
# No Python dependencies specified
Here, my base image already includes common debugging tools, Python, and the CUDA environment; the Dockerfile mainly uses APT to add the InfiniBand-related libraries.
After installing these libraries, if we start a Pod without requesting RDMA resources, ibstat still prints normally, but attempting a write test reports that no InfiniBand or RoCE device is present.
2. Verification Method on a Single Machine
First, we need to start a Pod that requests RDMA resources:
apiVersion: v1
kind: Pod
metadata:
name: rdma-test-pod-1
spec:
containers:
- image: <image>
name: rdma-test-ctr
securityContext:
capabilities:
add: [ "IPC_LOCK" ]
resources:
limits:
nvidia.com/v100: "4"
rdma/rdma_v100: "1"
requests:
nvidia.com/v100: "4"
rdma/rdma_v100: "1"
command:
- sh
- -c
- |
sleep infinity
For regular GPU resources, we have renamed them by model (hence nvidia.com/v100); related information can be found in previous articles.
After the container starts successfully, enter the container:
- Enter the following command:
ib_write_bw -d mlx5_1 &
Sample output:
$ ib_write_bw -d mlx5_1 &
[1] 2457716
root@xxx-01:~#
************************************
* Waiting for client to connect... *
************************************
- Enter the following command on the same machine:
ib_write_bw -d mlx5_1 127.0.0.1 --report_gbits
Sample output:
$ ib_write_bw -d mlx5_1 127.0.0.1 --report_gbits
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_1
---------------------------------------------------------------------------------------
Number of qps : 1 Transport type : IB
RDMA_Write BW Test
Connection type : RC Using SRQ : OFF
Dual-port : OFF Device : mlx5_1
PCIe relax order: ON
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
ibv_wr* API : ON
TX depth : 128
CQ Moderation : 1
CQ Moderation : 1
Mtu : 4096[B]
Mtu : 4096[B]
Link type : IB
Link type : IB
Max inline data : 0[B]
Max inline data : 0[B]
rdma_cm QPs : OFF
rdma_cm QPs : OFF
Data ex. method : Ethernet
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------
local address: LID 0xXX QPN 0xXXXX PSN 0xXXXXXX RKey 0xXXXXXX VAddr 0xXXXXXXXXXXXX
local address: LID 0xXX QPN 0xXXXX PSN 0xXXXXXX RKey 0xXXXXXX VAddr 0xXXXXXXXXXXXX
remote address: LID 0xXX QPN 0xXXXX PSN 0xXXXXXX RKey 0xXXXXXX VAddr 0xXXXXXXXXXXXX
remote address: LID 0xXX QPN 0xXXXX PSN 0xXXXXXX RKey 0xXXXXXX VAddr 0xXXXXXXXXXXXX
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
Conflicting CPU frequency values detected: 1000.000000 != 3013.932000. CPU Frequency is not max.
65536 5000 94.72 94.71 0.180640
---------------------------------------------------------------------------------------
65536 5000 94.72 94.71 0.180640
---------------------------------------------------------------------------------------
[1]+ Done ib_write_bw -d mlx5_1
For V100 machines, the bandwidth values (BW peak, BW average) should be close to 100 Gb/s; for A100 machines, close to 200 Gb/s. If they are, the configuration is correct. If there is no output or an error occurs, go back to the configuration section for your machine model and check for missing items.
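To make the pass/fail judgment mechanical, one can compare the client's reported average against the link rate; the 90% threshold below is my own rule of thumb, not an official criterion:

```shell
# PASS if the measured bandwidth ($1, Gb/s) reaches 90% of the link rate ($2, Gb/s).
check_bw() {
  awk -v bw="$1" -v rate="$2" 'BEGIN { print ((bw >= 0.9 * rate) ? "PASS" : "FAIL") }'
}

check_bw 94.71 100   # V100 link (100 Gb/s)
check_bw 94.71 200   # the same figure on a 200 Gb/s A100 link
```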
3. Verification Method on Multiple Machines
Similar to the previous section, create two Pods, record the Kubernetes-internal IP address of one of them, and then run:
# server cmd
ib_write_bw -a -F --report_gbits -q 2
# client cmd
ib_write_bw -a -F --report_gbits -q 2 <server-pod-default-network-IP>
If the bandwidth here is likewise close to 100 Gb/s, inter-node communication is working.
4. vLLM Multi-Machine Distributed Inference Practice
Finally, we tested vLLM multi-node distributed inference of the DeepSeek R1 Distill Qwen 32B model via a Volcano Job. The model is mounted through a PVC, and the image is built with envd. Since vLLM installs its own CUDA 12.4 runtime, the base image does not need to include CUDA.
# syntax=v1
def build():
base(image="ubuntu:22.04",dev=True)
install.python(version="3.12")
install.apt_packages([
"openssh-server", "build-essential", "iputils-ping", "net-tools", "htop",
"infiniband-diags", "perftest", "ibverbs-providers", "libibumad3",
"libibverbs1", "libnl-3-200", "libnl-route-3-200", "librdmacm1"
])
config.pip_index(url = "https://pypi.tuna.tsinghua.edu.cn/simple")
install.python_packages(name = ["vllm"])
config.jupyter()
Afterwards, we started the Volcano Job:
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
name: vllm-rdma-test
namespace: crater-workspace
spec:
maxRetry: 3
minAvailable: 2
plugins:
pytorch:
- --master=master
- --worker=worker
- --port=23456
svc: []
policies:
- action: RestartJob
event: PodEvicted
queue: default
schedulerName: volcano
tasks:
- maxRetry: 3
minAvailable: 1
name: master
policies:
- action: CompleteJob
event: TaskCompleted
- action: TerminateJob
event: PodFailed
replicas: 1
template:
spec:
containers:
- command:
- sh
- -c
- |-
ray start --head --port=6667 --disable-usage-stats;
NCCL_DEBUG=TRACE python3 -m vllm.entrypoints.openai.api_server \
--model=/models/DeepSeek-R1-Distill-Qwen-32B \
--max-model-len 32768 \
--tensor-parallel-size 4 \
--pipeline-parallel-size 2 \
--gpu-memory-utilization 0.90 \
--max-num-seqs 128 \
--trust-remote-code \
--disable-custom-all-reduce \
--port 6666 \
--dtype=half;
image: xxx/envd-vllm:0.8.3-cu12.4-rdma-v1
name: master
resources:
limits:
nvidia.com/v100: "4"
rdma/rdma_v100: "1"
requests:
nvidia.com/v100: "4"
rdma/rdma_v100: "1"
securityContext:
capabilities:
add:
- IPC_LOCK
runAsGroup: 0
runAsUser: 0
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /dev/shm
name: crater-cache
- mountPath: /models/DeepSeek-R1-Distill-Qwen-32B
name: crater-ro-storage
readOnly: true
subPath: LLM/deepseek/DeepSeek-R1-Distill-Qwen-32B
workingDir: /models
restartPolicy: Never
volumes:
- emptyDir:
medium: Memory
name: crater-cache
- name: crater-ro-storage
persistentVolumeClaim:
claimName: crater-ro-storage
- maxRetry: 3
minAvailable: 1
name: worker
replicas: 1
template:
spec:
containers:
- command:
- sh
- -c
- |-
ray start --address="$MASTER_ADDR:6667";
sleep infinity;
image: xxx/envd-vllm:0.8.3-cu12.4-rdma-v1
name: worker
resources:
limits:
nvidia.com/v100: "4"
rdma/rdma_v100: "1"
requests:
nvidia.com/v100: "4"
rdma/rdma_v100: "1"
securityContext:
capabilities:
add:
- IPC_LOCK
runAsGroup: 0
runAsUser: 0
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /dev/shm
name: crater-cache
- mountPath: /models/DeepSeek-R1-Distill-Qwen-32B
name: crater-ro-storage
readOnly: true
subPath: LLM/deepseek/DeepSeek-R1-Distill-Qwen-32B
workingDir: /models
restartPolicy: OnFailure
volumes:
- emptyDir:
medium: Memory
name: crater-cache
- name: crater-ro-storage
persistentVolumeClaim:
claimName: crater-ro-storage
ttlSecondsAfterFinished: 259200
Following the vLLM documentation on distributed inference6, we enabled NCCL_DEBUG=TRACE; the logs show that NCCL used IB rather than Socket connections.
Meanwhile, Kubernetes network monitoring showed no inter-node traffic on the Pod network during inference, confirming that the traffic went over InfiniBand and that our deployment was successful.
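To pull the relevant evidence out of a long vLLM log, grepping for NCCL's transport lines is enough; the sample line below is illustrative of the NCCL_DEBUG output format, not copied from our run:

```shell
# Keep only the NCCL network-transport lines from a log stream.
nccl_net_lines() { grep -E 'NCCL INFO NET/(IB|Socket)'; }

printf '%s\n' \
  'worker:42:42 [0] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB' \
  'worker:42:42 [0] NCCL INFO Bootstrap : Using eth0:10.244.0.5' |
  nccl_net_lines
# -> worker:42:42 [0] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB
```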
Lifting the Memory Lock Limit
0. Problem Description
The errors below are all caused by the memory lock (memlock) limit.
The following errors occur during connectivity testing:
[host1] $ ib_read_bw -q 30
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
RDMA_Read BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 30 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
CQ Moderation : 1
Mtu : 4096[B]
Link type : IB
Outstand reads : 16
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
ethernet_read_keys: Couldn't read remote address
Unable to read to socket/rdma_cm
Failed to exchange data between server and clients
[host2] $ ib_read_bw -q 30 10.244.46.50
Couldn't allocate MR
failed to create mr
Failed to create MR
Couldn't create IB resources
When running RDMA bandwidth tests, small transfers (e.g., 1024 bytes) work, but transfers of 1M and above fail with messages like the above; see Issue #339.
Checking with ulimit -l directly gives 64 (KB), the system's default memory lock limit.
At this point, running ulimit -l unlimited inside the container or editing /etc/security/limits.conf does not change the limit.
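Interpreting ulimit -l: the number is in kilobytes, or the string "unlimited". A small helper for readability (the 64 KB case is exactly the default that breaks large memory registrations):

```shell
# Classify a `ulimit -l` value (kilobytes, or "unlimited").
memlock_status() {
  case "$1" in
    unlimited) echo "ok: memory locking is unrestricted" ;;
    *)         echo "restricted: at most ${1} KB may be locked" ;;
  esac
}

memlock_status unlimited   # -> ok: memory locking is unrestricted
memlock_status 64          # -> restricted: at most 64 KB may be locked
```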
1. Problem Analysis
The core of RDMA is letting the hardware (the HCA on the NIC) bypass the CPU and access remote memory directly. In normal operation, the OS may page out or migrate memory at any time to optimize usage, changing the physical address behind a virtual address. If the kernel pages out or moves a page while the NIC is mid-transfer, the transfer fails or the system errors out.
So, to ensure the address is absolutely stable, RDMA must perform "Memory Registration" (MR) before transmission. This action "locks" the specified virtual memory pages into physical memory at the kernel level, prohibiting the kernel from moving or swapping them to disk.
Crater already grants jobs CAP_IPC_LOCK so they can lock memory, but it never raised the cap on how much memory may be locked. And because the container is not given CAP_SYS_RESOURCE, the limit cannot be raised from inside the container either.
Currently, Kubernetes officially doesn't provide a ulimit setting in the Pod Spec, see Issue #3595. Therefore, it's necessary to handle it from the container runtime perspective.
Additionally, it's worth looking forward to the fact that the Kubernetes community has officially started a discussion on native support for Pod-level ulimit configuration in the v1.36 cycle, see KEP-5758.
The dockerd runtime provides the corresponding configuration item default-ulimits, which can easily configure this limit at the node level, see configuration file description and resource limit guide.
However, containerd is currently used in the cluster, and it doesn't provide a configuration item similar to dockerd, see configuration file description. Meanwhile, the developer explicitly refused to provide daemon-level configuration in Issue #3150. Therefore, another way is needed to solve this problem.
Additionally, after experimenting: directly modifying /etc/security/limits.conf on the node, or only setting LimitMEMLOCK=infinity for the container runtime daemon, also fails to lift the limit inside the Pod. The suspected root cause is that the container executor runc strictly follows the OCI spec "blueprint" and issues a setrlimit system call for the container at startup; this forcibly overrides (usually lowers) the container's limit to the blueprint's default (e.g., 64 KB), ignoring the daemon's own limits.
2. Solution
The core of this solution is customizing the container's base OCI runtime spec. In containerd, the JSON file referenced by the base_runtime_spec parameter is the container's base spec template, and it is the final authority on container resource boundaries such as the memlock limit (see the OCI runtime configuration specification). The underlying OCI runtime (e.g., runc) initializes the container process's resource limits via the setrlimit system call strictly according to this blueprint.
Exporting and Modifying the Existing Configuration
Since containerd does not merge this parameter with the built-in defaults (Merge) but replaces them wholesale (Replace), we must first export a template containing the system's full current definitions and then adjust it. Run the following on the node:
ctr oci spec > /etc/containerd/rdma-spec.json
vim /etc/containerd/rdma-spec.json
Modify the following part of the configuration file (append RLIMIT_MEMLOCK to the rlimits array):
"rlimits": [
{
"type": "RLIMIT_NOFILE",
"hard": 1024,
"soft": 1024
},
{
"type": "RLIMIT_MEMLOCK",
"hard": 18446744073709551615,
"soft": 18446744073709551615
}
],Referencing the Modified Configuration
Modify the containerd configuration.
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
# Modify to point to the absolute path of the above JSON template
base_runtime_spec = "/etc/containerd/rdma-spec.json"
cni_conf_dir = ""
cni_max_conf_num = 0
container_annotations = []
pod_annotations = []
privileged_without_host_devices = false
privileged_without_host_devices_all_devices_allowed = false
runtime_engine = ""
runtime_path = ""
runtime_root = ""
runtime_type = "io.containerd.runc.v2"
sandbox_mode = "podsandbox"
snapshotter = ""
Adapt this to your environment; the goal is for the corresponding Pods to actually pick up the new configuration.
Restarting the service after modifying the container runtime configuration does not affect Pods that are already running. You must delete the old Pods so that Kubernetes reschedules them and containerd initializes the new containers from the new OCI blueprint; only then does the configuration truly take effect.
Lifting containerd's Own Limit (Optional)
EDITOR=vim systemctl edit containerd
Add the following content to the daemon configuration.
[Service]
LimitMEMLOCK=infinity
Then restart the service with systemctl restart containerd.
Experiments show that modifying the OCI blueprint alone is sufficient (containerd itself holds the CAP_SYS_RESOURCE capability), but lifting the daemon's own limit as well makes the daemon more robust and is recommended as a best practice in production environments.
Testing
By now the memory lock limit inside the Pod should be lifted. Running the RDMA bandwidth test again, we see the following output on the master and worker respectively.
Master:
root@pyt-liuyizhou-260206-2a3fe-master-0:/# ib_write_bw -s 1M -d mlx5_0 -F
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 128
CQ Moderation : 1
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x2c QPN 0x0087 PSN 0xcc6a4c RKey 0x1fcc bd VAddr 0x037f67d3b5030
remote address: LID 0x2c QPN 0x0088 PSN 0x6dc93a RKey 0x1fcf be VAddr 0x037f026db4030
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
1048576 5000 11133.60 11129.09 0.011123
---------------------------------------------------------------------------------------
Worker:
root@pyt-liuyizhou-260206-2a3fe-worker-0:/# ib_write_bw -s 1M -d mlx5_0 -F 10.244.44.139
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 128
CQ Moderation : 1
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x2c QPN 0x0088 PSN 0x6dc93a RKey 0x1fcf be VAddr 0x037f026db4030
remote address: LID 0x2c QPN 0x0087 PSN 0xcc6a4c RKey 0x1fcc bd VAddr 0x037f67d3b5030
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
1048576 5000 11133.60 11129.09 0.011123
---------------------------------------------------------------------------------------
root@pyt-liuyizhou-260206-2a3fe-worker-0:/# ulimit -l
unlimited
As can be seen, the memory lock limit inside the Pod has been lifted, and high-bandwidth RDMA communication works normally.
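As a sanity check, perftest's MB/sec figures can be converted to line-rate units (perftest counts MB as 2^20 bytes); a quick Python sketch using the average reported above:

```python
MB = 2**20  # perftest reports bandwidth in units of 2^20 bytes per second

def mbps_to_gbps(mb_per_sec: float) -> float:
    """Convert a perftest MB/sec figure to Gb/sec (10^9 bits per second)."""
    return mb_per_sec * MB * 8 / 1e9

avg = 11129.09  # BW average[MB/sec] from the test output above
print(f"{avg} MB/sec ≈ {mbps_to_gbps(avg):.1f} Gb/sec")  # ≈ 93.4 Gb/sec
```

About 93.4 Gb/s is consistent with a link running at 100 Gb/s; compare against the Rate that ibstat reports in your own environment.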
Note that if the node kernel or related components are updated, it may be necessary to lift the memory lock limit again.
Making RDMA Tooling Pods Tolerate Crater Taints
0. Problem Description
After the node kernel upgrade, it was observed that RDMA jobs could not be scheduled to the node, but non-RDMA jobs could.
The reason is that the node did not report RDMA resources, even though the relevant network cards and drivers were working normally.
1. Problem Analysis
Resource Discovery Flow and Analysis of Blocking Points
In normal circumstances, the NVIDIA Network Operator automatically manages the node's RDMA resources through the following flow. This flow is re-triggered when the kernel is updated:
- Status Monitoring and Change Locking (Network Operator): The Operator monitors the host's kernel version in real time. Once a kernel change is detected, it immediately sets the node's network.nvidia.com/operator.mofed.wait label to true. This action "locks" the resource discovery path to prevent allocating incorrect resources before driver compatibility is confirmed.
- Driver Environment Verification and Reporting (MOFED Driver Pod): The Operator attempts to schedule and run the mofed-driver Pod on each node. This Pod checks, installs, or updates the host's Mellanox driver. Only after it runs successfully on the node and returns a readiness signal does the Operator set the mofed.wait label back to false.
  - Core Blocking Point: If the node carries custom business taints (e.g., crater.raids.io/account=exclusive) or is in maintenance mode (Cordoned, with the node.kubernetes.io/unschedulable taint), and the NicClusterPolicy has no corresponding Tolerations configured, the driver Pod is rejected by the scheduler. The flow gets stuck at this step, the node label stays permanently at true, and resource discovery remains locked.
- Plugin Activation and Resource Exposure (RDMA Shared Device Plugin Pod): Once mofed.wait becomes false, the previously gated RDMA Shared Device Plugin Pod runs its detection logic, identifies the physical network cards, and reports resources such as rdma/rdma_v100 to the Kubelet.
Driver Self-healing Mechanism (DKMS)
In this case, although the K8s control plane's resource discovery flow was blocked due to scheduling permission issues, the host's physical data plane was actually normal. This is because:
- DKMS (Dynamic Kernel Module Support): Host-level DKMS is configured. During the first boot after the kernel upgrade, the system automatically completed the reconstruction and loading of kernel modules in the background.
- Perception Gap: a mismatch arose between perception and reality: the physical driver was ready, but the Operator incorrectly believed it was invalid because the verification Pod could not be scheduled onto the node.
2. Solution
The key to solving this problem is to supplement the necessary toleration configurations for the NicClusterPolicy.
Configuration Location and Verification
Note that different versions of the NVIDIA Network Operator have different field support for NicClusterPolicy. Before attempting to modify, be sure to verify the configuration structure.
For example, in the v1alpha1 version, tolerations are usually configured globally under spec, rather than inside sub-components like ofedDriver or rdmaSharedDevicePlugin. You can confirm this with the following commands:
# Check if sub-components support tolerations (if not, it will prompt 'unknown field')
kubectl explain nicclusterpolicy.spec.ofedDriver
# Check if the spec level supports global tolerations
kubectl explain nicclusterpolicy.spec
Implementing a Precise Patch
After determining the path, it is recommended to use kubectl patch for "surgical" modification, avoiding the introduction of system metadata or status field conflicts from directly applying an exported YAML file.
Note: Patch operations are persistent modifications applied directly to the cluster instance. If your cluster is deployed via Helm and has resource auto-sync enabled (e.g., ArgoCD), a direct Patch may later be overwritten by the higher-level tool. In that case, also update the Helm values.yaml accordingly.
Before formal execution, be sure to add --dry-run=server for a server-side dry-run test to verify the correctness of the field path and syntax:
kubectl patch nicclusterpolicy nic-cluster-policy --type='merge' -p '{
"spec": {
"tolerations": [
{
"key": "crater.raids.io/account",
"operator": "Exists",
"effect": "NoSchedule"
},
{
"key": "node.kubernetes.io/unschedulable",
"operator": "Exists",
"effect": "NoSchedule"
}
]
}
}' --dry-run=server
If the dry-run output shows patched and there are no errors, remove --dry-run=server to apply the change for real.
Verification Steps
- Observe Pod Status: Run kubectl get pods -n nvidia-network-operator -w to confirm that the mofed-driver and rdma-shared-dp Pods start successfully on the node.
- Check Node Status: Run kubectl get node [node-name] -L network.nvidia.com/operator.mofed.wait to confirm that the label changes back to false.
- Confirm Resource Reporting: Run kubectl describe node [node-name]. You should see rdma/rdma_v100 again in the Allocatable list.
Problem Records
1. Segmentation fault error when starting vLLM
From the logs, the IB device has been successfully recognized, but a segmentation fault occurred.
[device-name]-master-0:528:528 [0] NCCL INFO Channel 00/02 : 0 1 2 3 4 5 6 7
[device-name]-master-0:528:528 [0] NCCL INFO Channel 01/02 : 0 1 2 3 4 5 6 7
[device-name]-master-0:528:528 [0] NCCL INFO Trees [0] 1/4/-1->0->-1 [1] 1/-1/-1->0->4
self.device_communicator = device_comm_cls(
[device-name]-master-0:528:528 [0] NCCL INFO P2P Chunksize set to 131072
[device-name]-master-0:528:528 [0] NCCL INFO Channel 00/0 : 7[3] -> 0[0] [receive] via NET/IB/0
[device-name]-master-0:528:528 [0] NCCL INFO Channel 01/0 : 7[3] -> 0[0] [receive] via NET/IB/0
[device-name]-master-0:528:528 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC
[device-name]-master-0:528:528 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/IPC
[device-name]-master-0:528:5379 [0] misc/socket.cc:50 NCCL WARN socketProgress: Connection closed by remote peer [device-name]-worker-0.[hostname].svc.cluster.local<35396>
[device-name]-master-0:528:5379 [0] NCCL INFO misc/socket.cc:752 -> 6
[device-name]-master-0:528:5379 [0] NCCL INFO transport/net_ib.cc:1207 -> 6
[device-name]-master-0:528:5379 [0] NCCL INFO transport/net.cc:837 -> 6
[device-name]-master-0:528:528 [0] NCCL INFO transport/net.cc:405 -> 6
[device-name]-master-0:528:528 [0] NCCL INFO transport.cc:183 -> 6
^^^^^^^^^^^^^^^^
[device-name]-master-0:528:528 [0] NCCL INFO init.cc:1263 -> 6
[device-name]-master-0:528:528 [0] NCCL INFO init.cc:1548 -> 6
[device-name]-master-0:528:528 [0] NCCL INFO init.cc:1799 -> 6
File "/opt/conda/envs/envd/lib/python3.12/site-packages/vllm/distributed/device_communicators/cuda_communicator.py", line 39, in __init__
[device-name]-master-0:528:5379 [0] misc/socket.cc:50 NCCL WARN socketProgress: Connection closed by remote peer [device-name]-worker-0.[hostname].svc.cluster.local<52144>
self.pynccl_comm = PyNcclCommunicator(
^^^^^^^^^^^^^^^^^^^
[device-name]-master-0:528:5379 [0] NCCL INFO misc/socket.cc:752 -> 6
^^^^^^^^^^^^^^^^^^^^^^^^^^^
[device-name]-master-0:528:5379 [0] NCCL INFO transport/net_ib.cc:1207 -> 6
File "/opt/conda/envs/envd/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl.py", line 99, in __init__
self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
File "/opt/conda/envs/envd/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 277, in ncclCommInitRank
self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
File "/opt/conda/envs/envd/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 256, in NCCL_CHECK
raise RuntimeError(f"NCCL error: {error_str}")
RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
[device-name]-master-0:528:5379 [0] NCCL INFO transport/net.cc:837 -> 6
[device-name]-master-0:528:528 [0] NCCL INFO init.cc:1837 -> 6
*** SIGSEGV received at time=1745072123 on cpu 70 ***
PC: @ 0x7ff94e269506 (unknown) ncclProxyServiceUDS()
@ 0x7ffa0c242520 3384 (unknown)
@ ... and at least 1 more frames
[2025-04-19 14:15:23,982 E 528 5383] logging.cc:484: *** SIGSEGV received at time=1745072123 on cpu 70 ***
[2025-04-19 14:15:23,982 E 528 5383] logging.cc:484: PC: @ 0x7ff94e269506 (unknown) ncclProxyServiceUDS()
[2025-04-19 14:15:23,983 E 528 5383] logging.cc:484: @ 0x7ffa0c242520 3384 (unknown)
[2025-04-19 14:15:23,983 E 528 5383] logging.cc:484: @ ... and at least 1 more frames
Fatal Python error: Segmentation fault
Remember that we mentioned earlier the need to add IPC_LOCK to the Pod's security context? If it is missing, the above problem occurs.
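For reference, a minimal sketch of the relevant part of the Pod spec (the container name and image are illustrative):

```yaml
# Pod spec fragment: grant IPC_LOCK so the process may pin (lock) memory,
# which the RDMA stack needs for registered memory regions.
containers:
  - name: rdma-workload            # illustrative name
    image: your-vllm-image:latest  # illustrative image
    securityContext:
      capabilities:
        add:
          - IPC_LOCK
```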
2. Multi-machine inference failure on the A100 machines
First, run single-machine verification on the A100 machines. When using a network card whose port is Up, everything looks fine:
$ ib_write_bw -d mlx5_1 &
[1] 1501
************************************
* Waiting for client to connect... *
************************************
$ ib_write_bw -d mlx5_1 127.0.0.1 --report_gbits
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_1
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
CQ Moderation : 1
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_1
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 128
CQ Moderation : 1
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
Conflicting CPU frequency values detected: 863.109000 != 2300.000000. CPU Frequency is not max.
65536 5000 183.66 183.61 0.350214
---------------------------------------------------------------------------------------
65536 5000 183.66 183.61 0.350214
---------------------------------------------------------------------------------------
$ ib_write_bw -d mlx5_0 &
[1] 1618
Port number 1 state is Down
Couldn't set the link layer
Couldn't get context for the device
$ ib_write_bw -d mlx5_0 127.0.0.1 --report_gbits
Port number 1 state is Down
Couldn't set the link layer
Couldn't get context for the device
$ ibstat
CA 'mlx5_0'
CA type: MT4123
Port 1:
State: Down
Physical state: Disabled
Rate: 10
Base lid: 65535
LMC: 0
SM lid: 0
Link layer: InfiniBand
CA 'mlx5_1'
CA type: MT4123
Port 1:
State: Active
Physical state: LinkUp
Rate: 200
Base lid: 3
LMC: 0
SM lid: 1
Link layer: InfiniBand
CA 'mlx5_bond_0'
CA type: MT4117
Port 1:
State: Active
Physical state: LinkUp
Rate: 10
Link layer: Ethernet
$ ib_write_bw -d mlx5_bond_0 &
IB device mlx5_bond_0 not found
Unable to find the Infiniband/RoCE device
But running vLLM produced an error, and it later turned out that this problem is related to vLLM V1 and Ray, not to IB. I happened to have vLLM multi-machine distributed inference code at hand, so I used it for testing; in hindsight, running something like NCCL Tests would have been better to avoid extraneous interference.
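Incidentally, the single-port number above is itself a useful health signal: 183.61 Gb/s against the 200 Gb/s rate that ibstat reports for mlx5_1 is roughly 92% of line rate, so the Up port was clearly fine. A quick check:

```python
measured_gbps = 183.61   # BW average from ib_write_bw --report_gbits above
link_rate_gbps = 200.0   # "Rate: 200" reported by ibstat for mlx5_1

# Fraction of the nominal link rate actually achieved by the benchmark.
efficiency = measured_gbps / link_rate_gbps * 100
print(f"link utilization ≈ {efficiency:.1f}%")  # ≈ 91.8%
```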
Summary
The above records the process of getting RDMA working in our on-premises Kubernetes cluster. The scarcity of relevant documentation and the breadth of the problem domains involved are the main obstacles to learning in this area; a fairly solid systems foundation is indeed required.