LXCFS Configuration
Configure LXCFS to support resource view isolation
Background
Over the past two years, we have built a cloud-native machine learning platform based on Kubernetes, gradually replacing the original cluster scheduling tool based on Slurm.
To keep the container-based approach as compatible as possible with the original workflow, we have made several attempts, but some issues remain, such as resource visibility inside containers.
User Story
Xiaoming is a graduate student in the field of deep learning and a user of the cloud-native machine learning platform.
One day, he applied for a Jupyter debug job on the platform. When starting the job, Xiaoming needed to select the number and type of CPU, Memory, and GPU. After that, the platform would render these limits into Kubernetes Pod Resources Requests and Limits:
resources:
  limits:
    cpu: "16"
    memory: 32Gi
    nvidia.com/a100: "1"
  requests:
    cpu: "16"
    memory: 32Gi
    nvidia.com/a100: "1"
After the job started, Xiaoming ran the `nvidia-smi` command in the job, and it displayed one GPU normally. However, when running commands like `lscpu` and `top`, he saw CPU cores and memory capacity far exceeding the 16C 32G he applied for (these were actually the host machine's resources):
$ top
MiB Mem : 385582.0 total, 258997.6 free, 24158.2 used, 105203.0 buff/cache
Xiaoming is not familiar with container technology. He thought the machine learning platform allocated resources similar to virtual machines, so he was a bit confused by this behavior.
Solution
The above issue not only affects user experience, but may also impact program performance. Runtimes such as Java and Go are affected: a Go program, for example, sets `GOMAXPROCS` at startup, which caps the number of OS threads that can execute Go code simultaneously [1]. In a container, however, this value still defaults to the host machine's CPU count. Scheduling far more threads than the few CPUs actually available causes frequent thread-switching overhead, slowing the program down.
We have two solutions for this:
- User-aware: In Slurm, the following environment variables are injected into the job to indicate the resources actually allocated to it [2]:

| Variable Name | Explanation |
|---|---|
| SLURM_CPUS_ON_NODE | Number of CPUs on the allocated node |
| SLURM_CPUS_PER_TASK | Number of CPUs per task |
| SLURM_GPUS_PER_NODE | Number of GPUs required per node |
| SLURM_MEM_PER_NODE | Amount of memory required per node |
Similarly, we can inject related environment variables when starting the Pod and agree on their meaning with the user (see the sketch after this list).
- User-unaware (though it still has some limitations of its own): for example, LXCFS, which is introduced below.
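For the "user-aware" approach, the Kubernetes Downward API can expose a container's own limits as environment variables. The following is a minimal sketch of the idea; the `JOB_*` names are illustrative rather than a platform convention, while `GOMAXPROCS` is read by the Go runtime at startup:

```yaml
# Sketch: expose the container's actual limits as environment variables.
env:
  - name: GOMAXPROCS                 # honored by the Go runtime at startup
    valueFrom:
      resourceFieldRef:
        resource: limits.cpu         # rounded up to a whole number of cores
  - name: JOB_CPUS                   # illustrative name agreed upon with users
    valueFrom:
      resourceFieldRef:
        resource: limits.cpu
  - name: JOB_MEMORY_BYTES           # illustrative name; value is the memory limit in bytes
    valueFrom:
      resourceFieldRef:
        resource: limits.memory
```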
Introduction to LXCFS
LXCFS (Linux Container Filesystem) is a user-space filesystem implementation based on the FUSE filesystem, aiming to solve the inherent limitations of the proc filesystem (procfs) in Linux container environments.
Specifically, it provides two main features:
- A set of files that can be bind-mounted over their `/proc` originals to provide cgroup-aware values.
- A container-aware tree similar to cgroupfs.
With LXCFS, when we read files such as `/proc/cpuinfo` inside the container, the request is "hijacked" by LXCFS via FUSE, and LXCFS combines the container's cgroup information to return values that reflect the container's actual allocation.
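For intuition, this is roughly the shape of the bind mounts that the webhook described later injects into a Pod. It is only a sketch: it assumes LXCFS serves its FUSE tree at `/var/lib/lxc/lxcfs` on the host (the path used throughout this post), and only two of the `/proc` files are shown; the real set also covers `stat`, `uptime`, `diskstats`, and so on:

```yaml
# Sketch: bind LXCFS-provided files over their /proc counterparts.
volumeMounts:
  - name: lxcfs-proc-cpuinfo
    mountPath: /proc/cpuinfo
  - name: lxcfs-proc-meminfo
    mountPath: /proc/meminfo
volumes:
  - name: lxcfs-proc-cpuinfo
    hostPath:
      path: /var/lib/lxc/lxcfs/proc/cpuinfo
      type: File
  - name: lxcfs-proc-meminfo
    hostPath:
      path: /var/lib/lxc/lxcfs/proc/meminfo
      type: File
```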
Limitations of Existing LXCFS for Kubernetes Solutions
The above idea is not difficult, and there are already many open-source solutions for LXCFS for Kubernetes:
| Project | Notes |
|---|---|
| denverdino/lxcfs-admission-webhook | Most starred, but incomplete and unmaintained for a long time |
| kubeservice-stack/lxcfs-webhook | Frequently updated, but has some bugs (I plan to submit a PR later) |
| cndoit18/lxcfs-on-kubernetes | Less actively maintained |
(TODO: Provide a simple introduction to the principles of the above solutions; skip for now, readers can refer to the relevant blog)
However, after studying and using these solutions in depth, I found that they all have issues to varying degrees:
1. Pod Resource Information Abnormal After Node Restart
> [!quote]
> kubeservice-lxcfs-webhook 1.4.0 · kubeservice/kubservice-charts
> Container Lifecycle Hooks | Kubernetes
When LXCFS is running normally, the Pod can view the rewritten Uptime and other information:
$ top
top - 07:47:52 up 9 min, 0 users, load average: 0.00, 0.00, 0.00
However, if the node restarts, by default, LXCFS does not continue to rewrite the relevant information inside the Pod:
$ top
top: failed /proc/stat open: Transport endpoint is not connected
To solve this problem, the community has proposed corresponding solutions [3]: we can leverage the Kubernetes Container Lifecycle Hooks mechanism so that, when LXCFS starts after a node restart, a hook remounts the LXCFS files into the Pods already running on that node.
However, that approach requires installing LXCFS on the node and configuring it to start automatically via systemd, which is not very cloud-native. Instead, we can mount the containerd socket into the LXCFS container, so that the remount does not rely on capabilities of the host (a rough sketch follows).
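A rough sketch of what this could look like on the LXCFS DaemonSet; the hook script name and socket path are illustrative, not the chart's actual values:

```yaml
# Sketch only: script name and paths are illustrative.
containers:
  - name: lxcfs
    lifecycle:
      postStart:
        exec:
          # Placeholder script: re-establishes the LXCFS bind mounts inside
          # Pods already running on this node, talking to containerd directly.
          command: ["/bin/bash", "-c", "/scripts/remount-running-pods.sh"]
    volumeMounts:
      - name: containerd-sock
        mountPath: /run/containerd/containerd.sock
volumes:
  - name: containerd-sock
    hostPath:
      path: /run/containerd/containerd.sock
      type: Socket
```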
2. LXCFS Container Exit and Re-creation Failure (Deadlock)
During my debugging, I found that if the LXCFS DaemonSet exits, re-creating the LXCFS DaemonSet before the node is restarted will consistently fail:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
lxcfs-77c87 0/1 CreateContainerError 0 18m
This is because, when the LXCFS container is created, it needs to mount the host's `/var/lib/lxc/lxcfs` directory, but that directory is still a dead FUSE mount point left behind by the previous instance (the "Transport endpoint is not connected" state above) and only becomes mountable again after LXCFS is running, resulting in a deadlock.
To address this, we can use the Kubernetes Container Lifecycle Hooks mechanism to clean up the relevant mount points before LXCFS exits:
preStop:
  exec:
    command:
      - bash
      - -c
      - nsenter -m/proc/1/ns/mnt fusermount -u /var/lib/lxc/lxcfs 2> /dev/null || true
However, the above method is not foolproof: if the cleanup fails, a node restart is still required.
To solve this completely, we can declare another volume pointing to the parent directory of the lxcfs mount point and unmount any residual mounts in an init container, as sketched below.
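A minimal sketch of such an init container, assuming the host's `/var/lib/lxc` (the parent directory of the mount point) is exposed via a HostPath volume; the names and image are illustrative:

```yaml
initContainers:
  - name: cleanup-stale-lxcfs
    image: ubuntu:22.04                  # illustrative; any image with umount/fusermount will do
    securityContext:
      privileged: true
    command:
      - bash
      - -c
      - |
        # /host-lxc is the host's /var/lib/lxc; remove any stale FUSE mount left
        # behind by a previous LXCFS instance, and ignore errors if none exists.
        fusermount -u /host-lxc/lxcfs 2>/dev/null || umount -l /host-lxc/lxcfs 2>/dev/null || true
    volumeMounts:
      - name: lxcfs-parent
        mountPath: /host-lxc
        mountPropagation: Bidirectional
volumes:
  - name: lxcfs-parent
    hostPath:
      path: /var/lib/lxc
```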
3. LXCFS Version is Relatively Outdated
Currently, LXCFS has been updated to version 6.0, but the mainstream version in the community is still 4.0.
However, higher versions of LXCFS require higher versions of glibc and other libraries, and the version to use should be selected based on the actual situation of the cluster.
4. Depends on the Host's libfuse.so
When deploying a DaemonSet in Kubernetes, there may be an error:
/usr/local/bin/lxcfs: error while loading shared libraries: libfuse.so.2: cannot open shared object file: No such file or directory
To resolve this, the first method is to install `libfuse2` on the node (the package name differs on CentOS), and to ensure that `libfuse2` is installed on all nodes using Ansible:
- name: Ensure libfuse2 is installed
  hosts: all
  become: yes
  gather_facts: yes
  tasks:
    - name: Check if libfuse2 is installed
      apt:
        name: libfuse2
        state: present
      register: libfuse2_installed
      changed_when: libfuse2_installed.changed
$ ansible-playbook -i hosts lxcfs.yaml
PLAY [Ensure libfuse2 is installed]
TASK [Gathering Facts]
ok: [192.168.5.75]
ok: [192.168.5.1]
TASK [Check if libfuse2 is installed]
ok: [192.168.5.1]
changed: [192.168.5.75]
PLAY RECAP
192.168.5.1 : ok=2 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
192.168.5.75 : ok=2 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
Another method is to adjust the Dockerfile and the startup script so that the LXCFS container ships the required dynamic libraries itself; this is the approach detailed in the design section below.
Install LXCFS Webhook
To address the above issues, we have integrated and optimized multiple solutions to provide Yet Another LXCFS Webhook.
1. Dependencies
First, install Cert Manager (if not already installed):
helm repo add jetstack https://charts.jetstack.io --force-update
To install the cert-manager Helm chart, use the Helm install command as follows.
helm install \
cert-manager jetstack/cert-manager \
--namespace cert-manager \
--create-namespace \
--version v1.17.2 \
--set crds.enabled=true
2. Install via Helm
After cloning the code, install via Helm:
helm upgrade --install lxcfs-webhook ./dist/chart -n lxcfs
This installs the LXCFS DaemonSet and the webhook, and addresses the issues described above, such as node restarts and DaemonSet restarts.
3. Specify Scope
Then, add a label to the namespaces you want to cover:
kubectl label namespace <namespace-name> lxcfs-admission-webhook=enabled
Pods within the corresponding namespace will automatically mount LXCFS when created.
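Equivalently, the label can be declared in the namespace manifest (the namespace name here is illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: demo                            # illustrative
  labels:
    lxcfs-admission-webhook: enabled    # same label as in the kubectl command above
```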
LXCFS Webhook Design
1. LXCFS DaemonSet Image Building
To build an image that does not depend on the host's libfuse.so, we first check where the required shared libraries live with `ldconfig -p`, and where the LXCFS build artifacts end up:
$ ldconfig -p | grep libfuse.so.2
libfuse.so.2 (libc6,x86-64) => /lib/x86_64-linux-gnu/libfuse.so.2
$ ldconfig -p | grep libulockmgr.so
libulockmgr.so.1 (libc6,x86-64) => /lib/x86_64-linux-gnu/libulockmgr.so.1
libulockmgr.so (libc6,x86-64) => /lib/x86_64-linux-gnu/libulockmgr.so
$ ls /lxcfs/build/
build.ninja config.h lxcfs lxcfs.spec meson-private
compile_commands.json liblxcfs.so lxcfs.1 meson-info share
config liblxcfs.so.p lxcfs.p meson-logs tests
Then, for Ubuntu, we perform a two-stage Docker build:
# LXCFS Builder Image
# Builds LXCFS from source on Ubuntu 22.04
FROM crater-harbor.act.buaa.edu.cn/docker.io/ubuntu:22.04 AS build
# Environment configuration
ENV DEBIAN_FRONTEND=noninteractive \
LXCFS_VERSION=v6.0.4
# Install build dependencies
RUN apt-get update && \
apt-get --purge remove -y lxcfs && \
apt-get install -y --no-install-recommends \
build-essential \
cmake \
fuse3 \
git \
help2man \
libcurl4-openssl-dev \
libfuse-dev \
libtool \
libxml2-dev \
m4 \
meson \
mime-support \
pkg-config \
python3-pip \
systemd \
wget \
autotools-dev \
automake && \
rm -rf /var/lib/apt/lists/*
# Install Python dependencies
RUN pip3 install --no-cache-dir -U jinja2 \
-i https://mirrors.aliyun.com/pypi/simple/
# Download and build LXCFS from source (LXCFS 5.x/6.x uses the Meson/Ninja build system)
RUN wget https://linuxcontainers.org/downloads/lxcfs/lxcfs-${LXCFS_VERSION}.tar.gz && \
mkdir /lxcfs && \
tar xzvf lxcfs-${LXCFS_VERSION}.tar.gz -C /lxcfs --strip-components=1 && \
cd /lxcfs && \
meson setup build && \
meson compile -C build && \
rm -f /lxcfs-${LXCFS_VERSION}.tar.gz
FROM crater-harbor.act.buaa.edu.cn/docker.io/ubuntu:22.04
STOPSIGNAL SIGINT
COPY --from=build /lxcfs/build/lxcfs /lxcfs/lxcfs
COPY --from=build /lxcfs/build/liblxcfs.so /lxcfs/liblxcfs.so
COPY --from=build /lib/x86_64-linux-gnu/libfuse.so.2.9.9 /lxcfs/libfuse.so.2.9.9
COPY --from=build /lib/x86_64-linux-gnu/libulockmgr.so.1.0.1 /lxcfs/libulockmgr.so.1.0.1
CMD ["/bin/false"]
Here, we first stage the binary and the related dynamic libraries in the temporary `/lxcfs` directory, because their usual installation paths may be shadowed by HostPath mounts at runtime. We then write a startup script that copies them back into place:
#!/bin/bash
set -e # Exit immediately if a command exits with a non-zero status.
# Cleanup
nsenter -m/proc/1/ns/mnt fusermount -u /var/lib/lxc/lxcfs 2> /dev/null || true
nsenter -m/proc/1/ns/mnt [ -L /etc/mtab ] || \
sed -i "/^lxcfs \/var\/lib\/lxc\/lxcfs fuse.lxcfs/d" /etc/mtab
# Prepare
mkdir -p /usr/local/lib/lxcfs /var/lib/lxc/lxcfs
# Update lxcfs
cp -f /lxcfs/lxcfs /usr/local/bin/lxcfs
cp -f /lxcfs/liblxcfs.so /lib/x86_64-linux-gnu/liblxcfs.so
cp -f /lxcfs/libfuse.so.2.9.9 /lib/x86_64-linux-gnu/libfuse.so.2.9.9
cp -f /lxcfs/libulockmgr.so.1.0.1 /lib/x86_64-linux-gnu/libulockmgr.so.1.0.1
# Remove old links
rm -f /lib/x86_64-linux-gnu/libfuse.so.2 /lib/x86_64-linux-gnu/libulockmgr.so.1 /lib/x86_64-linux-gnu/libulockmgr.so
# Create new links
ln -s /lib/x86_64-linux-gnu/libfuse.so.2.9.9 /lib/x86_64-linux-gnu/libfuse.so.2
ln -s /lib/x86_64-linux-gnu/libulockmgr.so.1.0.1 /lib/x86_64-linux-gnu/libulockmgr.so.1
ln -s /lib/x86_64-linux-gnu/libulockmgr.so.1.0.1 /lib/x86_64-linux-gnu/libulockmgr.so
# Update library cache
nsenter -m/proc/1/ns/mnt ldconfig
# Mount
exec nsenter -m/proc/1/ns/mnt /usr/local/bin/lxcfs /var/lib/lxc/lxcfs/ --enable-cfs -l -o nonempty
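For context, the following is a rough sketch of how the DaemonSet can wire this script up; the field values are illustrative and the actual chart may differ. `hostPID: true` and `privileged: true` are needed because the script uses nsenter to enter the host's mount namespace, and the host directories the script copies into are HostPath-mounted, which is exactly why the image stages its files under `/lxcfs` first:

```yaml
# Sketch of the LXCFS DaemonSet Pod spec; values are illustrative.
spec:
  hostPID: true                            # so /proc/1/ns/mnt is the host's mount namespace
  containers:
    - name: lxcfs
      image: <lxcfs-image>                 # built from the Dockerfile above
      command: ["/bin/bash", "/start.sh"]  # the startup script above; path is illustrative
      securityContext:
        privileged: true
      volumeMounts:
        # Host directories the script copies the binary and libraries into.
        - name: usr-local-bin
          mountPath: /usr/local/bin
        - name: lib-dir
          mountPath: /lib/x86_64-linux-gnu
        # The LXCFS FUSE mount point (parent directory, per the deadlock fix above).
        - name: lxc
          mountPath: /var/lib/lxc
          mountPropagation: Bidirectional
  volumes:
    - name: usr-local-bin
      hostPath: { path: /usr/local/bin }
    - name: lib-dir
      hostPath: { path: /lib/x86_64-linux-gnu }
    - name: lxc
      hostPath: { path: /var/lib/lxc, type: DirectoryOrCreate }
```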
2. LXCFS Webhook Function Design
The webhook's functionality is simple, and the Kubebuilder framework lets us scaffold it quickly. We implemented both a mutating and a validating webhook. The validating webhook mainly checks that the LXCFS-related annotations on the Pod (such as the ignore annotation) carry valid values.
The mutating webhook first checks whether the Pod needs to be mutated. If so, it marks the Pod as mutated via an annotation and adds the LXCFS Volumes and VolumeMounts to it.
// Default implements webhook.CustomDefaulter so a webhook will be registered for the Kind Pod.
func (d *PodLxcfsDefaulter) Default(ctx context.Context, obj runtime.Object) error {
	pod, ok := obj.(*corev1.Pod)
	if !ok {
		return fmt.Errorf("expected a Pod object but got %T", obj)
	}
	podlog.Info("Defaulting for Pod", "name", pod.GetName(), "namespace", pod.GetNamespace())

	// Check if the Pod should be mutated
	if !mutationRequired(pod) {
		podlog.Info("Skipping mutation for Pod", "name", pod.GetName(), "namespace", pod.GetNamespace())
		return nil
	}

	// Mark the Pod as mutated by adding the status annotation
	if pod.Annotations == nil {
		pod.Annotations = make(map[string]string)
	}
	pod.Annotations[AdmissionWebhookAnnotationStatusKey] = StatusValueMutated

	// Add LXCFS VolumeMounts to all containers
	for i := range pod.Spec.Containers {
		container := &pod.Spec.Containers[i]
		if container.VolumeMounts == nil {
			container.VolumeMounts = make([]corev1.VolumeMount, 0)
		}
		container.VolumeMounts = append(container.VolumeMounts, VolumeMountsTemplate...)
	}

	// Add LXCFS Volumes to the Pod
	if pod.Spec.Volumes == nil {
		pod.Spec.Volumes = make([]corev1.Volume, 0)
	}
	pod.Spec.Volumes = append(pod.Spec.Volumes, VolumesTemplate...)

	return nil
}
Verification
Apply for a job with 1 CPU and 2 GiB of memory, then check the CPU and memory visible inside the container:
$ cat /proc/meminfo | grep MemTotal:
MemTotal: 2097152 kB
$ cat /proc/cpuinfo | grep processor
processor : 0
$ cat /proc/cpuinfo | grep processor | wc -l
1
Summary
Through the above solution, we can make debug jobs on the machine learning platform behave more like virtual machines, reducing the cognitive burden on users. However, the LXCFS approach still has limitations; for example, the commonly used `nproc` command still reports the host machine's CPU count [4].
Users of the machine learning platform usually have limited knowledge of container technology. Helping them understand the reasons for these remaining inconsistencies, and the corresponding workarounds, is still an open problem for us.
Footnotes
1. Container Resource Visibility Issues and GOMAXPROCS Configuration · Issue #216 · islishude/blog
2. Slurm Job Scheduling System User Guide | Supercomputing Center of USTC
3. lxcfs-admission-webhook/lxcfs-image/start.sh at 23298354a1d3cd6eaeb76417aa3fea75df5cdd63 · ThinkBlue1991/lxcfs-admission-webhook · GitHub