Crater Cluster Deployment Guide
This guide outlines the standardized deployment process for the Crater platform on a production-grade Kubernetes cluster, covering environment preparation, dependency installation, core service deployment, and common troubleshooting.
1. Environment and Prerequisites
| Item | Description |
|---|---|
| Operating System | Ubuntu 22.04 |
| Kubernetes | v1.31.x |
| Container Runtime | containerd 1.7.x |
| Helm | v3.x |
| Node Configuration | 1 control node, ≥ 2 worker nodes |
| Network Requirements | Nodes must be able to reach each other over the internal network; at least one node must have external network access or a proxy configured |
Example:
kubectl get nodes -o wide
Sample Output:
NAME STATUS ROLES VERSION INTERNAL-IP OS-IMAGE
node-1 Ready control-plane v1.31.2 <your-node-ip> Ubuntu 22.04.5 LTS
node-2 Ready <none> v1.31.2 <your-node-ip> Ubuntu 22.04.5 LTS
node-3 Ready <none> v1.31.2 <your-node-ip> Ubuntu 22.04.5 LTS
2. Installation of Cluster Dependencies
The Crater platform relies on the following foundational components for scheduling, monitoring, networking, and storage.
This section includes installation commands and image information for each component.
2.1 Metrics Server
Function: Provides CPU and memory metrics for nodes and pods, enabling HPA-based auto-scaling.
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server
helm repo update
helm pull metrics-server/metrics-server --untar --destination <your-path>/charts
helm install metrics-server <your-path>/charts/metrics-server \
  -n kube-system --create-namespace
Images:
registry.k8s.io/metrics-server/metrics-server:v0.8.0
# For users in China, replace with:
swr.cn-north-4.myhuaweicloud.com/ddn-k8s/registry.k8s.io/metrics-server/metrics-server:v0.8.0
Verification:
kubectl get pods -n kube-system | grep metrics-server
kubectl top nodes
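With node and pod metrics available, Kubernetes can autoscale workloads via the HPA. A minimal sketch, assuming a hypothetical Deployment named web that already sets CPU requests:
kubectl autoscale deployment web --cpu-percent=80 --min=1 --max=5
kubectl get hpa web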
2.2 NVIDIA GPU Operator
Function: Installs GPU drivers, device plugins, and monitoring components.
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
helm pull nvidia/gpu-operator --untar --destination <your-path>/charts
helm install gpu-operator <your-path>/charts/gpu-operator \
  -n gpu-operator --create-namespace
Key Images:
| Component | Image |
|---|---|
| Driver Container | nvcr.io/nvidia/driver:525.125.06 |
| Device Plugin | nvcr.io/nvidia/k8s-device-plugin:v0.15.0 |
| DCGM Exporter | nvcr.io/nvidia/dcgm-exporter:3.1.6-3.1.3-ubuntu22.04 |
| MIG Manager | nvcr.io/nvidia/mig-manager:0.6.0 |
| Node Feature Discovery | ghcr.io/kubernetes-sigs/node-feature-discovery:v0.16.1 |
Verification:
kubectl get pods -n gpu-operator
nvidia-smi
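Beyond running nvidia-smi on the node, GPU scheduling inside the cluster can be smoke-tested with a throwaway pod. A minimal sketch; the pod name and CUDA image tag are examples, and the pod requests one GPU:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
kubectl logs gpu-smoke-test
If the logs show the usual nvidia-smi table, clean up with kubectl delete pod gpu-smoke-test.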
2.3 CloudNativePG (PostgreSQL Operator)
Function: Provides high-availability PostgreSQL database services.
helm repo add cnpg https://cloudnative-pg.github.io/charts
helm repo update
helm pull cnpg/cloudnative-pg --untar --destination <your-path>/charts
helm install cnpg <your-path>/charts/cloudnative-pg \
  -n cnpg-system --create-namespace
Images:
ghcr.io/cloudnative-pg/postgresql:16.3
ghcr.io/cloudnative-pg/cloudnative-pg:1.24.0
Create a sample database cluster:
cat <<EOF | kubectl apply -f -
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: crater-postgresql
  namespace: cnpg-system
spec:
  instances: 3
  storage:
    size: 10Gi
EOF
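CloudNativePG exposes the cluster through per-role Services (suffixed -rw, -ro, -r) and stores generated application credentials in a Secret, typically named <cluster>-app. A sketch for locating them, assuming the crater-postgresql cluster above:
kubectl get svc -n cnpg-system | grep crater-postgresql
kubectl get secret crater-postgresql-app -n cnpg-system -o jsonpath='{.data.password}' | base64 -d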
2.4 NFS Storage System
Function: Provides shared storage (ReadWriteMany mode) for user workspaces and public data.
2.4.1 Install NFS Server (Optional)
Run on a node with persistent disk:
sudo apt update
sudo apt install -y nfs-kernel-server
sudo mkdir -p /data/nfs
sudo chown -R nobody:nogroup /data/nfs
sudo chmod 777 /data/nfs
echo "/data/nfs *(rw,sync,no_subtree_check,no_root_squash)" | sudo tee -a /etc/exports
sudo exportfs -a
sudo systemctl enable --now nfs-server
Verification:
showmount -e <your-node-ip>
2.4.2 Create NFS StorageClass
Run in the Kubernetes cluster:
cat <<EOF | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs
provisioner: nfs-provisioner
parameters:
  archiveOnDelete: "false"
reclaimPolicy: Retain
volumeBindingMode: Immediate
EOF
2.4.3 Deploy NFS Provisioner
helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner
helm repo update
helm install nfs-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
-n nfs-system --create-namespace \
--set nfs.server=<your-node-ip> \
--set nfs.path=/data/nfs \
--set storageClass.name=nfs \
  --set storageClass.defaultClass=true
Verification:
kubectl get pods -n nfs-system
kubectl get sc
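To confirm that dynamic provisioning works end to end, create a throwaway ReadWriteMany PVC against the nfs StorageClass; a minimal sketch (the PVC name is an example):
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-test
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: nfs
  resources:
    requests:
      storage: 1Gi
EOF
kubectl get pvc nfs-test   # should become Bound
kubectl delete pvc nfs-test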
2.5 Prometheus Stack
Function: Integrates Prometheus, Grafana, Alertmanager, and other monitoring components.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm pull prometheus-community/kube-prometheus-stack --untar --destination <your-path>/charts
helm install prometheus <your-path>/charts/kube-prometheus-stack \
  -n monitoring --create-namespace
Key Images:
| Component | Image |
|---|---|
| Prometheus | quay.io/prometheus/prometheus:v2.54.1 |
| Grafana | docker.io/grafana/grafana:10.4.1 |
| Alertmanager | quay.io/prometheus/alertmanager:v0.27.0 |
| Node Exporter | quay.io/prometheus/node-exporter:v1.8.1 |
| Kube-State-Metrics | registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.11.0 |
Access Grafana:
kubectl port-forward svc/prometheus-grafana 3000:80 -n monitoring
# Access in browser: http://localhost:3000
# Default credentials: admin / prom-operator
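Crater later queries the Prometheus HTTP API directly (see section 4.4), so it is worth confirming that the API answers queries. A sketch using the service name created by this chart release:
kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring
# In another terminal:
curl 'http://localhost:9090/api/v1/query?query=up'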
2.6 Volcano Scheduler
Function: Provides GPU job scheduling, queue management, and task preemption.
helm repo add volcano https://volcano-sh.github.io/helm-charts
helm repo update
helm pull volcano/volcano --untar --destination <your-path>/charts
helm install volcano <your-path>/charts/volcano \
-n volcano-system --create-namespace \
  -f <your-path>/volcano/values.yaml
Key Images:
| Component | Image |
|---|---|
| Scheduler | volcano.sh/volcano-scheduler:v1.9.0 |
| Controller | volcano.sh/volcano-controllers:v1.9.0 |
| Admission | volcano.sh/volcano-admission:v1.9.0 |
| Webhook | volcano.sh/volcano-webhook:v1.9.0 |
Verification:
kubectl get pods -n volcano-system
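To confirm the Volcano CRDs are usable, you can create a small test queue; a minimal sketch (the queue name and weight are examples):
cat <<EOF | kubectl apply -f -
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: test-queue
spec:
  weight: 1
EOF
kubectl get queues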
2.7 MetalLB (Bare-Metal Load Balancer)
Function: Provides LoadBalancer IP support for bare-metal clusters.
helm repo add metallb https://metallb.github.io/metallb
helm repo update
helm pull metallb/metallb --untar --destination <your-path>/charts
helm install metallb <your-path>/charts/metallb \
  -n metallb-system --create-namespace
Key Images:
| Component | Image |
|---|---|
| Controller | quay.io/metallb/controller:v0.14.8 |
| Speaker | quay.io/metallb/speaker:v0.14.8 |
Example IP address pool configuration:
cat <<EOF | kubectl apply -f -
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default-address-pool
  namespace: metallb-system
spec:
  addresses:
  - <your-ip-range> # Example: 192.168.1.200-192.168.1.220
EOF
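In Layer 2 mode, MetalLB only announces addresses from pools referenced by an L2Advertisement. A minimal sketch that advertises the pool defined above:
cat <<EOF | kubectl apply -f -
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default-l2
  namespace: metallb-system
spec:
  ipAddressPools:
  - default-address-pool
EOF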
2.8 ingress-nginx (Ingress Controller)
Function: Provides external access points for the Crater frontend and API.
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install ingress-nginx ingress-nginx/ingress-nginx \
-n ingress-nginx --create-namespace \
--version 4.11.3 \
--set controller.hostNetwork=true \
--set controller.dnsPolicy=ClusterFirstWithHostNet \
--set controller.healthCheckHost="<your-node-ip>" \
  --set 'controller.nodeSelector.kubernetes\.io/hostname=node-2'
Key Images:
| Component | Image |
|---|---|
| Controller | registry.k8s.io/ingress-nginx/controller:v1.9.6 |
| Admission Webhook | registry.k8s.io/ingress-nginx/kube-webhook-certgen:v1.6.0 |
Verification:
kubectl get pods -n ingress-nginx -o wide
kubectl get svc -n ingress-nginx -o wide
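Because the controller runs with hostNetwork on the selected node, it should answer directly on that node's ports 80/443. A quick sketch (a 404 from nginx is expected until Ingress rules exist; the Host header value is an example):
curl -H "Host: crater.example.com" http://<your-node-ip>/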
3. Harbor Registry Deployment
helm repo add harbor https://helm.goharbor.io
helm repo update
helm pull harbor/harbor --version 1.16.0 --untar --destination <your-path>/charts
Example configuration file (values.yaml):
expose:
  type: nodePort
  tls:
    enabled: false
  nodePort:
    ports:
      http:
        port: 30002
externalURL: http://<your-node-ip>:30002
harborAdminPassword: "<MUSTEDIT>"
persistence:
  enabled: true
  persistentVolumeClaim:
    registry:
      size: 50Gi
Installation command:
helm install harbor <your-path>/charts/harbor \
-n harbor-system --create-namespace \
  -f <your-path>/harbor/values.yaml
Access URL:
http://<your-node-ip>:30002
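Because Harbor is exposed over plain HTTP here, container clients must be configured to treat it as an insecure registry before pushes succeed. A hedged login sketch from a machine whose Docker daemon already lists <your-node-ip>:30002 under insecure-registries:
docker login <your-node-ip>:30002 -u admin -p '<MUSTEDIT>'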
4. Crater Platform Deployment
Pull the Helm Chart:
helm pull oci://ghcr.io/raids-lab/crater --version 0.1.5 --untar
The core configuration file for the Crater platform is values.yaml.
This file defines cluster domain name, database connection, monitoring service addresses, storage PVCs, and connection parameters for external dependencies such as Harbor.
Before running helm install, update the relevant fields according to the following instructions.
4.1 Basic Information
# Platform access domain name
host: crater.example.com
# Protocol type: "http" or "https"
protocol: http
# Initial admin account
firstUser:
  username: crater-admin
  password: <MUSTEDIT>
If Ingress and DNS are configured, use a real domain name (e.g., crater.mycluster.local).
For testing environments only, use the control node IP.
4.2 Storage Configuration (NFS)
Since an NFS shared storage system has been deployed, the storage section should reference the corresponding StorageClass:
storage:
  create: true
  request: 10Gi
  storageClass: "nfs" # Use the NFS StorageClass created earlier
  pvcName: "crater-rw-storage" # Name of the shared PVC to mount
The Crater backend will automatically mount this PVC for user space and public directories (users/, accounts/, public/).
4.3 PostgreSQL Database Configuration (CloudNativePG)
Crater uses a database cluster deployed via CloudNativePG.
Set the database connection parameters to point to the corresponding service:
backendConfig:
  postgres:
    host: crater-postgresql-rw.cnpg-system.svc.cluster.local # Read-write service of the CloudNativePG cluster
    port: 5432
    dbname: postgres
    user: postgres
    password: <MUSTEDIT>
    sslmode: disable
    TimeZone: Asia/Shanghai
Note: The host can be found with kubectl get svc -n cnpg-system. If the CloudNativePG cluster is named crater-postgresql as above, the read-write service name is crater-postgresql-rw.cnpg-system.svc.cluster.local.
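To verify the connection parameters before installing Crater, you can run a one-off psql client inside the cluster; a sketch that assumes the postgres password configured above is valid:
kubectl run psql-client --rm -it --restart=Never --image=postgres:16 \
  --env="PGPASSWORD=<MUSTEDIT>" -- \
  psql -h crater-postgresql-rw.cnpg-system.svc.cluster.local -U postgres -c '\l'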
4.4 Monitoring System Configuration (Prometheus Stack)
Crater backend retrieves GPU and job metrics via the Prometheus API.
Set backendConfig.prometheusAPI to the Prometheus service address in kube-prometheus-stack:
backendConfig:
  prometheusAPI: http://prometheus-kube-prometheus-prometheus.monitoring:9090
kubectl get svc -n monitoring | grep prometheus
Grafana integration example:
grafanaProxy:
  enable: true
  address: http://prometheus-grafana.monitoring # Grafana service name
  token: <MASKED> # Read-only Grafana API token
  host: gpu-grafana.example.com # External access domain name
4.5 Harbor Registry Configuration
If Harbor is already deployed in the cluster, enable the Registry integration.
After enabling, Crater can automatically push built images to the Harbor registry.
backendConfig:
  registry:
    enable: true
    harbor:
      server: harbor.example.com # Harbor access domain name
      user: admin # Admin username
      password: <MUSTEDIT> # Admin password
    buildTools:
      proxyConfig:
        httpProxy: null
        httpsProxy: null
        noProxy: null
If Harbor is not yet enabled, keep enable: false.
4.6 Ingress and TLS Configuration (ingress-nginx + cert-manager)
Crater exposes services via Ingress by default.
If ingress-nginx is enabled and certificates are prepared, specify them in backendConfig.secrets:
backendConfig:
  secrets:
    tlsSecretName: crater-tls-secret
    tlsForwardSecretName: crater-tls-forward-secret
    imagePullSecretName: ""
Create the certificate with:
kubectl create secret tls crater-tls-secret \
  --cert=tls.crt --key=tls.key -n crater-system
If HTTPS is not enabled, keep the default values; the protocol remains HTTP.
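If no CA-issued certificate is available yet, a self-signed pair can be generated for testing before creating the secret above; a sketch (the CN should match your platform host):
openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
  -keyout tls.key -out tls.crt \
  -subj "/CN=crater.example.com"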
4.7 LDAP Configuration
If your organization already has an existing LDAP directory service (such as OpenLDAP or Active Directory), you can configure Crater to integrate with this service for unified identity authentication.
If LDAP is enabled, the system will automatically complete account registration upon the user's first login and continuously synchronize information such as nickname and email from LDAP.
If you do not have such requirements, you can directly set auth.ldap.enable to false, and other LDAP configuration items can be left at their default values.
backendConfig:
  auth:
    ldap:
      enable: true
      # Short display name for the LDAP login method in the UI, e.g. "ACT"
      # The UI will append suffixes like "Login" or "Unified Identity", so keep it brief.
      alias: "ACT"
      # Help text shown when hovering over the LDAP login option
      help: "Please use your centralized identity account to sign in. The system will automatically synchronize your profile from LDAP."
      server:
        address: "ldap://ldap.example.com:389"
        bindDN: "cn=admin,dc=example,dc=org"
        bindPassword: "<MUSTEDIT>"
        baseDN: "dc=example,dc=org"
      attributeMapping:
        username: "uid"
        displayName: "cn"
        email: "mail"
      uid:
        # UID/GID acquisition strategy when using LDAP authentication. Options:
        # - "default" / "none": Use default UID=1001, GID=1001 (recommended for most clusters)
        # - "ldap": Read UID/GID directly from specified LDAP attributes
        # - "rid": Parse RID from objectSid/primaryGroupID and compute:
        #   UID = RID(objectSid) + offset, GID = RID(primaryGroupID) + offset
        # - "external": Use a legacy internal UID service (deprecated)
        source: "default"
        # When source is "rid", align with Winbind by computing:
        # - UID from the user's objectSid
        # - GID from the user's primaryGroupID
        rid:
          offset: 10000
          # LDAP attribute that stores the binary SID. For Windows AD this is usually "objectSid".
          sidAttribute: "objectSid"
          # LDAP attribute that stores the primary group RID. For Windows AD this is usually "primaryGroupID".
          pgidAttribute: "primaryGroupID"
        # When source is "ldap", specify which LDAP attributes contain UID/GID
        ldapAttribute:
          uid: "uidNumber"
          gid: "gidNumber"
If LDAP is enabled, the username and displayName entries in the attribute mapping are mandatory along with the server configuration. The platform uses the former as the username and the latter as the display name. If your LDAP does not have a corresponding attribute, you can map it to the same LDAP attribute as username.
The uid.source configuration determines how the platform assigns container UID/GID when a user is auto-registered on first login:
- When set to default / none, the platform always uses 1001:1001 as the container runtime identity. This is the simplest and recommended choice for most NFS or local storage environments.
- When set to ldap, the platform reads UID/GID directly from the LDAP attributes specified by uid.ldapAttribute.uid/gid, suitable for directories that already expose POSIX attributes.
- When set to rid, the platform parses the user's objectSid, extracts the RID, and computes UID/GID based on uid.rid.offset (for example, UID = RID + 10000), which keeps IDs consistent with environments that use Winbind-based ID mapping.
- When set to external, the platform calls a legacy internal UID service to obtain UID/GID. This mode exists only for backward compatibility and is no longer recommended.
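Before enabling LDAP, it can help to verify the bind DN, password, and attribute names outside Crater; a sketch using ldapsearch (requires the ldap-utils package; the test username is hypothetical):
ldapsearch -x -H ldap://ldap.example.com:389 \
  -D "cn=admin,dc=example,dc=org" -w '<MUSTEDIT>' \
  -b "dc=example,dc=org" "(uid=testuser)" uid cn mail uidNumber gidNumber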
For more detailed instructions, please refer to the Configuration Guide document.
4.8 Deployment Command
After confirming all configurations:
helm install crater oci://ghcr.io/raids-lab/crater \
--version 0.1.5 \
-n crater-system \
  -f values.yaml
Verification:
kubectl get pods -n crater-system
kubectl get ingress -n crater-system
Access URL:
http://crater.example.com
This document applies to Crater v0.1.0 and above.