Crater Cluster Deployment Guide

This guide outlines the standardized deployment process for the Crater platform on a production-grade Kubernetes cluster, covering environment preparation, dependency installation, core service deployment, and common troubleshooting.

1. Environment and Prerequisites

Item                    Description
Operating System        Ubuntu 22.04
Kubernetes              v1.31.x
Container Runtime       containerd 1.7.x
Helm                    v3.x
Node Configuration      1 control node, ≥ 2 worker nodes
Network Requirements    Nodes must be able to reach each other over the internal network; at least one node must have external network access or a configured proxy

Example:

kubectl get nodes -o wide

Sample Output:

NAME       STATUS   ROLES           VERSION   INTERNAL-IP     OS-IMAGE
node-1     Ready    control-plane   v1.31.2   <your-node-ip>  Ubuntu 22.04.5 LTS
node-2     Ready    <none>          v1.31.2   <your-node-ip>  Ubuntu 22.04.5 LTS
node-3     Ready    <none>          v1.31.2   <your-node-ip>  Ubuntu 22.04.5 LTS

2. Installation of Cluster Dependencies

The Crater platform relies on the following foundational components for scheduling, monitoring, networking, and storage.
This section lists the installation commands and image information for each component.

2.1 Metrics Server

Function: Provides CPU and memory metrics for nodes and pods, enabling HPA-based auto-scaling.

helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server
helm repo update
helm pull metrics-server/metrics-server --untar --destination <your-path>/charts

helm install metrics-server <your-path>/charts/metrics-server \
  -n kube-system --create-namespace

Images:

registry.k8s.io/metrics-server/metrics-server:v0.8.0
# For users in China, replace with:
swr.cn-north-4.myhuaweicloud.com/ddn-k8s/registry.k8s.io/metrics-server/metrics-server:v0.8.0

Verification:

kubectl get pods -n kube-system | grep metrics-server
kubectl top nodes

2.2 NVIDIA GPU Operator

Function: Installs GPU drivers, device plugins, and monitoring components.

helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
helm pull nvidia/gpu-operator --untar --destination <your-path>/charts

helm install gpu-operator <your-path>/charts/gpu-operator \
  -n gpu-operator --create-namespace

Key Images:

Component                 Image
Driver Container          nvcr.io/nvidia/driver:525.125.06
Device Plugin             nvcr.io/nvidia/k8s-device-plugin:v0.15.0
DCGM Exporter             nvcr.io/nvidia/dcgm-exporter:3.1.6-3.1.3-ubuntu22.04
MIG Manager               nvcr.io/nvidia/mig-manager:0.6.0
Node Feature Discovery    ghcr.io/kubernetes-sigs/node-feature-discovery:v0.16.1

Verification:

kubectl get pods -n gpu-operator
nvidia-smi
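Beyond checking pod status, GPU scheduling can be verified end to end with a throwaway test pod (a minimal sketch; the pod name and CUDA image tag are examples and may need adjusting for your environment):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # Request a single GPU from the device plugin
```

After the pod completes, kubectl logs gpu-smoke-test should print the familiar nvidia-smi table; delete the pod afterwards with kubectl delete pod gpu-smoke-test.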

2.3 CloudNativePG (PostgreSQL Operator)

Function: Provides high-availability PostgreSQL database services.

helm repo add cnpg https://cloudnative-pg.github.io/charts
helm repo update
helm pull cnpg/cloudnative-pg --untar --destination <your-path>/charts

helm install cnpg <your-path>/charts/cloudnative-pg \
  -n cnpg-system --create-namespace

Images:

ghcr.io/cloudnative-pg/postgresql:16.3
ghcr.io/cloudnative-pg/cloudnative-pg:1.24.0

Create a sample database cluster:

cat <<EOF | kubectl apply -f -
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: crater-postgresql
  namespace: cnpg-system
spec:
  instances: 3
  storage:
    size: 10Gi
EOF
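CloudNativePG generates application credentials automatically and stores them in a Secret, by default named &lt;cluster&gt;-app. They can be read back for the Crater database configuration in section 4.3 (a sketch assuming the default secret name):

```shell
# Confirm the cluster is healthy
kubectl get cluster crater-postgresql -n cnpg-system

# Read the auto-generated application password
kubectl get secret crater-postgresql-app -n cnpg-system \
  -o jsonpath='{.data.password}' | base64 -d; echo
```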

2.4 NFS Storage System

Function: Provides shared storage (ReadWriteMany mode) for user workspaces and public data.

2.4.1 Install NFS Server (Optional)

Run on a node with persistent disk:

sudo apt update
sudo apt install -y nfs-kernel-server
sudo mkdir -p /data/nfs
sudo chown -R nobody:nogroup /data/nfs
sudo chmod 777 /data/nfs

echo "/data/nfs *(rw,sync,no_subtree_check,no_root_squash)" | sudo tee -a /etc/exports
sudo exportfs -a
sudo systemctl enable --now nfs-server

Verification:

showmount -e <your-node-ip>
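The export can also be tested from a client node before wiring it into Kubernetes (a sketch; the mount point is an example):

```shell
# On any other node: install the NFS client and try a test mount
sudo apt install -y nfs-common
sudo mkdir -p /mnt/nfs-test
sudo mount -t nfs <your-node-ip>:/data/nfs /mnt/nfs-test

# Write a file to confirm read-write access, then clean up
touch /mnt/nfs-test/ok && ls -l /mnt/nfs-test
sudo umount /mnt/nfs-test
```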

2.4.2 Create NFS StorageClass

Run in the Kubernetes cluster. The provisioner value must match the name the deployed provisioner registers with; if you instead let the Helm chart in the next step create the StorageClass (it does so when storageClass.name is set), this manual step can be skipped:

cat <<EOF | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs
provisioner: nfs-provisioner
parameters:
  archiveOnDelete: "false"
reclaimPolicy: Retain
volumeBindingMode: Immediate
EOF

2.4.3 Deploy NFS Provisioner

helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner
helm repo update

helm install nfs-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
  -n nfs-system --create-namespace \
  --set nfs.server=<your-node-ip> \
  --set nfs.path=/data/nfs \
  --set storageClass.name=nfs \
  --set storageClass.defaultClass=true

Verification:

kubectl get pods -n nfs-system
kubectl get sc
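Dynamic provisioning can be checked end to end with a small test claim (a sketch using the StorageClass name created above):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-test-pvc
spec:
  accessModes:
    - ReadWriteMany   # The mode Crater relies on for shared directories
  storageClassName: nfs
  resources:
    requests:
      storage: 1Gi
```

kubectl get pvc nfs-test-pvc should report Bound within a few seconds; delete the claim afterwards with kubectl delete pvc nfs-test-pvc.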

2.5 Prometheus Stack

Function: Integrates Prometheus, Grafana, Alertmanager, and other monitoring components.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm pull prometheus-community/kube-prometheus-stack --untar --destination <your-path>/charts

helm install prometheus <your-path>/charts/kube-prometheus-stack \
  -n monitoring --create-namespace

Key Images:

Component             Image
Prometheus            quay.io/prometheus/prometheus:v2.54.1
Grafana               docker.io/grafana/grafana:10.4.1
Alertmanager          quay.io/prometheus/alertmanager:v0.27.0
Node Exporter         quay.io/prometheus/node-exporter:v1.8.1
Kube-State-Metrics    registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.11.0

Access Grafana:

kubectl port-forward svc/prometheus-grafana 3000:80 -n monitoring
# Access in browser: http://localhost:3000
# Default credentials: admin / prom-operator

2.6 Volcano Scheduler

Function: Provides GPU job scheduling, queue management, and task preemption.

helm repo add volcano https://volcano-sh.github.io/helm-charts
helm repo update
helm pull volcano/volcano --untar --destination <your-path>/charts

helm install volcano <your-path>/charts/volcano \
  -n volcano-system --create-namespace \
  -f <your-path>/volcano/values.yaml

Key Images:

Component     Image
Scheduler     volcano.sh/volcano-scheduler:v1.9.0
Controller    volcano.sh/volcano-controllers:v1.9.0
Admission     volcano.sh/volcano-admission:v1.9.0
Webhook       volcano.sh/volcano-webhook:v1.9.0

Verification:

kubectl get pods -n volcano-system
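Volcano organizes workloads into queues, which Crater submits jobs to. A minimal queue definition looks like the following (a sketch; the name and weight should follow your own capacity plan):

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: default
spec:
  weight: 1          # Relative share of cluster resources among queues
  reclaimable: true  # Allow borrowed resources to be reclaimed under pressure
```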

2.7 MetalLB (Bare-Metal Load Balancer)

Function: Provides LoadBalancer IP support for bare-metal clusters.

helm repo add metallb https://metallb.github.io/metallb
helm repo update
helm pull metallb/metallb --untar --destination <your-path>/charts

helm install metallb <your-path>/charts/metallb \
  -n metallb-system --create-namespace

Key Images:

Component     Image
Controller    quay.io/metallb/controller:v0.14.8
Speaker       quay.io/metallb/speaker:v0.14.8

Example IP address pool configuration:

cat <<EOF | kubectl apply -f -
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default-address-pool
  namespace: metallb-system
spec:
  addresses:
    - <your-ip-range>  # Example: 192.168.1.200-192.168.1.220
EOF
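In Layer 2 mode, MetalLB also needs an L2Advertisement that announces the pool; without one, assigned IPs are never advertised on the local network:

```yaml
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default
  namespace: metallb-system
spec:
  ipAddressPools:
    - default-address-pool  # Must match the IPAddressPool name above
```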

2.8 ingress-nginx (Ingress Controller)

Function: Provides external access points for Crater frontend and API.

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update

helm install ingress-nginx ingress-nginx/ingress-nginx \
  -n ingress-nginx --create-namespace \
  --version 4.11.3 \
  --set controller.hostNetwork=true \
  --set controller.dnsPolicy=ClusterFirstWithHostNet \
  --set controller.healthCheckHost="<your-node-ip>" \
  --set 'controller.nodeSelector.kubernetes\.io/hostname=node-2'

Key Images:

Component            Image
Controller           registry.k8s.io/ingress-nginx/controller:v1.9.6
Admission Webhook    registry.k8s.io/ingress-nginx/kube-webhook-certgen:v1.6.0

Verification:

kubectl get pods -n ingress-nginx -o wide
kubectl get svc -n ingress-nginx -o wide
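With hostNetwork enabled, the controller listens directly on the selected node. Before any Ingress resources exist, a quick request should return a 404 from nginx's default backend, confirming the controller is serving (the exact response depends on your setup):

```shell
curl -sI http://<your-node-ip>/ | head -n 1
# Expect an HTTP 404 from the default backend while no Ingress matches
```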

3. Harbor Registry Deployment

helm repo add harbor https://helm.goharbor.io
helm repo update
helm pull harbor/harbor --version 1.16.0 --untar

Example configuration file (values.yaml):

expose:
  type: nodePort
  tls:
    enabled: false
  nodePort:
    ports:
      http:
        port: 30002
externalURL: http://<your-node-ip>:30002
harborAdminPassword: "<MUSTEDIT>"
persistence:
  enabled: true
  persistentVolumeClaim:
    registry:
      size: 50Gi

Installation command:

helm install harbor <your-path>/charts/harbor \
  -n harbor-system --create-namespace \
  -f <your-path>/harbor/values.yaml

Access URL:

http://<your-node-ip>:30002
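Because Harbor is exposed over plain HTTP in this configuration, container runtimes must trust it as an insecure registry before pushes will succeed. A sketch for Docker (note: this overwrites any existing daemon.json, so merge by hand if you already have one):

```shell
sudo tee /etc/docker/daemon.json <<EOF
{
  "insecure-registries": ["<your-node-ip>:30002"]
}
EOF
sudo systemctl restart docker

# Log in with the admin password set in values.yaml
docker login <your-node-ip>:30002 -u admin
```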

4. Crater Platform Deployment

Pull the Helm Chart:

Note: this version number is the Helm chart version from Chart.yaml and must match the Crater component image versions for the system to function correctly (current version: 0.1.5).

helm pull oci://ghcr.io/raids-lab/crater --version 0.1.5 --untar

The core configuration file for the Crater platform is values.yaml.
This file defines cluster domain name, database connection, monitoring service addresses, storage PVCs, and connection parameters for external dependencies such as Harbor.

Before running helm install, update the relevant fields according to the following instructions.


4.1 Basic Information

# Platform access domain name
host: crater.example.com

# Protocol type: "http" or "https"
protocol: http

# Initial admin account
firstUser:
  username: crater-admin
  password: <MUSTEDIT>

If Ingress and DNS are configured, use a real domain name (e.g., crater.mycluster.local).
For testing environments only, use the control node IP.


4.2 Storage Configuration (NFS)

Since an NFS shared storage system has been deployed, the storage section should reference the corresponding StorageClass:

storage:
  create: true
  request: 10Gi
  storageClass: "nfs"          # Use the NFS StorageClass created earlier
  pvcName: "crater-rw-storage" # Name of the shared PVC that Crater mounts

The Crater backend will automatically mount this PVC for user space and public directories (users/, accounts/, public/).


4.3 PostgreSQL Database Configuration (CloudNativePG)

Crater uses a database cluster deployed via CloudNativePG.
Set the database connection parameters to point to the corresponding service:

backendConfig:
  postgres:
    host: crater-postgresql-rw.cnpg-system.svc.cluster.local  # Read-write service of the CloudNativePG cluster
    port: 5432
    dbname: postgres
    user: postgres
    password: <MUSTEDIT>
    sslmode: disable
    TimeZone: Asia/Shanghai

Note:

  • The host can be found using kubectl get svc -n cnpg-system.

  • If the default CloudNativePG cluster name is crater-postgresql, the service name should be:
    crater-postgresql-rw.cnpg-system.svc.cluster.local.


4.4 Monitoring System Configuration (Prometheus Stack)

Crater backend retrieves GPU and job metrics via the Prometheus API.
Set backendConfig.prometheusAPI to the Prometheus service address in kube-prometheus-stack:

backendConfig:
  prometheusAPI: http://prometheus-kube-prometheus-prometheus.monitoring:9090

Retrieve this address with:

kubectl get svc -n monitoring | grep prometheus
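The address can be sanity-checked before deployment with a port-forward and a test query (a sketch assuming the default kube-prometheus-stack service name):

```shell
kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring &
sleep 2
# A JSON response with "status":"success" confirms the API is reachable
curl -s 'http://localhost:9090/api/v1/query?query=up' | head -c 200; echo
```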

Grafana integration example:

grafanaProxy:
  enable: true
  address: http://prometheus-grafana.monitoring  # Grafana service name
  token: <MASKED>                                # Read-only Grafana API token
  host: gpu-grafana.example.com                  # External access domain name

4.5 Harbor Registry Configuration

If Harbor is already deployed in the cluster, enable the Registry integration.
After enabling, Crater can automatically push built images to the Harbor registry.

backendConfig:
  registry:
    enable: true
    harbor:
      server: harbor.example.com      # Harbor access domain name
      user: admin                     # Admin username
      password: <MUSTEDIT>            # Admin password
    buildTools:
      proxyConfig:
        httpProxy: null
        httpsProxy: null
        noProxy: null

If Harbor is not yet enabled, keep enable: false.


4.6 Ingress and TLS Configuration (ingress-nginx + cert-manager)

Crater exposes services via Ingress by default.
If ingress-nginx is enabled and certificates are prepared, specify them in backendConfig.secrets:

backendConfig:
  secrets:
    tlsSecretName: crater-tls-secret
    tlsForwardSecretName: crater-tls-forward-secret
    imagePullSecretName: ""

Create the certificate with:

kubectl create secret tls crater-tls-secret \
  --cert=tls.crt --key=tls.key -n crater-system
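If cert-manager is installed, the TLS secret can be issued automatically instead of being created by hand. A minimal self-signed sketch (resource names are examples; for production you would typically use an ACME ClusterIssuer):

```yaml
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: crater-selfsigned
  namespace: crater-system
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: crater-tls
  namespace: crater-system
spec:
  secretName: crater-tls-secret  # Matches tlsSecretName above
  dnsNames:
    - crater.example.com
  issuerRef:
    name: crater-selfsigned
    kind: Issuer
```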

If HTTPS is not enabled, keep the default values; the protocol remains HTTP.


4.7 LDAP Configuration

If your organization already has an existing LDAP directory service (such as OpenLDAP or Active Directory), you can configure Crater to integrate with this service for unified identity authentication.

If LDAP is enabled, the system will automatically complete account registration upon the user's first login and continuously synchronize information such as nickname and email from LDAP.

If you do not have such requirements, you can directly set auth.ldap.enable to false, and other LDAP configuration items can be left at their default values.

backendConfig:
  auth:
    ldap:
      enable: true
      # Short display name for the LDAP login method in the UI, e.g. "ACT"
      # The UI will append suffixes like "Login" or "Unified Identity", so keep it brief.
      alias: "ACT"
      # Help text shown when hovering over the LDAP login option
      help: "Please use your centralized identity account to sign in. The system will automatically synchronize your profile from LDAP."
      server:
        address: "ldap://ldap.example.com:389"
        bindDN: "cn=admin,dc=example,dc=org"
        bindPassword: "<MUSTEDIT>"
        baseDN: "dc=example,dc=org"
      attributeMapping:
        username: "uid"
        displayName: "cn"
        email: "mail"
      uid:
        # UID/GID acquisition strategy when using LDAP authentication. Options:
        # - "default" / "none": Use default UID=1001, GID=1001 (recommended for most clusters)
        # - "ldap": Read UID/GID directly from specified LDAP attributes
        # - "rid": Parse RID from objectSid/primaryGroupID and compute:
        #          UID = RID(objectSid) + offset, GID = RID(primaryGroupID) + offset
        # - "external": Use a legacy internal UID service (deprecated)
        source: "default"
        # When source is "rid", align with Winbind by computing:
        # - UID from the user's objectSid
        # - GID from the user's primaryGroupID
        rid:
          offset: 10000
          # LDAP attribute that stores the binary SID. For Windows AD this is usually "objectSid".
          sidAttribute: "objectSid"
          # LDAP attribute that stores the primary group RID. For Windows AD this is usually "primaryGroupID".
          pgidAttribute: "primaryGroupID"
        # When source is "ldap", specify which LDAP attributes contain UID/GID
        ldapAttribute:
          uid: "uidNumber"
          gid: "gidNumber"

If LDAP is enabled, the username and displayName attribute mappings are mandatory, along with the server configuration: the platform uses the former as the username and the latter as the display name. If your LDAP directory lacks a suitable display-name attribute, map displayName to the same attribute as username.

The uid.source configuration determines how the platform assigns container UID/GID when a user is auto-registered on first login:

  • When set to default / none, the platform always uses 1001:1001 as the container runtime identity. This is the simplest and recommended choice for most NFS or local storage environments.
  • When set to ldap, the platform reads UID/GID directly from the LDAP attributes specified by uid.ldapAttribute.uid/gid, suitable for directories that already expose POSIX attributes.
  • When set to rid, the platform parses the user's objectSid, extracts the RID, and computes UID/GID based on uid.rid.offset (for example, UID = RID + 10000), which keeps IDs consistent with environments that use Winbind-based ID mapping.
  • When set to external, the platform calls a legacy internal UID service to obtain UID/GID. This mode exists only for backward compatibility and is no longer recommended.
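The rid arithmetic can be illustrated with a short standalone sketch (Python, purely illustrative and not part of the platform). A binary SID is laid out as revision (1 byte), sub-authority count (1 byte), a 48-bit big-endian identifier authority, then 32-bit little-endian sub-authorities; the RID is the last sub-authority:

```python
import struct

def rid_from_sid(sid: bytes) -> int:
    """Extract the RID (last sub-authority) from a binary objectSid."""
    revision, subauth_count = sid[0], sid[1]
    assert revision == 1, "unexpected SID revision"
    # Header is 1 + 1 + 6 bytes, then subauth_count little-endian uint32s
    subauths = struct.unpack_from(f"<{subauth_count}I", sid, 8)
    return subauths[-1]

def uid_from_sid(sid: bytes, offset: int = 10000) -> int:
    """UID = RID(objectSid) + offset, matching the Winbind-style mapping."""
    return rid_from_sid(sid) + offset

# Example: S-1-5-21-1-2-3-1104 encoded as binary
sid = bytes([1, 5]) + (5).to_bytes(6, "big") + struct.pack("<5I", 21, 1, 2, 3, 1104)
print(uid_from_sid(sid))  # -> 11104
```

GID is computed the same way from the RID stored in the primaryGroupID attribute.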

For more detailed instructions, please refer to the Configuration Guide document.


4.8 Deployment Command

After confirming all configurations:

helm install crater oci://ghcr.io/raids-lab/crater \
  --version 0.1.5 \
  -n crater-system \
  -f values.yaml

Verification:

kubectl get pods -n crater-system
kubectl get ingress -n crater-system

Access URL:

http://crater.example.com

This document applies to Crater v0.1.0 and above.
