"how numbers are stored and used in computers"
This guide is based on the official Kubernetes deployment instructions for SGLang. Because SGLang is distributed as container images, it is well suited to Kubernetes deployment.
Here is a basic CPU deployment manifest for SGLang, along with a Service to expose it.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sglang-deployment
  labels:
    app: sglang
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sglang
  template:
    metadata:
      labels:
        app: sglang
    spec:
      containers:
        - name: sglang
          image: sglang-cpu:main
          ports:
            - containerPort: 30000
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: huggingface-secret
                  key: token
          resources:
            requests:
              memory: "8Gi"
              cpu: "4"
            limits:
              memory: "16Gi"
              cpu: "8"
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 2Gi
---
apiVersion: v1
kind: Service
metadata:
  name: sglang-service
spec:
  selector:
    app: sglang
  ports:
    - protocol: TCP
      port: 80
      targetPort: 30000
  type: LoadBalancer
```
For GPU workloads, request nvidia.com/gpu resources and pin pods to GPU nodes with a node selector. This assumes the NVIDIA device plugin is installed on the cluster so that the GPU resource is schedulable.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sglang-gpu-deployment
  labels:
    app: sglang-gpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sglang-gpu
  template:
    metadata:
      labels:
        app: sglang-gpu
    spec:
      nodeSelector:
        accelerator: nvidia-tesla-v100
      containers:
        - name: sglang
          image: sglang-gpu:latest
          ports:
            - containerPort: 30000
          env:
            - name: CUDA_VISIBLE_DEVICES
              value: "0"
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: huggingface-secret
                  key: token
          resources:
            requests:
              memory: "16Gi"
              cpu: "8"
              nvidia.com/gpu: 1
            limits:
              memory: "32Gi"
              cpu: "16"
              nvidia.com/gpu: 1
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 8Gi
```
Create a persistent volume claim for model caching to improve startup times.
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: fast-ssd
```
Store your Hugging Face token securely using Kubernetes secrets.
```bash
kubectl create secret generic huggingface-secret \
  --from-literal=token=your_huggingface_token_here
```
For dynamic scaling based on CPU utilization, you can use the following manifest to configure the Horizontal Pod Autoscaler.
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sglang-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sglang-deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```
The time complexity for model loading is O(n) where n is the model size, while inference time complexity varies by model architecture but is typically O(s·d) where s is sequence length and d is model dimension. Space complexity is O(m + c) where m is model parameters and c is context cache size.
Resource Requirements. SGLang can be memory-intensive, especially when loading large models. Plan for adequate memory allocation with a typical requirement of 8-32 GB RAM depending on model size. GPU deployments may require 16-80 GB VRAM for larger models.
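As a rough sizing check, model weights alone occupy about parameters × bytes per parameter: a 7B-parameter model in FP16 (2 bytes per parameter) needs roughly 7 × 10⁹ × 2 ≈ 14 GB for weights, before the KV cache and runtime overhead are added. Size the pod's memory requests (or VRAM, for GPU deployments) with this headroom in mind.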
Model Caching Strategy. Use persistent volumes to cache downloaded models, reducing startup time from minutes to seconds. Consider using a shared filesystem like NFS or distributed storage for multi-pod access to the same models.
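For multi-pod access, a sketch of a shared cache claim is below; it assumes your cluster offers an RWX-capable storage class (the nfs-storage name here is a placeholder):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-model-cache-pvc
spec:
  accessModes:
    - ReadWriteMany              # multiple pods can mount the same cache
  resources:
    requests:
      storage: 100Gi
  storageClassName: nfs-storage  # placeholder: any RWX-capable class (e.g. NFS-backed)
```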
Network Configuration. Configure appropriate ingress controllers and load balancers to handle external traffic. Consider using service mesh technologies like Istio for advanced traffic management and observability.
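As a sketch, a standard Ingress routing external traffic to the Service could look like the following; the hostname and ingress class are placeholders for your environment:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: sglang-ingress
spec:
  ingressClassName: nginx           # assumes an NGINX ingress controller is installed
  rules:
    - host: sglang.example.com      # placeholder hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: sglang-service
                port:
                  number: 80
```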
Monitoring and Observability. Implement comprehensive monitoring using Prometheus and Grafana to track model performance, request latency, and resource utilization. Set up alerts for memory pressure, GPU utilization, and response time degradation.
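If you run the Prometheus Operator, a ServiceMonitor can scrape the pods. This is a sketch only: it assumes the Service carries the app: sglang label, names its port http, and that the SGLang server is started with metrics enabled so it serves /metrics:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: sglang-monitor
spec:
  selector:
    matchLabels:
      app: sglang           # assumes the Service is labeled to match
  endpoints:
    - port: http            # assumes the Service names its port "http"
      path: /metrics        # assumes metrics are enabled on the SGLang server
      interval: 30s
```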
Security. Use Pod Security Standards to enforce security policies, implement network policies to restrict inter-pod communication, and regularly update container images to address security vulnerabilities.
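For example, a NetworkPolicy can limit ingress to the SGLang pods to the serving port only (a minimal sketch; add a from clause to restrict which namespaces or pods may connect):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: sglang-network-policy
spec:
  podSelector:
    matchLabels:
      app: sglang
  policyTypes:
    - Ingress
  ingress:
    - ports:
        - protocol: TCP
          port: 30000       # only the serving port accepts traffic
```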
For simplified Kubernetes deployments, you can package everything as a Helm chart; a starting values.yaml configuration is shown below.
```yaml
# values.yaml
replicaCount: 1

image:
  repository: sglang-cpu
  tag: main
  pullPolicy: IfNotPresent

service:
  type: LoadBalancer
  port: 80
  targetPort: 30000

resources:
  requests:
    memory: "8Gi"
    cpu: "4"
  limits:
    memory: "16Gi"
    cpu: "8"

# GPU configuration
gpu:
  enabled: false
  count: 1
  nodeSelector:
    accelerator: nvidia-tesla-v100

# Model caching
persistence:
  enabled: true
  size: 100Gi
  storageClass: fast-ssd

# Hugging Face configuration
huggingface:
  token: ""  # Set via --set huggingface.token=your_token

# Autoscaling
autoscaling:
  enabled: false
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70

# Monitoring
monitoring:
  enabled: false
  serviceMonitor:
    enabled: false
```
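The chart also needs a Chart.yaml next to values.yaml; a minimal one could look like this (name, description, and versions are placeholders):

```yaml
# Chart.yaml
apiVersion: v2
name: sglang-chart
description: A Helm chart for deploying SGLang
type: application
version: 0.1.0        # chart version (placeholder)
appVersion: "main"    # tracks the image tag used in values.yaml
```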
You can now deploy the chart using Helm.
```bash
# Install the chart
helm install sglang ./sglang-chart \
  --set huggingface.token=your_huggingface_token \
  --set gpu.enabled=true

# Upgrade the deployment
helm upgrade sglang ./sglang-chart \
  --set replicaCount=3 \
  --set autoscaling.enabled=true
```
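After installing, verify the rollout and probe the server. This sketch assumes the SGLang server exposes a /health endpoint on its serving port, as recent releases do:

```bash
# Watch the pods come up
kubectl get pods -w

# Port-forward the Service and check server health locally
kubectl port-forward svc/sglang-service 8080:80 &
curl http://localhost:8080/health
```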