"how numbers are stored and used in computers"
This guide is based on the official Kubernetes deployment instructions for SGLang. Because SGLang is distributed as container images, it is well suited to Kubernetes deployment.
Here is a basic CPU deployment manifest for SGLang, along with a Service to expose it.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sglang-deployment
  labels:
    app: sglang
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sglang
  template:
    metadata:
      labels:
        app: sglang
    spec:
      containers:
        - name: sglang
          image: sglang-cpu:main
          ports:
            - containerPort: 30000
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: huggingface-secret
                  key: token
          resources:
            requests:
              memory: "8Gi"
              cpu: "4"
            limits:
              memory: "16Gi"
              cpu: "8"
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 2Gi
---
apiVersion: v1
kind: Service
metadata:
  name: sglang-service
spec:
  selector:
    app: sglang
  ports:
    - protocol: TCP
      port: 80
      targetPort: 30000
  type: LoadBalancer
```
For GPU workloads, request nvidia.com/gpu resources and pin pods to GPU nodes with a node selector. This assumes the NVIDIA device plugin is installed on the cluster so that the GPU resource is schedulable.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sglang-gpu-deployment
  labels:
    app: sglang-gpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sglang-gpu
  template:
    metadata:
      labels:
        app: sglang-gpu
    spec:
      nodeSelector:
        accelerator: nvidia-tesla-v100
      containers:
        - name: sglang
          image: sglang-gpu:latest
          ports:
            - containerPort: 30000
          env:
            - name: CUDA_VISIBLE_DEVICES
              value: "0"
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: huggingface-secret
                  key: token
          resources:
            requests:
              memory: "16Gi"
              cpu: "8"
              nvidia.com/gpu: 1
            limits:
              memory: "32Gi"
              cpu: "16"
              nvidia.com/gpu: 1
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 8Gi
```
Create a persistent volume claim for model caching to improve startup times.
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: fast-ssd
```
Store your Hugging Face token securely using Kubernetes secrets.
```bash
kubectl create secret generic huggingface-secret \
  --from-literal=token=your_huggingface_token_here
```
For dynamic scaling based on CPU utilization, you can use the following manifest to configure the Horizontal Pod Autoscaler.
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sglang-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sglang-deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```
The time complexity for model loading is O(n) where n is the model size, while inference time complexity varies by model architecture but is typically O(s·d) where s is sequence length and d is model dimension. Space complexity is O(m + c) where m is model parameters and c is context cache size.
Resource Requirements. SGLang can be memory-intensive, especially when loading large models. Plan for adequate memory allocation with a typical requirement of 8-32 GB RAM depending on model size. GPU deployments may require 16-80 GB VRAM for larger models.
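As a rough sizing check, model weights alone occupy about parameters × bytes per parameter: a 7B-parameter model in FP16 (2 bytes per parameter) needs roughly 7 × 10⁹ × 2 ≈ 14 GB for weights, before the KV cache and runtime overhead are added. Size the pod's memory requests (or VRAM, for GPU deployments) with this headroom in mind.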
Model Caching Strategy. Use persistent volumes to cache downloaded models, reducing startup time from minutes to seconds. Consider using a shared filesystem like NFS or distributed storage for multi-pod access to the same models.
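For multi-pod access, a sketch of a shared cache claim is below; it assumes your cluster offers an RWX-capable storage class (the nfs-storage name here is a placeholder):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-model-cache-pvc
spec:
  accessModes:
    - ReadWriteMany              # multiple pods can mount the same cache
  resources:
    requests:
      storage: 100Gi
  storageClassName: nfs-storage  # placeholder: any RWX-capable class (e.g. NFS-backed)
```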
Network Configuration. Configure appropriate ingress controllers and load balancers to handle external traffic. Consider using service mesh technologies like Istio for advanced traffic management and observability.
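As a sketch, a standard Ingress routing external traffic to the Service could look like the following; the hostname and ingress class are placeholders for your environment:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: sglang-ingress
spec:
  ingressClassName: nginx           # assumes an NGINX ingress controller is installed
  rules:
    - host: sglang.example.com      # placeholder hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: sglang-service
                port:
                  number: 80
```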
Monitoring and Observability. Implement comprehensive monitoring using Prometheus and Grafana to track model performance, request latency, and resource utilization. Set up alerts for memory pressure, GPU utilization, and response time degradation.
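If you run the Prometheus Operator, a ServiceMonitor can scrape the pods. This is a sketch only: it assumes the Service carries the app: sglang label, names its port http, and that the SGLang server is started with metrics enabled so it serves /metrics:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: sglang-monitor
spec:
  selector:
    matchLabels:
      app: sglang           # assumes the Service is labeled to match
  endpoints:
    - port: http            # assumes the Service names its port "http"
      path: /metrics        # assumes metrics are enabled on the SGLang server
      interval: 30s
```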
Security. Use Pod Security Standards to enforce security policies, implement network policies to restrict inter-pod communication, and regularly update container images to address security vulnerabilities.
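For example, a NetworkPolicy can limit ingress to the SGLang pods to the serving port only (a minimal sketch; add a from clause to restrict which namespaces or pods may connect):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: sglang-network-policy
spec:
  podSelector:
    matchLabels:
      app: sglang
  policyTypes:
    - Ingress
  ingress:
    - ports:
        - protocol: TCP
          port: 30000       # only the serving port accepts traffic
```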
For simplified Kubernetes deployments, you can package everything as a Helm chart; a starting values.yaml configuration is shown below.
```yaml
# values.yaml
replicaCount: 1

image:
  repository: sglang-cpu
  tag: main
  pullPolicy: IfNotPresent

service:
  type: LoadBalancer
  port: 80
  targetPort: 30000

resources:
  requests:
    memory: "8Gi"
    cpu: "4"
  limits:
    memory: "16Gi"
    cpu: "8"

# GPU configuration
gpu:
  enabled: false
  count: 1
  nodeSelector:
    accelerator: nvidia-tesla-v100

# Model caching
persistence:
  enabled: true
  size: 100Gi
  storageClass: fast-ssd

# Hugging Face configuration
huggingface:
  token: ""  # Set via --set huggingface.token=your_token

# Autoscaling
autoscaling:
  enabled: false
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70

# Monitoring
monitoring:
  enabled: false
  serviceMonitor:
    enabled: false
```
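The chart also needs a Chart.yaml next to values.yaml; a minimal one could look like this (name, description, and versions are placeholders):

```yaml
# Chart.yaml
apiVersion: v2
name: sglang-chart
description: A Helm chart for deploying SGLang
type: application
version: 0.1.0        # chart version (placeholder)
appVersion: "main"    # tracks the image tag used in values.yaml
```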
You can now deploy the chart using Helm.
```bash
# Install the chart
helm install sglang ./sglang-chart \
  --set huggingface.token=your_huggingface_token \
  --set gpu.enabled=true

# Upgrade the deployment
helm upgrade sglang ./sglang-chart \
  --set replicaCount=3 \
  --set autoscaling.enabled=true
```
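After installing, verify the rollout and probe the server. This sketch assumes the SGLang server exposes a /health endpoint on its serving port, as recent releases do:

```bash
# Watch the pods come up
kubectl get pods -w

# Port-forward the Service and check server health locally
kubectl port-forward svc/sglang-service 8080:80 &
curl http://localhost:8080/health
```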