A complete enterprise-grade Ollama deployment solution for Kubernetes—covering namespace isolation, persistent storage, Deployment setup, and service exposure. This guide includes pre-deployment checks, step-by-step operations, validation methods, and production optimization tips (GPU acceleration, multi-replica scaling) to help developers and DevOps teams quickly launch stable, scalable Ollama services on K8s.

 
As lightweight LLMs gain traction, Ollama has become the top choice for SMBs to deploy AI capabilities, thanks to its one-click setup and low barrier to use. However, single-machine Ollama deployments face production challenges: model loss, uncontrollable resource usage, and scaling difficulties. Kubernetes solves these pain points with its container orchestration capabilities. This guide provides a full-stack Ollama deployment plan for K8s, supporting use cases from testing to production.
Ensure your environment meets these requirements to avoid configuration issues:
K8s Cluster: 1+ node cluster (recommended v1.24+, supports Containerd runtime). Verify with kubectl get nodes (nodes must be in Ready state).
Persistent Storage: The cluster must support persistent storage (e.g., a default StorageClass, NFS, or Local Path) to save Ollama models; without persistence, models are lost whenever the Pod restarts.
Resource Reservation:
7B model: Minimum 4 CPU cores + 8GB memory
13B model: Minimum 8 CPU cores + 16GB memory
GPU acceleration: Pre-install NVIDIA device plugin (nvidia-device-plugin).
Tools: Local kubectl installed with cluster access. Verify with kubectl cluster-info.
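A quick pre-flight check, assuming kubectl already points at the target cluster (the NVIDIA device-plugin check applies only to GPU clusters and assumes the plugin was installed into kube-system, the usual but not the only location):
# Nodes must be Ready
kubectl get nodes
# At least one StorageClass must exist (ideally a default one)
kubectl get storageclass
# Control plane reachable
kubectl cluster-info
# GPU clusters only: confirm the NVIDIA device plugin DaemonSet is running
kubectl get daemonset -n kube-system | grep -i nvidia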
Ollama’s K8s deployment requires 4 key components: namespace isolation, persistent storage (PVC), Deployment, and Service exposure. Below are production-ready configurations with English comments—modify as needed.
Create a dedicated namespace to separate Ollama resources from other services:
apiVersion: v1
kind: Namespace
metadata:
  name: ai-services  # All Ollama resources reside here
  labels:
    app: ollama      # Unified label for resource filtering
Ollama stores models in /root/.ollama by default. Use PVC to persist models across Pod restarts:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models-pvc
  namespace: ai-services
spec:
  accessModes:
  - ReadWriteOnce  # Use ReadWriteMany for multi-replica (requires storage class support like NFS)
  resources:
    requests:
      storage: 50Gi  # 7B model ≈4GB, 13B ≈13GB; reserve 20% redundancy for multiple models
  storageClassName: "standard"  # Replace with your cluster's storage class (e.g., aws-ebs, gcp-pd, nfs-client)
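For the multi-replica case noted in the accessModes comment, a ReadWriteMany claim might look like the sketch below; it assumes an NFS-backed StorageClass named nfs-client exists in your cluster (the name is illustrative):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models-pvc
  namespace: ai-services
spec:
  accessModes:
  - ReadWriteMany            # Requires NFS, AWS EFS, GCP Filestore, or similar
  resources:
    requests:
      storage: 50Gi
  storageClassName: "nfs-client"  # Illustrative NFS provisioner class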
The Deployment manages the Ollama container lifecycle (resource limits, health checks, environment variables):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ai-services
  labels:
    app: ollama
spec:
  replicas: 1  # Enable multi-replica only after solving storage sharing (e.g., NFS)
  selector:
    matchLabels:
      app: ollama
  strategy:
    type: Recreate  # Prevent model file conflicts in multi-replica scenarios
  template:
    metadata:
      labels:
        app: ollama
    spec:
      securityContext:       # Pod-level: fsGroup must be set here, not on the container
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000        # Allow write access to the model directory
      containers:
      - name: ollama
        image: ollama/ollama:latest  # Use a specific version in production (e.g., ollama/ollama:0.1.48)
        # For GPU acceleration: request nvidia.com/gpu in resources.limits (requires the NVIDIA device plugin)
        ports:
        - containerPort: 11434  # Default Ollama API port
        env:
        - name: OLLAMA_HOST
          value: "0.0.0.0"  # Allow external access (default: localhost only)
        - name: OLLAMA_MAX_LOADED_MODELS
          value: "2"  # Adjust based on available memory (default: 3)
        resources:
          requests:  # Minimum resources for scheduling
            cpu: "4"
            memory: "12Gi"  # 7B model: 8GB+, 13B model: 16GB+
          limits:    # Prevent resource contention
            cpu: "8"
            memory: "16Gi"
            # GPU configuration: uncomment on GPU nodes
            # nvidia.com/gpu: 1
        volumeMounts:
        - name: model-storage
          mountPath: /root/.ollama  # Default Ollama model directory (do not modify)
          subPath: ollama           # Isolate a directory when using shared storage
        - name: cache-volume
          mountPath: /root/.cache/ollama  # Temporary cache for faster model loading
        livenessProbe:   # Restart the Pod if the service fails
          httpGet:
            path: /
            port: 11434
          initialDelaySeconds: 120  # Longer delay to allow for model loading
          periodSeconds: 20
          timeoutSeconds: 5
        readinessProbe:  # Remove the Pod from the Service while unready
          httpGet:
            path: /
            port: 11434
          initialDelaySeconds: 60
          periodSeconds: 10
      affinity:
        # Prefer scheduling onto GPU nodes (if available)
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: nvidia.com/gpu.present
                operator: In
                values:
                - "true"
      tolerations:
      # Tolerate GPU node taints
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: ollama-models-pvc
      - name: cache-volume
        emptyDir:
          medium: Memory
          sizeLimit: 2Gi  # In-memory cache for faster inference
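On GPU nodes, the container's resources block could be adjusted as in the sketch below; it assumes the NVIDIA device plugin is installed so that nodes advertise nvidia.com/gpu:
        resources:
          requests:
            cpu: "4"
            memory: "12Gi"
          limits:
            cpu: "8"
            memory: "16Gi"
            nvidia.com/gpu: 1  # One full GPU per replica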
The Service puts a stable access address in front of the Pod's temporary IP:
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
  namespace: ai-services
  labels:
    app: ollama
spec:
  selector:
    app: ollama
  ports:
  - port: 11434
    targetPort: 11434
    protocol: TCP
    name: http
  type: ClusterIP  # Modify for external access:
                   # - NodePort: add nodePort: 30080 (range: 30000-32767)
                   # - LoadBalancer: for cloud environments (AWS ELB, GCP LB)
Save all four manifests to ollama-k8s-deploy.yaml (separated by --- lines) and run:
kubectl apply -f ollama-k8s-deploy.yaml
Verify resource status (all components must be healthy):
# Check namespace
kubectl get ns | grep ai-services
# Check PVC (STATUS: Bound)
kubectl get pvc -n ai-services
# Check Deployment (READY: 1/1)
kubectl get deployment -n ai-services
# Check Pod (STATUS: Running, RESTARTS: 0)
kubectl get pods -n ai-services
Troubleshooting: If a Pod stays Pending, run kubectl describe pod <pod-name> -n ai-services to check for unbound storage or insufficient resources.
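For Pending or crash-looping Pods, these commands (the Pod name is a placeholder) usually surface the cause, such as an unbound PVC or insufficient CPU/memory:
kubectl describe pod <pod-name> -n ai-services
kubectl get events -n ai-services --sort-by=.lastTimestamp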
Once the Pod is running, pull your target model (e.g., Llama3-8B):
# Enter the Ollama container (replace <pod-name> with actual Pod name)
kubectl exec -it -n ai-services <pod-name> -- /bin/sh
# Pull model (e.g., Llama3-8B)
ollama pull llama3:8b
Check pull progress with:
kubectl logs -f <pod-name> -n ai-services
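The same pull can also be triggered without an interactive shell, which is easier to script (the Pod name remains a placeholder):
kubectl exec -n ai-services <pod-name> -- ollama pull llama3:8b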
To verify in-cluster access, start a temporary test Pod:
# Run a test Pod
kubectl run -it busybox --image=busybox:1.35 -- /bin/sh
# Test model list API
wget -qO- http://ollama-service.ai-services:11434/api/tags
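The test Pod created by kubectl run is not removed automatically; delete it once the check passes:
kubectl delete pod busybox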
Forward the Service port to your local machine:
kubectl port-forward -n ai-services service/ollama-service 11434:11434
Test the chat API with curl:
curl http://localhost:11434/api/chat -d '{
  "model": "llama3:8b",
  "messages": [{"role": "user", "content": "What is Kubernetes?"}]
}'
A successful response (JSON format) indicates the Ollama service is operational.
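By default the chat endpoint streams the answer as a sequence of JSON chunks; to receive one complete JSON object instead, set the stream flag in the request body:
curl http://localhost:11434/api/chat -d '{
  "model": "llama3:8b",
  "stream": false,
  "messages": [{"role": "user", "content": "What is Kubernetes?"}]
}'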
Multi-replica scenarios: Use ReadWriteMany-supported storage (NFS, AWS EFS, GCP Filestore).
Performance: Use SSD for faster model loading (HDD may cause timeouts for 13B+ models).
CPU/Memory:
7B model: 4-8 cores + 8-12GB memory
13B model: 8-16 cores + 16-32GB memory
34B model: 16+ cores + 64GB+ memory
GPU Acceleration: Run Ollama on GPU nodes with the NVIDIA device plugin and request nvidia.com/gpu in the Deployment (3-5x faster model loading, ~50% lower latency).
Monitoring: Use Prometheus + Grafana to track model status, API throughput, and latency (e.g., via a metrics exporter or gateway-level instrumentation in front of the Ollama API).
Log Collection: Forward container logs to ELK or Loki for troubleshooting.
High Availability: Add PodDisruptionBudget (PDB) to avoid service downtime during maintenance.
RBAC Permissions: Restrict access to the ai-services namespace to authorized users.
Network Isolation: Use NetworkPolicy to allow access only from trusted services (a sketch follows this list).
Image Security: Store Ollama images in a private registry to prevent tampering.
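A minimal NetworkPolicy sketch for the network-isolation item above: it assumes client workloads run in namespaces labeled access-ollama: "true" (an illustrative convention) and that your CNI enforces NetworkPolicy (e.g., Calico or Cilium):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ollama-allow-clients
  namespace: ai-services
spec:
  podSelector:
    matchLabels:
      app: ollama
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          access-ollama: "true"
    ports:
    - protocol: TCP
      port: 11434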
This deployment balances stability, scalability, and resource efficiency—suitable for both testing and production. Key advantages include:
Persistent models: Avoid repeated downloads via PVC.
Controllable resources: Prevent resource contention with CPU/memory limits.
Flexible scaling: Support single-replica debugging and multi-replica high availability.
Adjust configurations (storage capacity, resource limits, Service type) based on your cluster resources and model requirements to quickly launch lightweight AI services.