
Ollama Service Kubernetes Deployment Guide: Full Workflow from Configuration to Validation (YAML Included)

2025-10-31 12 mins read

A complete enterprise-grade Ollama deployment solution for Kubernetes—covering namespace isolation, persistent storage, Deployment setup, and service exposure. This guide includes pre-deployment checks, step-by-step operations, validation methods, and production optimization tips (GPU acceleration, multi-replica scaling) to help developers and DevOps teams quickly launch stable, scalable Ollama services on K8s.


Overview

As lightweight LLMs gain traction, Ollama has become a top choice for SMBs deploying AI capabilities, thanks to its one-click setup and low barrier to entry. However, single-machine Ollama deployments face production challenges: model loss, uncontrolled resource usage, and difficult scaling. Kubernetes solves these pain points with its container orchestration capabilities. This guide provides a full-stack plan for deploying Ollama on K8s, supporting use cases from testing to production.

I. Pre-Deployment Prerequisites

Ensure your environment meets these requirements to avoid configuration issues (a quick verification checklist follows the list):

  • K8s Cluster: 1+ node cluster (recommended v1.24+, supports Containerd runtime). Verify with kubectl get nodes (nodes must be in Ready state).

  • Persistent Storage: The cluster must support persistent storage (e.g., a default StorageClass, NFS, Local Path) to save Ollama models; without persistence, models are lost whenever the Pod restarts.

  • Resource Reservation:

    • 7B model: Minimum 4 CPU cores + 8GB memory

    • 13B model: Minimum 8 CPU cores + 16GB memory

    • GPU acceleration: Pre-install NVIDIA device plugin (nvidia-device-plugin).

  • Tools: Local kubectl installed with cluster access. Verify with kubectl cluster-info.
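The commands below offer a quick way to verify these prerequisites (a suggested checklist; the device-plugin check only applies to GPU setups):

# Nodes must be in Ready state
kubectl get nodes

# Cluster must be reachable from the local kubectl
kubectl cluster-info

# List available storage classes (needed for the PVC in section II)
kubectl get storageclass

# GPU setups only: confirm the NVIDIA device plugin Pods are running
kubectl get pods -n kube-system | grep -i nvidia-device-plugin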

II. Core Configuration: End-to-End Design

Ollama’s K8s deployment requires 4 key components: namespace isolation, persistent storage (PVC), a Deployment, and Service exposure. Below are production-ready configurations with inline comments; adjust them as needed.

1. Namespace: Resource Isolation

Create a dedicated namespace to separate Ollama resources from other services:

apiVersion: v1
kind: Namespace
metadata:
  name: ai-services  # All Ollama resources reside here
  labels:
    app: ollama      # Unified label for resource filtering

2. Persistent Storage (PVC): Avoid Model Loss

Ollama stores models in /root/.ollama by default. Use PVC to persist models across Pod restarts:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models-pvc
  namespace: ai-services
spec:
  accessModes:
    - ReadWriteOnce  # Use ReadWriteMany for multi-replica (requires storage class support like NFS)
  resources:
    requests:
      storage: 50Gi  # 7B model ≈4GB, 13B ≈13GB; reserve 20% redundancy for multiple models
  storageClassName: "standard"  # Replace with your cluster's storage class (e.g., aws-ebs, gcp-pd, nfs-client)

3. Deployment: Core Service Orchestration

Manages Ollama container lifecycle (resource limits, health checks, environment variables):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ai-services
  labels:
    app: ollama
spec:
  replicas: 1  # Enable multi-replica only after solving storage sharing (e.g., NFS)
  selector:
    matchLabels:
      app: ollama
  strategy:
    type: Recreate  # Prevent model file conflicts in multi-replica scenarios
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest  # Pin a specific version in production (e.g., ollama/ollama:0.1.48)
        # For GPU acceleration, run on GPU nodes with the NVIDIA device plugin and request nvidia.com/gpu (see below)
        ports:
        - containerPort: 11434  # Default Ollama API port
        env:
        - name: OLLAMA_HOST
          value: "0.0.0.0"  # Allow external access (default: localhost only)
        - name: OLLAMA_MAX_LOADED_MODELS
          value: "2"  # Adjust based on available memory (default: 3)
        resources:
          requests:  # Minimum resources for scheduling
            cpu: "4"
            memory: "12Gi"  # 7B model: 8GB+, 13B model: 16GB+
          limits:  # Prevent resource contention
            cpu: "8"
            memory: "16Gi"
            # GPU configuration: uncomment to request one GPU (requires the NVIDIA device plugin)
            # nvidia.com/gpu: 1
        volumeMounts:
        - name: model-storage
          mountPath: /root/.ollama  # Default Ollama model directory (do not modify)
          subPath: ollama  # Isolate directory for shared storage
        - name: cache-volume
          mountPath: /root/.cache/ollama  # Temporary cache for faster model loading
        livenessProbe:  # Restart Pod if service fails
          httpGet:
            path: /
            port: 11434
          initialDelaySeconds: 120  # Longer delay for model loading
          periodSeconds: 20
          timeoutSeconds: 5
        readinessProbe:  # Remove Pod from Service endpoints if unready
          httpGet:
            path: /
            port: 11434
          initialDelaySeconds: 60
          periodSeconds: 10
      securityContext:  # Pod-level security context (fsGroup is only valid here, not per container)
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000  # Allow write access to the mounted model directory
      affinity:
        # Prefer scheduling to GPU nodes (if available)
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: nvidia.com/gpu.present
                operator: In
                values: ["true"]
      tolerations:
        # Tolerate GPU node taints
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: ollama-models-pvc
      - name: cache-volume
        emptyDir:
          medium: Memory
          sizeLimit: 2Gi  # In-memory cache for faster inference

4. Service: Expose Ollama for External Access

A Service gives the Pods' ephemeral IPs a stable access address:

apiVersion: v1
kind: Service
metadata:
  name: ollama-service
  namespace: ai-services
  labels:
    app: ollama
spec:
  selector:
    app: ollama
  ports:
  - port: 11434
    targetPort: 11434
    protocol: TCP
    name: http
  type: ClusterIP  # Modify for external access:
                   # - NodePort: add nodePort: 30080 (range: 30000-32767)
                   # - LoadBalancer: for cloud environments (AWS ELB, GCP LB)
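For example, a NodePort variant (a sketch; 30080 is an arbitrary value within the allowed range) changes only the spec section:

spec:
  selector:
    app: ollama
  type: NodePort
  ports:
  - port: 11434
    targetPort: 11434
    nodePort: 30080   # Reachable at http://<node-ip>:30080
    protocol: TCP
    name: http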

III. Deployment & Validation

1. Deploy Resources

Save all configurations to ollama-k8s-deploy.yaml and run:

kubectl apply -f ollama-k8s-deploy.yaml
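If anything fails to apply, a server-side dry run is a quick way to catch indentation and schema errors without creating or changing resources:

kubectl apply --dry-run=server -f ollama-k8s-deploy.yaml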

Verify resource status (all components must be healthy):

# Check namespace
kubectl get ns | grep ai-services

# Check PVC (STATUS: Bound)
kubectl get pvc -n ai-services

# Check Deployment (READY: 1/1)
kubectl get deployment -n ai-services

# Check Pod (STATUS: Running, RESTARTS: 0)
kubectl get pods -n ai-services

Troubleshooting: If the Pod is stuck in Pending, run kubectl describe pod <pod-name> -n ai-services to check for storage binding issues or resource shortages.
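Beyond describe, a few additional commands often help narrow down scheduling or startup problems (suggestions, not an exhaustive list):

# PVC binding details (a Pending PVC usually points to a missing or mismatched storage class)
kubectl describe pvc ollama-models-pvc -n ai-services

# Recent events in the namespace, oldest first
kubectl get events -n ai-services --sort-by=.metadata.creationTimestamp

# Logs of the previous container instance if the Pod is crash-looping
kubectl logs <pod-name> -n ai-services --previous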

2. Pull Models

Once the Pod is running, pull your target model (e.g., Llama3-8B):

# Enter the Ollama container (replace <pod-name> with actual Pod name)
kubectl exec -it -n ai-services <pod-name> -- /bin/sh

# Pull model (e.g., Llama3-8B)
ollama pull llama3:8b

Check pull progress with:

kubectl logs -f <pod-name> -n ai-services
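Alternatively, models can be pulled through Ollama's HTTP API without exec'ing into the container; the sketch below assumes the ollama-service created earlier is reachable (e.g., from another Pod or via port-forward), and the response streams pull progress as JSON lines:

curl http://ollama-service.ai-services:11434/api/pull -d '{
  "name": "llama3:8b"
}'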

3. Test Service Access

Option 1: Cluster Internal Test (For inter-service calls)

# Run a test Pod
kubectl run -it busybox --image=busybox:1.35 -- /bin/sh

# Test model list API
wget -qO- http://ollama-service.ai-services:11434/api/tags

Option 2: Local Test (For development debugging)

Forward the Service port to your local machine:

kubectl port-forward -n ai-services service/ollama-service 11434:11434

Test the chat API with curl:

curl http://localhost:11434/api/chat -d '{
 "model": "llama3:8b",
 "messages": [{"role": "user", "content": "What is Kubernetes?"}]
}'

A successful response (JSON format) indicates the Ollama service is operational.
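By default, /api/chat streams the answer as a sequence of JSON objects. For a single consolidated response, which is easier to read when testing manually, add "stream": false to the request body:

curl http://localhost:11434/api/chat -d '{
  "model": "llama3:8b",
  "messages": [{"role": "user", "content": "What is Kubernetes?"}],
  "stream": false
}'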

IV. Production Optimization Tips

1. Storage Optimization

  • Multi-replica scenarios: Use ReadWriteMany-capable storage (NFS, AWS EFS, GCP Filestore); see the PVC sketch after this list.

  • Performance: Use SSD for faster model loading (HDD may cause timeouts for 13B+ models).
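As a sketch of the multi-replica case, a ReadWriteMany claim could look like the following; the claim name and the nfs-client storage class are illustrative, so substitute whatever RWX-capable class your cluster provides:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models-pvc-shared   # Illustrative name
  namespace: ai-services
spec:
  accessModes:
    - ReadWriteMany                # All replicas mount the same model directory
  resources:
    requests:
      storage: 50Gi
  storageClassName: "nfs-client"   # Assumed RWX-capable class (NFS, EFS, Filestore, etc.)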

2. Resource Tuning

  • CPU/Memory:

    • 7B model: 4-8 cores + 8-12GB memory

    • 13B model: 8-16 cores + 16-32GB memory

    • 34B model: 16+ cores + 64GB+ memory

  • GPU Acceleration: Run Ollama on GPU nodes with the NVIDIA device plugin installed and an nvidia.com/gpu resource limit (see the sketch after this list); GPU serving typically gives 3-5x faster model loading and around 50% lower latency than CPU-only inference.
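A minimal sketch of that GPU request, added to the container's resources in the Deployment above (assuming the node advertises nvidia.com/gpu via the device plugin):

resources:
  requests:
    cpu: "4"
    memory: "12Gi"
  limits:
    cpu: "8"
    memory: "16Gi"
    nvidia.com/gpu: 1   # Grants one GPU and restricts scheduling to GPU nodes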

3. Monitoring & Operations

  • Monitoring: Use Prometheus + Grafana to track model status, API throughput, and latency; Ollama does not currently expose a native Prometheus /metrics endpoint, so metrics are typically collected via an exporter sidecar or at the gateway layer.

  • Log Collection: Forward container logs to ELK or Loki for troubleshooting.

  • High Availability: Add a PodDisruptionBudget (PDB) to avoid service downtime during node maintenance; a minimal example follows this list.
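A minimal PodDisruptionBudget sketch (only useful once replicas is 2 or more; with a single replica it would block node drains):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ollama-pdb
  namespace: ai-services
spec:
  minAvailable: 1        # Keep at least one Pod running during voluntary disruptions
  selector:
    matchLabels:
      app: ollama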

4. Security Hardening

  • RBAC Permissions: Restrict access to the ai-services namespace to authorized users.

  • Network Isolation: Use NetworkPolicy to allow access only from trusted services; a sketch follows this list.

  • Image Security: Store Ollama images in a private registry to prevent tampering.
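A NetworkPolicy sketch that only admits traffic to Ollama from Pods carrying a role=llm-client label (the label is an assumption for illustration; adapt the selectors to your actual callers):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ollama-allow-trusted
  namespace: ai-services
spec:
  podSelector:
    matchLabels:
      app: ollama
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: llm-client   # Hypothetical label identifying trusted callers
      ports:
        - protocol: TCP
          port: 11434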

Summary

This deployment balances stability, scalability, and resource efficiency—suitable for both testing and production. Key advantages include:

  • Persistent models: Avoid repeated downloads via PVC.

  • Controllable resources: Prevent resource contention with CPU/memory limits.

  • Flexible scaling: Support single-replica debugging and multi-replica high availability.

Adjust configurations (storage capacity, resource limits, Service type) based on your cluster resources and model requirements to quickly launch lightweight AI services.

 
