Tutorial

How to Use Local LLMs with OpenClaw (Ollama, Llama, Mistral)

Step-by-step guide to running OpenClaw with local LLM models via Ollama. Use Llama 3, Mistral, Phi-3, and other open-weight models for completely private AI automation.

By OpenClaw Team

Running AI assistants with cloud APIs (OpenAI, Anthropic) is convenient but sends every conversation to third-party servers. For complete privacy, zero ongoing costs, or air-gapped deployments, local LLMs (Large Language Models) are the solution. This guide shows you how to run OpenClaw with Ollama-powered local models—achieving ChatGPT-like AI entirely on your own hardware.

Local LLMs with OpenClaw offer complete privacy (conversations never leave your device), zero per-message costs (pay only for hardware/electricity), unlimited usage (no rate limits or quotas), offline operation (works without internet), and customization freedom (fine-tune models for specific domains). Whether you’re privacy-conscious, cost-sensitive, or working in secure environments, local LLMs unlock powerful AI without compromise.

What Are Local LLMs?

Local LLMs are large language models that run on your own computer, server, or device rather than cloud servers. Recent models like Meta’s Llama 3, Mistral AI’s Mistral 7B, Microsoft’s Phi-3, and Google’s Gemma deliver impressive performance—often matching GPT-3.5 quality and approaching GPT-4 for many tasks—while running on consumer hardware.

How they work: Models are downloaded to your machine (typically 4-70 GB depending on size), loaded into RAM/VRAM, and inference happens locally using CPU or GPU. Tools like Ollama make this process simple, handling model management, optimization, and API interfaces automatically.

Quality vs. cloud LLMs: As of 2026, local models excel at general conversation, coding assistance, text analysis, and summarization. They lag behind cloud models (GPT-4, Claude Opus) for extremely complex reasoning, extensive world knowledge, and cutting-edge capabilities. For most personal and business use cases, local models provide 80-95% of cloud model quality at zero ongoing cost.

Why Run OpenClaw with Local LLMs?

Complete Privacy

Every message sent to OpenAI, Anthropic, or Google passes through their servers. Even with strong privacy policies, your data exists outside your control. Governments can subpoena cloud provider data. Breaches happen. Fine print changes.

Local LLMs eliminate these concerns entirely. Conversations happen on your hardware. No external servers see your queries, customer data, or proprietary information. For sensitive use cases—healthcare conversations, legal research, financial analysis, HR discussions, internal business strategy—local LLMs are the only way to guarantee privacy.

Zero Recurring Costs

Cloud AI APIs charge per token. At scale, costs add up quickly. A business processing 50,000 messages monthly might spend $500-$2,000 on API calls. Personal power users hit $50-$100 monthly.

Local LLMs eliminate these costs. After initial hardware investment (or using existing computers), inference is free. Run unlimited conversations without watching meters or worrying about quota exhaustion. For high-volume use cases, ROI happens within 3-6 months compared to cloud APIs.

Offline and Air-Gapped Operation

Internet outages, remote locations, secure facilities, and regulated environments often prohibit cloud connectivity. Local LLMs work completely offline. No network = no problem.

Deploy OpenClaw with local models in submarines, rural areas, airplanes, secure government facilities, industrial environments, or anywhere connectivity is unreliable or prohibited. The assistant functions identically whether connected to the internet or not.

Unlimited Usage

Cloud APIs impose rate limits (requests per minute/day) and throttle high-volume users. During peak usage, you might hit limits and need to wait.

Local models have no artificial limits. Your only constraint is hardware capability. Process thousands of requests concurrently if your hardware supports it. No quotas, no throttling, no waiting.

Customization and Fine-Tuning

Cloud models are general-purpose, trained on broad internet data. For specialized domains—medical terminology, legal language, company-specific jargon, niche industries—you may need customization.

Local models can be fine-tuned on your data, creating domain-specific AI assistants. Train on your documentation, support tickets, or industry corpus to improve relevance and accuracy for your specific use case.

Hardware Requirements

Local LLM performance depends entirely on hardware. Here’s what you need:

Minimum Specs (Small Models: 7B parameters)

  • CPU: 4+ cores (Intel i5/AMD Ryzen 5 or better)
  • RAM: 8 GB minimum (16 GB recommended)
  • Storage: 10 GB free space
  • GPU: Optional (CPU inference works but is slower)

Performance: 1-5 tokens/second. Acceptable for personal use, low-volume conversations. Models: Phi-3 mini, Mistral 7B, Llama 3 8B.

Example hardware: MacBook Air M2, mid-range Windows laptop, Raspberry Pi 5 (8GB).

Recommended Specs (Medium Models)

  • CPU: 8+ cores (Intel i7/AMD Ryzen 7, Apple M1/M2/M3)
  • RAM: 16-32 GB
  • Storage: 25 GB free space
  • GPU: 8+ GB VRAM (NVIDIA RTX 3060, AMD RX 6700, Apple Silicon)

Performance: 10-30 tokens/second. Good conversational experience, suitable for small business or team use. Models: Llama 3 70B (quantized), Mistral 22B, Mixtral 8x7B.

Example hardware: MacBook Pro M3, gaming desktop with RTX 4070, workstation with 32GB RAM.

High-Performance Specs (Large Models: 70B+ parameters)

  • CPU: 16+ cores
  • RAM: 64-128 GB
  • Storage: 50+ GB SSD
  • GPU: 24+ GB VRAM (NVIDIA RTX 4090, A100, H100)

Performance: 30-100+ tokens/second. Near-instant responses, supports concurrent users, enterprise-grade. Models: Llama 3 70B, Qwen 2.5 72B, Mixtral 8x22B.

Example hardware: High-end workstation, dedicated AI server, cloud GPU instances (Vast.ai, RunPod).

GPU vs CPU Inference

GPU inference is 10-50x faster than CPU but requires compatible hardware (NVIDIA CUDA, AMD ROCm, or Apple Metal). Modern consumer GPUs (RTX 4060 and up) dramatically improve the experience.

CPU inference works on any computer but is slower. It is acceptable for low-frequency use (a few queries per hour) but not suitable for real-time conversations or high volume.

Recommendation: If you have a GPU with 8+ GB VRAM, use it. Otherwise, start with CPU inference and upgrade if speed becomes an issue.

Installing Ollama

Ollama is the easiest way to run local LLMs. It handles model downloads, management, and optimization (quantization, context caching), and provides an OpenAI-compatible API for easy integration.

Step 1: Download and Install Ollama

macOS:

# Download from ollama.com or use Homebrew
brew install ollama

# Verify installation
ollama --version

Linux:

# One-line install
curl -fsSL https://ollama.com/install.sh | sh

# Verify installation
ollama --version

Windows: Download the installer from ollama.com/download and run it. Ollama runs as a background service.

Step 2: Download Your First Model

# Download Llama 3 8B (4.7 GB, good starting point)
ollama pull llama3

# Or download Mistral 7B (4.1 GB, faster)
ollama pull mistral

# Or download Phi-3 mini (2.3 GB, smallest capable model)
ollama pull phi3

The first download takes 5-15 minutes depending on internet speed. Switching between already-downloaded models afterward is instant.
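
At any point, you can check which models are already on disk:

# Show downloaded models and their sizes
ollama list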

Step 3: Test the Model

# Start interactive chat
ollama run llama3

# Try a query
>>> Who was the first person on the moon?

# Exit with /bye
>>> /bye

If you get intelligent responses, Ollama is working correctly.

Step 4: Start Ollama Server (For OpenClaw Integration)

# Start Ollama API server (runs on localhost:11434)
ollama serve

Keep this terminal window open or run as background service. OpenClaw will connect to this API.

macOS/Linux background service:

# Ollama auto-starts on boot after installation
# Check status
systemctl status ollama  # Linux
# or
launchctl list | grep ollama  # macOS
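
Before wiring up OpenClaw, you can confirm the API responds with a direct request (a minimal check against Ollama's standard generate endpoint; adjust the model name to whichever one you pulled):

# Send a one-off prompt to the local API
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Reply with the word ready.",
  "stream": false
}'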

Configuring OpenClaw with Ollama

Now connect OpenClaw to your local Ollama models.

Step 1: Install OpenClaw

If you haven’t already:

npm install -g openclaw
openclaw init my-local-ai-bot
cd my-local-ai-bot

Step 2: Configure AI Model Settings

Edit openclaw.config.yaml:

name: openclaw-local-llm
version: 1.0.0

# AI Model Configuration - Ollama
ai:
  provider: ollama
  model: llama3  # or 'mistral', 'phi3', etc.
  base_url: http://localhost:11434  # Ollama API endpoint
  temperature: 0.7
  max_tokens: 2048

# Platform Configuration
platforms:
  - type: telegram  # Or whatsapp, discord, etc.
    enabled: true

# Optional: Response streaming for better UX
streaming: true

Key settings:

  • provider: ollama - Use Ollama instead of OpenAI/Anthropic
  • model: llama3 - Which Ollama model to use
  • base_url - Where Ollama API is running (localhost for local)
  • streaming: true - Show responses as they generate (like ChatGPT typing effect)

Step 3: Set Up Environment Variables

Create .env file:

# No API keys needed for Ollama!
# Optional: If you want to use cloud models as fallback
# OPENAI_API_KEY=sk-...
# ANTHROPIC_API_KEY=sk-ant-...

Unlike cloud APIs, local LLMs require no API keys or authentication. It just works.
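
To see what "no API key" means in practice: Ollama also exposes an OpenAI-compatible endpoint, so any OpenAI-style client can call it without credentials. A quick sanity check, assuming the llama3 model pulled earlier:

# OpenAI-compatible chat endpoint, no Authorization header required
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'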

Step 4: Start OpenClaw

openclaw start

OpenClaw connects to Ollama and is ready to chat. Test by sending a message on your configured platform (Telegram, WhatsApp, etc.).

Choosing the Right Model

Ollama supports dozens of models. Here’s how to choose:

For General Use (Balanced Performance and Quality)

Llama 3 8B (ollama pull llama3):

  • Size: 4.7 GB
  • Speed: Fast (20-40 tokens/sec on good hardware)
  • Quality: Excellent for general conversation, Q&A, summarization
  • Best for: Personal assistants, customer support, general automation

Mistral 7B (ollama pull mistral):

  • Size: 4.1 GB
  • Speed: Very fast (30-50 tokens/sec)
  • Quality: Strong reasoning, good for technical topics
  • Best for: Coding help, technical documentation, analysis

For Maximum Quality (Slower but Smarter)

Llama 3 70B (ollama pull llama3:70b):

  • Size: 39 GB (quantized)
  • Speed: Slower (5-15 tokens/sec on GPU, 1-3 on CPU)
  • Quality: Approaches GPT-4, excellent reasoning
  • Best for: Complex queries, professional writing, detailed analysis
  • Requires: 64+ GB RAM or 24+ GB VRAM GPU

Mixtral 8x7B (ollama pull mixtral):

  • Size: 26 GB
  • Speed: Moderate (10-25 tokens/sec)
  • Quality: Excellent, multilingual (French, German, Spanish, Italian)
  • Best for: International use, code generation, multi-language support

For Speed and Low Resources

Phi-3 Mini (ollama pull phi3):

  • Size: 2.3 GB
  • Speed: Very fast (40-80 tokens/sec)
  • Quality: Good for simple tasks, surprisingly capable for size
  • Best for: Raspberry Pi, old laptops, quick responses
  • Limitations: Weaker reasoning than larger models

Gemma 2B (ollama pull gemma:2b):

  • Size: 1.4 GB
  • Speed: Extremely fast (60-100+ tokens/sec)
  • Quality: Basic but functional
  • Best for: Simple FAQ bots, extremely constrained hardware

For Specialized Tasks

Code Llama (ollama pull codellama):

  • Optimized for programming, code completion, debugging
  • Best for: Developer assistants, code review bots

Llava (ollama pull llava):

  • Multimodal model (text + images)
  • Best for: Image analysis, visual Q&A, accessibility tools

How to Switch Models

# Download new model
ollama pull mistral

# Update openclaw.config.yaml
# Change: model: llama3
# To: model: mistral

# Restart OpenClaw
openclaw restart

No code changes needed—just configuration update.

Advanced Configuration

Multi-Model Routing (Use Different Models for Different Tasks)

OpenClaw can intelligently route queries to appropriate models:

ai:
  routing:
    simple_queries:
      provider: ollama
      model: phi3  # Fast, cheap model for FAQs
      triggers: ["greeting", "simple_lookup", "faq"]

    complex_queries:
      provider: ollama
      model: llama3:70b  # Smart model for hard questions
      triggers: ["reasoning", "analysis", "complex"]

    coding_queries:
      provider: ollama
      model: codellama  # Specialized model for code
      triggers: ["code", "programming", "debug"]

    fallback:
      provider: anthropic  # Cloud API if local fails
      model: claude-3-5-haiku-20241022

This optimizes speed (simple queries are answered instantly by the small model) and quality (complex queries are routed to the powerful model) while minimizing cloud API costs (the cloud model is used only as a fallback when local inference fails).

Performance Tuning

Adjust context window (how much conversation history to remember):

ai:
  provider: ollama
  model: llama3
  max_tokens: 2048  # Maximum response length
  context_window: 4096  # How much history to keep

Larger context = better memory but slower inference.
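
OpenClaw's context_window setting controls how much history it sends; if you also want the model itself to accept a larger context inside Ollama, the usual mechanism is a custom Modelfile (sketch below; num_ctx is a standard Ollama parameter, and llama3-8k is just an illustrative name):

# Build a llama3 variant with a larger context window
cat > Modelfile <<'EOF'
FROM llama3
PARAMETER num_ctx 8192
EOF
ollama create llama3-8k -f Modelfile

# Then reference it in openclaw.config.yaml as model: llama3-8k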

Quantization settings (Ollama auto-handles this, but you can specify):

# Download 4-bit quantized version (smaller, faster, slightly lower quality)
ollama pull llama3:8b-instruct-q4_0

# Download 8-bit version (larger, slower, higher quality)
ollama pull llama3:8b-instruct-q8_0

Concurrent request handling:

ai:
  provider: ollama
  model: llama3
  max_concurrent_requests: 3  # How many simultaneous conversations

Set based on hardware capability. More concurrent requests require more RAM.
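
On the Ollama side, concurrency is configured separately. Recent Ollama versions expose it through environment variables (the example below assumes a current release):

# Allow 3 parallel requests per model and keep up to 2 models loaded
OLLAMA_NUM_PARALLEL=3 OLLAMA_MAX_LOADED_MODELS=2 ollama serve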

GPU Selection (Multi-GPU Systems)

If you have multiple GPUs:

# Use specific GPU
CUDA_VISIBLE_DEVICES=1 ollama serve

# Use multiple GPUs
CUDA_VISIBLE_DEVICES=0,1 ollama serve

Custom System Prompts for Domain Specialization

Tailor model behavior without fine-tuning:

instructions: |
  You are a medical assistant AI helping healthcare professionals.

  IMPORTANT GUIDELINES:
  - Always cite medical sources when possible
  - Use correct medical terminology
  - For drug interactions, recommend consulting databases
  - Never diagnose - suggest consulting licensed physicians
  - Prioritize patient safety in all responses

  Your knowledge includes:
  - Anatomy and physiology
  - Common medical conditions and treatments
  - Drug information and interactions
  - Medical procedures and protocols

This “primes” the model to respond appropriately for specialized domains.

Optimizing Performance

Speed Optimization Techniques

1. Use quantized models: 4-bit quantization reduces size by 75% with minimal quality loss.

ollama pull llama3:8b-instruct-q4_K_M  # Medium quantization

2. Enable GPU acceleration: Verify GPU is being used.

# Check Ollama GPU usage
nvidia-smi  # NVIDIA
rocm-smi    # AMD

# You should see ollama process using GPU memory

3. Reduce context window: Less context = faster inference.

ai:
  context_window: 2048  # Instead of 4096

4. Use smaller models for simple tasks: Route simple queries to Phi-3, complex to Llama 3 70B.

5. Preload models: Keep model in memory instead of loading per request.

# Preload model (stays in RAM)
ollama run llama3

# Leave this running in background
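
By default, Ollama unloads an idle model after a few minutes. If your Ollama version supports the keep_alive setting, you can keep models resident instead of re-running the chat:

# Keep loaded models in memory for 24 hours of idle time
OLLAMA_KEEP_ALIVE=24h ollama serve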

Memory Optimization

For limited RAM systems:

ai:
  provider: ollama
  model: phi3  # Smaller model
  max_concurrent_requests: 1  # Only one conversation at a time
  context_window: 2048  # Smaller context

Monitor memory usage:

# Linux
htop

# macOS
top  # or use Activity Monitor (GUI)

# Look for ollama process memory consumption

If running out of memory:

  • Use smaller model (Phi-3 instead of Llama 3)
  • Reduce concurrent requests
  • Close other applications
  • Upgrade RAM (local LLMs love RAM)

Troubleshooting

Ollama Connection Failed

Error: “Failed to connect to Ollama at localhost:11434”

Solution:

# Check if Ollama is running
curl http://localhost:11434/api/tags

# If not running, start it
ollama serve

# Check port isn't blocked
lsof -i :11434

Slow Response Times

Issue: Responses taking 10+ seconds per message.

Causes and fixes:

  1. CPU inference: Upgrade to GPU or use smaller model (Phi-3)
  2. Insufficient RAM: Close other apps, use smaller model
  3. Large context window: Reduce context_window in config
  4. Wrong quantization: Use Q4 quantization for faster inference

Model Download Failures

Error: “Download interrupted” or “Connection timeout”

Solution:

# Resume interrupted download
ollama pull llama3

# If failures persist, check your network or proxy settings and retry;
# interrupted pulls resume from where they left off

# Or download manually and import

“Out of Memory” Errors

Error: Ollama crashes with OOM (out of memory)

Solution:

# Use smaller quantization
ollama pull llama3:8b-instruct-q4_K_M  # Instead of q8_0

# Or switch to smaller model
ollama pull phi3  # Instead of llama3:70b

# Check available RAM
free -h  # Linux
vm_stat  # macOS

Poor Response Quality

Issue: Responses are nonsensical or low-quality.

Causes and fixes:

  1. Wrong model: Some models perform poorly on certain tasks. Try different model.
  2. Extreme quantization: Q2 quantization sacrifices too much quality. Use Q4 or higher.
  3. Insufficient context: Model doesn’t remember conversation. Increase context_window.
  4. Poor prompt: Improve system instructions to guide model behavior.

GPU Not Being Used

Issue: Ollama using CPU despite having GPU.

Solution:

# Verify GPU drivers installed
nvidia-smi  # NVIDIA
rocm-smi    # AMD

# Reinstall Ollama with GPU support
curl -fsSL https://ollama.com/install.sh | sh

# Check whether the loaded model is running on GPU
ollama run llama3 "hello"
ollama ps
# The PROCESSOR column should show GPU rather than CPU

Cost Analysis: Local vs Cloud

Initial Investment

Local LLM setup:

  • Existing hardware: $0 (use laptop/desktop you already own)
  • Budget upgrade: $500-1,500 (better RAM, mid-range GPU)
  • High-performance: $2,000-5,000 (workstation with 24GB+ VRAM GPU)

Cloud APIs:

  • $0 upfront
  • Pay per use starting immediately

Monthly Costs

Local LLM (after hardware purchase):

  • Electricity: $5-20/month (depends on usage and hardware)
  • Maintenance: $0-10/month (occasional updates, monitoring)
  • Total: $5-30/month

Cloud APIs (moderate use, 10,000 messages/month):

  • GPT-3.5 Turbo: ~$50/month
  • GPT-4 Turbo: ~$200/month
  • Claude Sonnet: ~$80/month

Cloud APIs (high use, 100,000 messages/month):

  • GPT-3.5: ~$500/month
  • GPT-4: ~$2,000/month
  • Claude: ~$800/month

Break-Even Analysis

Scenario: Moderate use (10,000 messages/month)

Setup Cost                     | Cloud Monthly | Local Monthly | Break-Even
$0 (existing hardware)         | $80           | $10           | Immediate
$1,000 (RAM upgrade + mid GPU) | $80           | $15           | ~15 months
$3,000 (high-end build)        | $80           | $20           | 50 months

Scenario: High use (100,000 messages/month)

Setup Cost             | Cloud Monthly | Local Monthly | Break-Even
$0 (existing hardware) | $800          | $25           | Immediate
$1,000                 | $800          | $25           | 1.3 months
$3,000                 | $800          | $30           | 3.9 months

For high-volume users, local LLMs pay for themselves in weeks to months.
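
The arithmetic behind these tables is simple: break-even (months) = setup cost / (cloud monthly - local monthly). For the $1,000 moderate-use build, that is $1,000 / ($80 - $15) ≈ 15 months; at high volume, the same build breaks even in $1,000 / ($800 - $25) ≈ 1.3 months.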

Real-World Examples

Example 1: Privacy-Focused Personal Assistant

Setup: MacBook Pro M3 Max (96GB RAM) running Llama 3 70B

Configuration:

ai:
  provider: ollama
  model: llama3:70b
  temperature: 0.7

platforms:
  - type: telegram
  - type: whatsapp
  - type: discord

skills:
  - calendar
  - notes
  - web-search  # local search engine
  - file-management

Results: Complete privacy for personal conversations, calendar, emails, and documents. Zero data sent to third parties. No monthly costs. Performance comparable to GPT-4 for personal productivity tasks.

Example 2: Small Business Customer Support

Setup: Intel i7 workstation (32GB RAM) with RTX 3060 (12GB VRAM), Mistral 7B

Configuration:

ai:
  provider: ollama
  model: mistral

platforms:
  - type: whatsapp

skills:
  - knowledge-base-search
  - order-lookup
  - appointment-scheduling

rag:
  enabled: true
  vector_store: chroma
  sources:
    - type: local_files
      path: ./company-docs

Results: 24/7 customer support automation, 70% of inquiries handled without human intervention, $0 monthly AI costs (vs. $300 projected for cloud APIs), complete control over business data.

Example 3: Offline Field Service Assistant

Setup: Raspberry Pi 5 (8GB) running Phi-3 Mini

Configuration:

ai:
  provider: ollama
  model: phi3
  max_concurrent_requests: 1

platforms:
  - type: terminal  # Command-line interface
  - type: voice     # Hands-free operation

skills:
  - equipment-manual-search
  - troubleshooting-guide
  - parts-lookup

Results: Field technicians access AI assistant for equipment troubleshooting in remote locations without internet. Instant access to manuals, procedures, and diagnostic guidance. Runs on battery-powered Pi for 8+ hours.

FAQ

Can local LLMs match ChatGPT quality?

For most tasks, yes. Llama 3 70B approaches GPT-4 quality for general conversation, analysis, and writing. Smaller models (Llama 3 8B, Mistral 7B) match GPT-3.5 quality. Local models lag behind cutting-edge cloud models (GPT-4, Claude Opus) for extremely complex reasoning, extensive world knowledge, and latest capabilities. For 80-90% of use cases, local quality is sufficient. See our ChatGPT alternative comparison for detailed quality analysis.

How much does electricity cost for running local LLMs?

Electricity costs are minimal for typical usage. Idle (model loaded, not generating): 10-50 watts ($1-5/month if running 24/7). Active inference (generating responses): ~100-300 watts during generation (only when responding, not constant). Daily usage (1-2 hours active): ~$0.10-0.50/month. Heavy usage (8+ hours daily): ~$5-15/month. High-end GPU workstation running continuously: ~$15-30/month. Compare to cloud APIs at $50-500/month—electricity is negligible.

Can I run local LLMs on Raspberry Pi?

Yes, with limitations. Raspberry Pi 5 (8GB RAM) can run smaller models like Phi-3 Mini (2.3GB) and Gemma 2B acceptably. Performance is slow (5-10 tokens/second) but functional for low-frequency queries. Larger models (Llama 3 8B, Mistral 7B) require quantization and run very slowly. Not suitable for real-time conversations or concurrent users. Best for: offline field devices, hobbyist projects, educational purposes. See our Raspberry Pi AI guide for detailed setup.

What’s the difference between Ollama and other local LLM tools?

Ollama is specifically designed for ease of use with automatic model management, simple CLI interface, OpenAI-compatible API (easy integration), and cross-platform support (Mac, Linux, Windows). Alternatives include llama.cpp (lower-level, more control, steeper learning curve), LM Studio (desktop GUI similar to ChatGPT), GPT4All (desktop app with curated models), and Text generation web UI (web interface with advanced features). Ollama is the best choice for OpenClaw integration due to API compatibility and simplicity.

Can I fine-tune local models for my specific use case?

Yes, but it’s advanced. Fine-tuning requires ML expertise, training data (hundreds to thousands of examples), GPU with 16+ GB VRAM, time (hours to days depending on model size and data), and tools (Hugging Face transformers, Axolotl, Unsloth). For most users, better approach is custom system prompts (easy, immediate, no training required), RAG with company documents (adds domain knowledge without model changes), and few-shot examples (provide examples in prompts). Fine-tuning worth considering for highly specialized domains with large proprietary datasets.

How do I update Ollama models?

Models improve over time with new releases. Update models using pull command:

# Check for updates
ollama list

# Update specific model
ollama pull llama3

# Update all installed models (there is no single built-in flag; loop over the list)
ollama list | awk 'NR>1 {print $1}' | xargs -n1 ollama pull

Models are versioned. Update won’t break existing configuration (API remains compatible).

Can I use multiple local models simultaneously?

Yes, run different models for different tasks to optimize performance and quality. Load models using routing configuration or run multiple Ollama instances on different ports. Be aware that each loaded model consumes RAM proportional to its size. On a 32GB RAM machine, you can comfortably run Phi-3 (2GB) + Mistral 7B (4GB) + Llama 3 8B (5GB) simultaneously (~11GB total, leaving 20GB for system).
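
Running a second instance on another port is a one-liner, since Ollama reads its bind address from the OLLAMA_HOST environment variable; point that model's base_url in OpenClaw at the new port:

# Run a second, independent Ollama instance on port 11435
OLLAMA_HOST=127.0.0.1:11435 ollama serve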

Is local LLM inference secure?

Running models locally is more secure than cloud APIs (no data transmission to third parties, no server-side logging or storage, complete control over access and usage, can run air-gapped offline). However, you’re responsible for system security (keep OS and software updated, use firewalls, implement access controls, encrypt storage if handling sensitive data). For maximum security, combine local LLMs with self-hosted OpenClaw setup on isolated network.
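
As one concrete hardening step, make sure the Ollama port is not reachable from outside the machine. Ollama binds to localhost by default; if you have exposed it (for example via OLLAMA_HOST=0.0.0.0), block the port at the firewall, e.g. with ufw on Linux:

# Block external access to the Ollama API port
sudo ufw deny 11434/tcp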


Next Steps

You now have everything needed to run OpenClaw with local LLMs for completely private, cost-free AI automation.

To get started:

  1. Install Ollama on your computer
  2. Download your first model: ollama pull llama3
  3. Install OpenClaw and configure for Ollama
  4. Start chatting with your private AI assistant

Join the community:

  • Star OpenClaw on GitHub
  • Share your local LLM setup in Discussions
  • Contribute model configurations and optimizations

Local LLMs democratize AI by removing costs, protecting privacy, and ensuring availability. With OpenClaw and Ollama, powerful AI assistance is just minutes away—no subscriptions, no data sharing, no limits.

Start building your private AI future today.

Ready to Get Started?

Install OpenClaw and build your own AI assistant today.
