
OpenClaw Local LLM: The Complete Off-Grid AI Agent Setup

Running agents without a cloud API means zero data exposure, zero per-token cost, and zero dependency on uptime you don't control. Here's the complete setup that gets OpenClaw running on a local model — hardware choices, config file, and the tradeoffs you need to know before you commit.

J. Donovan, Technical Writer
Feb 1, 2025
Key Takeaways
  • OpenClaw connects to any local model server that exposes an OpenAI-compatible API — Ollama and LM Studio both qualify out of the box
  • Set provider: openai-compatible and point base_url at your local server to switch from cloud to local in under five minutes
  • 16GB RAM handles 7B models comfortably; 32GB or a GPU with 12GB+ VRAM is needed for reliable 13B performance
  • Mistral 7B Instruct and Llama 3 8B Instruct are the most reliable local models for OpenClaw agents as of early 2025
  • Tool calling works on newer instruction-tuned models — test it before deploying; older models silently produce malformed JSON

Most teams don't realize OpenClaw can run entirely offline until they're already paying $400/month in API bills. A few hours of setup cuts that to zero. Every token stays on your hardware; no agent response ever touches a third-party server. Here's exactly how to make that happen.

Why Local LLMs Change the Economics of Agent Deployments

Cloud LLM costs are predictable on paper and shocking in practice. An agent that summarizes 500 documents a day, checks email every 15 minutes, and runs a research pipeline burns through tokens faster than most teams expect. At GPT-4o prices, that's real money every month.

Local models change the math entirely. Once the hardware is paid for, inference costs nothing per token. A mid-range workstation running Mistral 7B can serve dozens of agent requests per hour at zero marginal cost. Teams with strict data governance requirements — healthcare, legal, finance — often have no choice: data cannot leave the building. Local models are the only viable path.

There's a quality tradeoff. Local 7B models don't match GPT-4o on complex reasoning. But for the majority of agent tasks — summarization, classification, document extraction, structured output generation — a well-prompted local model handles 80% of the workload at a fraction of the cost. Routing heavy reasoning tasks to a cloud model while keeping routine work local is the pattern power users actually run.

💡
Start with a Hybrid Setup

Don't commit to fully local on day one. Configure your development environment with a local model and keep a cloud fallback for tasks that require stronger reasoning. OpenClaw's provider config makes switching between them a one-line change — so you can benchmark real workloads before going all-in on local.
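One way to structure this hybrid setup is to keep both provider blocks in one config file and comment out whichever you're not using. A sketch (the cloud values below are placeholders for illustration, not verified OpenClaw defaults):

```yaml
# openclaw.config.yaml -- comment/uncomment one block to switch
model:
  # --- local (Ollama) ---
  provider: openai-compatible
  base_url: http://localhost:11434/v1
  model: mistral:7b-instruct-v0.3-q4_K_M
  # --- cloud fallback (placeholder values) ---
  # provider: openai
  # base_url: https://api.openai.com/v1
  # model: gpt-4o
  api_key: local
```

Benchmark your real workloads against both blocks before deciding which tasks stay local.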

Hardware Requirements for OpenClaw Local LLM Setups

Hardware requirements depend entirely on which model you choose. The spec that matters most is not raw CPU speed — it's memory bandwidth. LLM inference is memory-bound, not compute-bound. A machine with fast RAM or a GPU with high VRAM bandwidth will outperform a machine with a faster CPU but slower memory.

Model Size   Minimum RAM   Recommended                  Expected Speed
7B (Q4)      8GB RAM       16GB RAM                     5–15 tok/s CPU
13B (Q4)     16GB RAM      32GB RAM or GPU 12GB VRAM    2–8 tok/s CPU, 20–40 tok/s GPU
34B (Q4)     32GB RAM      GPU 24GB VRAM                1–3 tok/s CPU, 10–20 tok/s GPU
70B (Q4)     64GB RAM      Multi-GPU or 64GB+ VRAM      Slow on CPU; GPU required for agents

Apple Silicon deserves a special mention. M-series Macs use unified memory shared between CPU and GPU. A MacBook Pro with 32GB unified memory handles 13B models at 15–25 tokens per second through Metal acceleration. For teams without a dedicated GPU workstation, an M3 Pro Mac is currently the most cost-effective local inference machine available.
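As a rule of thumb, a Q4-quantized model needs roughly 0.6GB of memory per billion parameters, plus 1–2GB of overhead for the KV cache and server runtime. A back-of-envelope shell helper (the 0.6 factor is an approximation, not an exact formula):

```shell
# Rough RAM estimate for a Q4-quantized model.
# Rule of thumb: ~0.6 GB per billion parameters + ~1.5 GB runtime overhead.
estimate_q4_gb() {
  awk -v p="$1" 'BEGIN { printf "%.1f\n", p * 0.6 + 1.5 }'
}

estimate_q4_gb 7    # ~5.7 GB, consistent with the 8GB minimum above
estimate_q4_gb 13   # ~9.3 GB, which is why 16GB is the floor for 13B
```

If the estimate lands near your machine's total RAM, step down a model size: the OS and the agent process need headroom too.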

Choosing the Right Local Model for OpenClaw

Not every model works well inside an agent framework. OpenClaw agents depend on the model following instruction formats precisely, maintaining multi-turn context, and — for tool-using agents — producing valid JSON for function calls. Models that drift from instruction format or hallucinate JSON fields break agent pipelines silently.
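One cheap guard against silent breakage is validating a tool call's arguments string before executing the tool. A minimal sketch (assumes jq is installed; the payloads are illustrative samples, not OpenClaw's actual wire format):

```shell
# Check whether a tool call's "arguments" field is parseable JSON.
# Prints "valid" or "malformed". Requires jq.
validate_tool_call() {
  printf '%s' "$1" | jq -e '.arguments | fromjson' > /dev/null 2>&1 \
    && echo "valid" || echo "malformed"
}

validate_tool_call '{"name":"search","arguments":"{\"query\":\"weather\"}"}'   # valid
validate_tool_call '{"name":"search","arguments":"{query: weather}"}'          # malformed
```

Rejecting a malformed call and re-prompting the model is almost always better than letting a broken tool invocation fail downstream.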

Here's what we've seen consistently: models fine-tuned specifically for instruction-following outperform base models by a wide margin in agent scenarios, even when the base model has more parameters. A well-tuned 7B instruct model beats a base 13B every time for structured agent tasks.

Reliable choices as of early 2025:

  • Mistral 7B Instruct v0.3 — best all-around 7B for agent tasks, solid tool calling support, fast on CPU
  • Llama 3 8B Instruct — Meta's strongest small model, excellent instruction adherence, good JSON output
  • Llama 3 70B Instruct — near GPT-4 quality for reasoning tasks, requires serious hardware
  • Phi-3 Mini (3.8B) — runs on low-end hardware, surprisingly capable for simple agent tasks, context window limitations
  • Qwen2 7B Instruct — strong multilingual support, good for non-English agent deployments

Sound familiar? You've probably already tried one of these and hit context window problems or broken tool calls. That's where the config matters — and we'll get to the exact setup in a moment.

OpenClaw Config for Local LLM

OpenClaw treats any OpenAI-compatible API endpoint as a valid model provider. Local model servers like Ollama and LM Studio expose exactly this format. The configuration change is minimal.

# openclaw.config.yaml — local LLM setup

model:
  provider: openai-compatible
  base_url: http://localhost:11434/v1
  model: mistral:7b-instruct-v0.3-q4_K_M
  api_key: local          # required field, value ignored by local servers
  context_window: 8192
  temperature: 0.2
  max_tokens: 2048

# For LM Studio, change base_url to:
# base_url: http://localhost:1234/v1

The api_key field is required by OpenClaw's config schema even for local providers. Set it to any non-empty string — local servers ignore it entirely.

Alternatively, use the CLI to set these values directly:

openclaw config set model.provider=openai-compatible
openclaw config set model.base_url=http://localhost:11434/v1
openclaw config set model.model=mistral:7b-instruct-v0.3-q4_K_M
openclaw config set model.api_key=local
openclaw config set model.temperature=0.2
⚠️
Start Your Local Server Before OpenClaw

OpenClaw validates the model connection at startup. If your Ollama or LM Studio server isn't running when OpenClaw starts, the gateway fails to initialize and all agents go offline. Add your local model server to your system's startup services, or use a process manager like PM2 to ensure it starts automatically.
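A small wrapper that blocks until the model server answers before launching OpenClaw makes this ordering robust. A sketch assuming Ollama's default endpoint (pass a different URL for LM Studio):

```shell
#!/bin/sh
# Poll the local model server until it responds, then report readiness.
# Assumes Ollama's default port; adjust the URL for LM Studio (1234).
wait_for_server() {
  url="${1:-http://localhost:11434/v1/models}"
  retries="${2:-10}"
  while [ "$retries" -gt 0 ]; do
    if curl -sf --max-time 2 "$url" > /dev/null 2>&1; then
      echo "server up"
      return 0
    fi
    retries=$((retries - 1))
    sleep 1
  done
  echo "server unreachable"
  return 1
}

# Usage: wait_for_server && openclaw start
```

Drop this in front of your OpenClaw launch command (or your PM2 start script) so the gateway never races the model server at boot.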

Performance Tuning for Local Inference

Raw token speed matters less for agent workloads than most people expect. Agents spend more time waiting for tool execution than waiting for model output. That said, there are several config changes that meaningfully improve perceived responsiveness.

Use quantized models. Q4_K_M quantization is the sweet spot — it cuts model size roughly in half versus full precision with minimal quality loss. Q5_K_M gives slightly better quality if you have the VRAM. Avoid Q2 quantization for agent tasks; the quality degradation at that level causes instruction-following failures.

Reduce context window size. Most agent tasks don't need a 32k context window. Setting context_window: 4096 in your config reduces memory pressure and speeds up both prefill and generation. Only increase it for specific agents that handle long documents.

Here's where most people stop — they set up the model, verify it responds, and call it done. The next step is tuning the system prompt length. Long system prompts with dozens of tool descriptions slow prefill time significantly on smaller models. Keep system prompts under 500 tokens for local deployments, and only include tool definitions for tools the specific agent actually uses.
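To keep yourself honest about that 500-token budget, a word-count heuristic (~1.3 tokens per English word, an approximation rather than a real tokenizer) is usually close enough:

```shell
# Rough token estimate for a prompt file: ~1.3 tokens per word.
# A heuristic only -- use your model's tokenizer for exact counts.
prompt_tokens() {
  awk -v w="$(wc -w < "$1")" 'BEGIN { printf "%d\n", w * 1.3 }'
}

# Usage: prompt_tokens system_prompt.txt
```

If the estimate comes back well over 500, trim tool descriptions before touching anything else; they're usually the bulk of an agent's system prompt.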

Enable GPU layers in Ollama. If you have a GPU, confirm Ollama is actually using it:

# Check GPU usage in Ollama
# (--verbose prints token timings; `ollama ps` shows the GPU/CPU split)
ollama run mistral:7b-instruct-v0.3-q4_K_M --verbose
ollama ps

# Force GPU layers via a Modelfile, then rebuild the model:
# Modelfile
FROM mistral:7b-instruct-v0.3-q4_K_M
PARAMETER num_gpu 35
PARAMETER num_thread 8
# ollama create mistral-gpu -f Modelfile

Common Mistakes With OpenClaw Local LLM Setups

  • Using a base model instead of an instruct model — base models don't follow the chat format OpenClaw uses. Always use the -instruct or -chat variant of any model.
  • Setting temperature too high — agent pipelines need consistent, predictable output. Keep temperature between 0.1 and 0.3 for tool-using agents. High temperature causes JSON formatting errors in tool calls.
  • Not testing tool calling before deploying — send a test message that requires a tool call before putting any agent into production. Silent tool call failures are the hardest category of local LLM bugs to diagnose.
  • Ignoring VRAM fragmentation — if you're running other GPU workloads alongside Ollama, VRAM fragmentation can cause OOM errors mid-inference. Dedicate a machine or partition GPU resources explicitly.
  • Not setting num_ctx in Ollama — Ollama defaults to a 2048-token context window. Agent system prompts alone often exceed this. Set num_ctx: 8192 in your Modelfile or Ollama config to avoid silent truncation.
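For that last pitfall, the fix in Ollama is a short Modelfile (`PARAMETER num_ctx` is standard Modelfile syntax; the model tag here simply mirrors the config example earlier in this guide):

```
# Modelfile: raise Ollama's 2048-token default context
FROM mistral:7b-instruct-v0.3-q4_K_M
PARAMETER num_ctx 8192
```

Build it with `ollama create mistral-agent -f Modelfile`, then point the `model` field in your OpenClaw config at the new `mistral-agent` name.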

Frequently Asked Questions

Can OpenClaw run completely offline with a local LLM?

Yes. OpenClaw supports fully offline operation when paired with a local model server like Ollama or LM Studio. Once your model is downloaded and your config points to the local endpoint, no internet connection is required for agent inference. Your data never leaves the machine.

What hardware do I need to run a local LLM with OpenClaw?

For a 7B model, 8GB RAM is the floor and 16GB is comfortable — expect 5–15 tokens per second on CPU. For a 13B model, 32GB RAM or a GPU with 12GB VRAM is recommended. Apple Silicon Macs with 32GB unified memory handle 13B models well via Metal acceleration.

Which local models work best with OpenClaw agents?

Mistral 7B Instruct and Llama 3 8B Instruct are the most reliable starting points as of early 2025. Both follow instruction formats correctly, handle multi-turn conversation well, and produce consistent JSON outputs — which OpenClaw agents depend on for tool calls.

How do I tell OpenClaw to use a local model instead of a cloud API?

Set the model provider to openai-compatible in your openclaw config, point base_url at your local server (e.g., http://localhost:11434/v1), set model to your local model name, and use any string as the api_key since local servers don't enforce authentication.

Will tool calling work with local models in OpenClaw?

It depends on the model. Mistral 7B Instruct v0.3 and Llama 3 models support function calling natively. Older models may produce malformed JSON for tool calls. Always test with a simple tool call before deploying a local model agent to production workloads.

How do I improve response speed with a local LLM in OpenClaw?

Use GPU acceleration if available — even a mid-range GPU cuts inference time dramatically. Reduce context window size in your Ollama or LM Studio settings. Use a quantized model (Q4_K_M is the sweet spot for quality vs speed). Keep system prompts short to reduce prefill time.

You now have everything needed to run OpenClaw agents entirely on local hardware — model selection criteria, hardware benchmarks, the exact config block, and the performance tuning steps that most guides skip. The difference between a local setup that frustrates you and one that you trust for real workloads is almost always one of the config details above.

Start with Ollama and Mistral 7B Instruct. Get one agent working end-to-end. Then expand from there. The full setup takes a few hours, and from that point every token is yours.

J. Donovan
Technical Writer

J. Donovan has documented local LLM deployments across air-gapped enterprise environments, home lab setups, and edge devices. Has benchmarked over 40 model variants against OpenClaw agent pipelines and maintains the community model compatibility matrix.
