
OpenClaw + Ollama: Run Fully Private AI Agents Offline

Zero API costs, zero data leaving your machine, zero cloud dependency. Pairing OpenClaw with Ollama gives you a fully sovereign agent stack that runs on your hardware — and the setup takes under 20 minutes.

S. Rivera
AI Infrastructure Lead
Jan 22, 2025
Key Takeaways
  • OpenClaw connects to Ollama through its OpenAI-compatible endpoint — no special plugin or patch required
  • Set base_url to http://localhost:11434/v1 and pick any model Ollama has pulled
  • Llama 3.1 8B handles most agent tasks well; move to 14B+ models for complex tool-calling chains
  • Your prompts, context, and responses never leave your machine — full data sovereignty
  • Each OpenClaw agent YAML can point to a different Ollama model, letting you mix model sizes per task

Most AI agent setups have a silent tax: every token you send hits an external API, logs somewhere, costs something. Three deployments I managed for clients in regulated industries couldn't accept that. The answer every time was the same — Ollama running locally, OpenClaw sitting on top, the whole stack air-gapped from the internet. Here's exactly how to build it.

Why Pair OpenClaw With Ollama

Ollama solves the hardest part of running local models: it packages model weights, inference runtime, and a clean API into one tool you install like any other application. Pull a model with one command. Get an OpenAI-compatible API at localhost:11434. Done.

OpenClaw connects to any OpenAI-compatible endpoint. That means it speaks to Ollama natively — no translation layer, no compatibility shim. The models Ollama supports include Llama 3.1, Mistral, Qwen2.5, Phi-3, Gemma 2, and dozens more. As of early 2025, the Ollama model library has grown to over 100 models optimized for consumer and workstation hardware.

The practical benefits of this combination are real. No per-token costs means you can run agents that process long documents without watching a billing meter. No outbound requests means HIPAA, GDPR, and internal data classification rules are far easier to satisfy. No API rate limits means your agents can run at whatever speed your hardware allows.

💡
Hardware Reality Check

A 7B model runs comfortably on 8GB VRAM. A 13B model needs 10–12GB. A 70B model requires 40GB+ (or quantized to Q4 with ~24GB). Apple Silicon Macs handle these ranges well through unified memory. CPU-only inference works but expect 15–45 second response times for 7B models.
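As a rough rule of thumb, weight memory is parameter count times bits per weight, plus headroom for the KV cache and activations. A back-of-the-envelope sketch (the 20% overhead factor is an assumption, and real usage grows with context length):

```python
def estimate_vram_gb(params_billions: float, bits_per_weight: int = 4,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weight bytes times an overhead factor
    (~20%) for KV cache and activations. A heuristic, not a guarantee."""
    weight_gb = params_billions * bits_per_weight / 8  # 1B params at 8 bits ~= 1 GB
    return round(weight_gb * overhead, 1)

# 8B model at Q4 quantization: ~4.8 GB, which fits the 6GB
# minimum in the model table below with a little headroom
print(estimate_vram_gb(8, bits_per_weight=4))  # 4.8
```

Treat the output as a floor, not a budget: the runtime itself, the OS, and anything else on the GPU need room too.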

Install Ollama and Pull Your First Model

Ollama installs as a single binary on macOS, Linux, and Windows. On macOS, the recommended method is the official installer from ollama.com. On Linux, one shell command handles the full install.

# Linux install
curl -fsSL https://ollama.com/install.sh | sh

# Verify the service is running
ollama --version

# Pull Llama 3.1 8B (4.7GB download)
ollama pull llama3.1:8b

# Test it works
ollama run llama3.1:8b "Say hello in one sentence"

Once Ollama is running, it automatically starts an HTTP server at http://localhost:11434. The /v1/chat/completions endpoint follows the OpenAI API schema exactly. That's the endpoint OpenClaw will target.

You don't need to keep a terminal open with ollama serve — on macOS and Windows, Ollama runs as a background service after install. On Linux, enable the systemd service with sudo systemctl enable ollama --now to start it automatically on boot.
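Because the /v1/chat/completions endpoint follows the OpenAI schema, any HTTP client can talk to it. A minimal Python sketch using only the standard library; the request is built locally and would only succeed when sent if Ollama is actually running:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for Ollama's /v1 endpoint."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("llama3.1:8b", "Say OK")
# urllib.request.urlopen(req) would return the completion JSON,
# but only with Ollama running locally.
print(json.loads(req.data)["model"])  # llama3.1:8b
```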

Choosing the Right Model for Your Agent Tasks

Not all models perform equally on agent tasks. Agent workflows require instruction-following, tool-calling, multi-step reasoning, and output formatting. General chat models don't always handle these well.

Model          Size      Best For                             Min VRAM
llama3.1:8b    8B        General agent tasks, fast responses  6GB
mistral:7b     7B        Instruction following, low latency   5GB
qwen2.5:14b    14B       Tool calling, complex reasoning      10GB
llama3.1:70b   70B (Q4)  High-stakes reasoning, analysis      24GB
phi3:mini      3.8B      Simple classification, extraction    3GB

The pattern we've seen consistently: start with llama3.1:8b for your first integration. If tool-calling reliability is below 85% on your test prompts, move to qwen2.5:14b. That jump in size produces a significant improvement in structured output and function-call accuracy without requiring enterprise-grade hardware.
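That decision rule fits in a few lines of Python. The 85% threshold and the two model names come from the guidance above; the scoring harness itself (how you judge each tool call as a pass or fail) is hypothetical:

```python
def pick_model(tool_call_results: list, threshold: float = 0.85) -> str:
    """Stay on llama3.1:8b unless tool-call reliability on your own
    test prompts drops below the threshold, then step up to qwen2.5:14b."""
    success_rate = sum(tool_call_results) / len(tool_call_results)
    return "llama3.1:8b" if success_rate >= threshold else "qwen2.5:14b"

# 17 of 20 test prompts produced a valid function call -> 85%, keep the 8B
print(pick_model([True] * 17 + [False] * 3))  # llama3.1:8b
# 15 of 20 -> 75%, below threshold, step up
print(pick_model([True] * 15 + [False] * 5))  # qwen2.5:14b
```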

Configure OpenClaw to Use Ollama

OpenClaw's model configuration lives in each agent YAML file. The key is setting model_provider to openai-compatible and pointing base_url at Ollama's API endpoint.

# agents/researcher.yaml
name: researcher
model_provider: openai-compatible
base_url: http://localhost:11434/v1
api_key: ollama          # Any non-empty string works; Ollama ignores auth
model: llama3.1:8b
temperature: 0.3
max_tokens: 2048

system_prompt: |
  You are a research agent. Your job is to gather, synthesize,
  and summarize information on topics provided by the user.
  Always cite your sources and flag uncertainty explicitly.

The api_key field must be present — OpenClaw will error if it's missing — but Ollama ignores its value. Any non-empty string works. We use ollama as a reminder of what the provider is.

If you're running Ollama on a different machine (a dedicated inference server on your local network), replace localhost with that machine's IP address. Everything else stays the same. This is how teams run a single Ollama server and point multiple OpenClaw instances at it.
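As a sketch, the same agent file pointed at a shared inference server changes only in its base_url (the address below is a placeholder for your server's IP):

```yaml
# agents/researcher.yaml — pointing at a shared Ollama server
name: researcher
model_provider: openai-compatible
base_url: http://192.168.1.50:11434/v1   # placeholder address
api_key: ollama
model: llama3.1:8b
```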

⚠️
Model Name Must Match Exactly

OpenClaw passes the model name directly to Ollama's API. If you write llama3.1 but pulled llama3.1:8b, the request fails with a 404. Run ollama list and copy the exact model name including the tag.
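A pre-flight check along these lines can catch the mismatch before OpenClaw makes any request: parse the table `ollama list` prints (in practice you would capture it with subprocess) and verify the exact name, tag included. The helper names and the sample sizes below are illustrative:

```python
def installed_models(listing: str) -> set:
    """Parse `ollama list` output; the first column is NAME, tag included."""
    lines = listing.strip().splitlines()[1:]  # skip the header row
    return {line.split()[0] for line in lines}

def check_model(name: str, listing: str) -> None:
    """Raise before inference time if the exact model:tag isn't pulled."""
    models = installed_models(listing)
    if name not in models:
        raise ValueError(f"'{name}' not pulled; installed: {sorted(models)}")

# Sample text in the shape `ollama list` prints (IDs and sizes illustrative)
sample = """NAME            ID            SIZE    MODIFIED
llama3.1:8b     42182419e950  4.7 GB  2 days ago
qwen2.5:14b     7cdf5a0187d5  9.0 GB  5 hours ago
"""
check_model("llama3.1:8b", sample)   # passes
# check_model("llama3.1", sample)    # raises: missing tag, exactly the 404 case
```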

Test the Connection End-to-End

Before running a full agent workflow, confirm the connection works at each layer.

First, verify Ollama's API is accessible directly:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Say OK"}]
  }'

You should see a JSON response with the model's output. If you get a connection refused error, Ollama isn't running — start it with ollama serve or check the systemd service status.

Next, start OpenClaw and send a test message to your configured agent. Watch the OpenClaw logs — you should see the request forwarded to localhost:11434 and a response returned. Response time varies from 2–3 seconds (GPU, 8B model) to 30+ seconds (CPU-only, larger models).

Common Mistakes With OpenClaw + Ollama

  • Ollama not running when OpenClaw starts — OpenClaw tries to connect at startup. If Ollama isn't up yet, the agent initialization fails. Always start Ollama before OpenClaw, or use the systemd service to ensure Ollama starts at boot.
  • Pulling a model but not verifying it loaded — run ollama list after pulling. A download that was interrupted leaves a partial model that will fail at inference time with a cryptic error.
  • Setting temperature too high for tool-calling agents — values above roughly 0.5 make models far more likely to hallucinate function-call arguments. For tool-using agents, keep temperature between 0.1 and 0.3.
  • Using a model too small for the task — 3B models struggle with multi-step agent chains. They lose context, repeat steps, and fail to format output correctly. 7B is the practical minimum for agent workflows.
  • Forgetting to open the Ollama port on a networked setup — if running Ollama on a separate machine, you must configure Ollama to listen on all interfaces (OLLAMA_HOST=0.0.0.0) and open port 11434 in the firewall.
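Putting several of these points together, a tool-calling agent config might look like the sketch below. The agent name and prompt are illustrative; the fields mirror the researcher example earlier:

```yaml
# agents/tool_runner.yaml — settings per the checklist above
name: tool_runner
model_provider: openai-compatible
base_url: http://localhost:11434/v1
api_key: ollama
model: qwen2.5:14b       # 14B class for reliable tool calling
temperature: 0.2         # low temperature keeps function arguments grounded
max_tokens: 2048

system_prompt: |
  You are a tool-using agent. Call the provided functions with
  exact, valid arguments. Never invent parameter values.
```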

Frequently Asked Questions

Does OpenClaw work with Ollama out of the box?

OpenClaw supports Ollama natively through its OpenAI-compatible endpoint. Set your model provider URL to http://localhost:11434/v1 and choose any model Ollama has pulled. No plugins or patches required — it works with any OpenClaw version from early 2025 onward.

Which Ollama models work best with OpenClaw agents?

Llama 3.1 8B and Mistral 7B are strong starting points for most agent tasks. For tool-calling and multi-step reasoning, Qwen2.5 14B and Llama 3.1 70B outperform smaller models. Avoid models below 7B for agent workflows — instruction-following degrades significantly below that threshold.

Can I use OpenClaw with Ollama on a machine without a GPU?

Yes, but performance will be slow. CPU-only inference on a 7B model takes 10–30 seconds per response depending on hardware. For production or interactive agents, a GPU with at least 8GB VRAM is recommended. Apple Silicon Macs run Ollama well via Metal GPU acceleration.

How do I set the Ollama endpoint in OpenClaw config?

In your agent YAML, set model_provider to openai-compatible, base_url to http://localhost:11434/v1, and model to the exact Ollama model name. Set api_key to any non-empty string — Ollama ignores it but OpenClaw requires it to be present.

Does data sent to Ollama leave my machine?

No. Ollama runs inference entirely locally. Your prompts, agent context, and responses never leave the machine. This is the primary reason teams choose Ollama — full data sovereignty with no API costs, no rate limits, and no dependency on external service availability.

How do I switch between Ollama models per agent in OpenClaw?

Each agent YAML file has its own model configuration block. Set a different model name per agent — one can use llama3.1:8b for fast drafts while another uses qwen2.5:14b for reasoning tasks. Both can point to the same Ollama server at localhost:11434.
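As a sketch, two such agent files might differ only in their model line (file names, agent names, and the model split are illustrative):

```yaml
# agents/drafter.yaml — fast drafts on the small model
name: drafter
model_provider: openai-compatible
base_url: http://localhost:11434/v1
api_key: ollama
model: llama3.1:8b
---
# agents/reasoner.yaml — heavier multi-step reasoning
name: reasoner
model_provider: openai-compatible
base_url: http://localhost:11434/v1
api_key: ollama
model: qwen2.5:14b
```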


S. Rivera architects AI agent infrastructure for teams with strict data residency requirements. Has deployed OpenClaw + Ollama stacks for clients in healthcare, legal, and financial services — all running fully local inference on-premise. Specializes in making open-source LLMs production-reliable.
