- DeepSeek V3 at $0.27/million input tokens is the best-value capable cloud model for most agent tasks as of early 2025
- GPT-4o mini at $0.15/million input tokens has excellent function calling support at minimal cost
- Task-based routing — sending simple tasks to cheap models and complex ones to premium models — cuts costs 60–80% without quality loss
- Local models via Ollama eliminate API costs entirely for workloads where hardware is available
- Never use a frontier model for classification, routing, or templated extraction — these tasks don't need it
Running OpenClaw on Claude 3.5 Sonnet for everything costs $60–$150/month with moderate usage. Running the right model for each task type costs $5–$15/month for the same workload. Same outputs. Different bills.
The performance gap between a $3/million-token model and a $0.15/million-token model is real — but it only matters for a small subset of tasks. Route correctly and you pay for premium reasoning only when you actually need it.
Model Cost Ranking (2025)
| Model | Input $/1M | Output $/1M | Function Calls | Best For |
|---|---|---|---|---|
| Ollama local | $0 | $0 | Varies | All tasks (with hardware) |
| Gemini 1.5 Flash | $0.075 | $0.30 | Yes | High-volume, long context |
| GPT-4o mini | $0.15 | $0.60 | Yes | Routing, extraction, classification |
| DeepSeek V3 | $0.27 | $1.10 | Yes | Most agent tasks, coding |
| Claude 3.5 Haiku | $0.80 | $4.00 | Yes | Tasks needing Claude quality |
| Claude 3.5 Sonnet | $3.00 | $15.00 | Yes | Complex reasoning only |
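To see how these per-token prices turn into monthly bills, here is a minimal estimator using the prices from the table above. The token volumes in the example are illustrative assumptions, not measurements:

```python
# Estimate monthly spend from the per-token prices in the table above.
PRICES = {  # model: (input $/1M tokens, output $/1M tokens)
    "gemini-1.5-flash": (0.075, 0.30),
    "gpt-4o-mini": (0.15, 0.60),
    "deepseek-chat": (0.27, 1.10),
    "claude-3-5-haiku": (0.80, 4.00),
    "claude-3-5-sonnet": (3.00, 15.00),
}

def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost for a month's traffic on a single model."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# An assumed moderate agent workload: 20M input / 4M output tokens per month.
for model in PRICES:
    print(f"{model}: ${monthly_cost(model, 20_000_000, 4_000_000):.2f}")
```

At that assumed volume, Sonnet-for-everything lands around $120/month while DeepSeek V3 comes in under $10, which is the gap described in the introduction.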
DeepSeek V3: The Best Value Cloud Model
DeepSeek V3 is the model we use as a default for most agent tasks in 2025. At $0.27/million input tokens it's among the cheapest cloud options, but its performance on coding, multi-step reasoning, and tool use is genuinely competitive with models costing 10x more.
```yaml
model:
  provider: deepseek
  name: deepseek-chat
  api_key: "${DEEPSEEK_API_KEY}"
  base_url: "https://api.deepseek.com/v1"
```
DeepSeek V3 handles function calling reliably, which matters for tool-using agents. Where it underperforms: nuanced judgment tasks that benefit from RLHF-heavy training (safety-sensitive content, complex ethical reasoning). For most builder use cases — data processing, automation, coding assistance — it delivers.
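DeepSeek exposes an OpenAI-compatible chat completions API, so a tool-enabled request is an ordinary payload pointed at its endpoint. Here is a stdlib-only sketch; the `get_weather` tool is a hypothetical example, and the request is only sent when `DEEPSEEK_API_KEY` is set:

```python
import json
import os
import urllib.request

# DeepSeek's OpenAI-compatible chat completions endpoint.
DEEPSEEK_URL = "https://api.deepseek.com/v1/chat/completions"

# A hypothetical example tool, declared in OpenAI function-calling format.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def build_request(user_message: str) -> dict:
    """Payload for a single tool-enabled chat completion."""
    return {
        "model": "deepseek-chat",
        "messages": [{"role": "user", "content": user_message}],
        "tools": TOOLS,
    }

if os.environ.get("DEEPSEEK_API_KEY"):  # only send when a key is configured
    req = urllib.request.Request(
        DEEPSEEK_URL,
        data=json.dumps(build_request("What's the weather in Oslo?")).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['DEEPSEEK_API_KEY']}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"])
```

Because the payload shape matches OpenAI's, switching this agent between DeepSeek and GPT-4o mini is a matter of changing the URL, key, and model name.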
GPT-4o Mini: Best for Function-Heavy Agents
GPT-4o mini at $0.15/million input tokens is OpenAI's cheapest capable model. Function calling works reliably, latency is low, and the model follows structured output formats consistently. These properties make it the best choice for agents that execute many tool calls per session.
The value here: frontier-grade function calling at budget-model prices.
```yaml
model:
  provider: openai
  name: gpt-4o-mini
  api_key: "${OPENAI_API_KEY}"
```
Where GPT-4o mini falls short: long-context reasoning, complex multi-step planning, and tasks requiring broad world knowledge. For a simple agent that routes tasks, calls APIs, and formats responses, it's more than adequate.
Gemini 1.5 Flash: Lowest Cost Per Token
Gemini 1.5 Flash is the cheapest option on a per-token basis at $0.075/million input tokens — half the cost of GPT-4o mini. Its 1 million token context window is a genuine advantage for agents processing large documents.
Gemini Flash's function calling is slightly less reliable than GPT-4o or DeepSeek V3 on complex nested tool calls. For simple single-tool agents (search, send email, update database), it works well. For agents with 5+ tools and complex branching logic, test thoroughly before committing.
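One practical way to test before committing: check each model response's tool call against your declared schemas over a batch of prompts, and count the failures. A minimal validator sketch; the `send_email` schema is a hypothetical example:

```python
# A hypothetical example schema in OpenAI function-calling format.
SCHEMA = {
    "function": {
        "name": "send_email",
        "parameters": {
            "type": "object",
            "properties": {
                "to": {"type": "string"},
                "subject": {"type": "string"},
            },
            "required": ["to"],
        },
    },
}

def valid_tool_call(call: dict, schema: dict) -> bool:
    """True if the call names the right function, supplies every
    required argument, and uses no argument outside the schema."""
    fn = schema["function"]
    if call.get("name") != fn["name"]:
        return False
    props = fn["parameters"]["properties"]
    required = fn["parameters"].get("required", [])
    args = call.get("arguments", {})
    return all(k in args for k in required) and all(k in props for k in args)
```

Run a few dozen representative prompts through each candidate model and compare the validator's pass rate before locking in the cheapest option.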
Local Models: Zero Cost, Hardware Required
Ollama running Llama 3.1 8B or Mistral 7B costs nothing per call. The only constraints are hardware and latency.
Here's what we've seen consistently: on an Apple M2 MacBook Pro, Llama 3.1 8B generates roughly 40–60 tokens per second. Fast enough for interactive use, adequate for background automation. On an 8GB Raspberry Pi 5, the same model (4-bit quantized) generates 3–8 tokens per second, which is usable for non-latency-sensitive background tasks.
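Those throughput numbers translate directly into response latency. A back-of-envelope helper (it ignores prompt-processing time, which adds a few extra seconds on small hardware):

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Rough time to generate a reply at a given decode speed.
    Prompt-processing time is not included."""
    return output_tokens / tokens_per_second

# A 300-token reply at the speeds quoted above:
for label, tps in [("M2 MacBook Pro", 50.0), ("Raspberry Pi 5", 5.0)]:
    print(f"{label}: ~{generation_seconds(300, tps):.0f}s")
```

A 300-token reply takes a handful of seconds on the laptop and around a minute on the Pi, which is why the Pi only suits background work.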
Local models worth running in 2025:
- Llama 3.1 8B — Best general-purpose local model. Good instruction following and function calling.
- Mistral 7B Instruct — Fast, reliable for structured tasks, smaller memory footprint than Llama 8B.
- Llama 3.1 70B — Near-frontier quality if you have the hardware (requires 40GB+ RAM or a 40GB+ VRAM GPU).
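The hardware figures above follow from a simple rule of thumb: weight memory is parameter count times bits per weight, plus overhead for the KV cache and runtime. A sketch, where the 20% overhead factor is an assumption:

```python
def model_memory_gb(params_billions: float, bits_per_weight: int,
                    overhead: float = 1.2) -> float:
    """Approximate RAM/VRAM to run a model: quantized weight size
    plus ~20% for KV cache and runtime (the 20% is an assumption)."""
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb * overhead

for params in (7, 8, 70):
    print(f"{params}B at 4-bit: ~{model_memory_gb(params, 4):.1f} GB")
```

An 8B model at 4-bit comes out around 5 GB, which fits consumer laptops; 70B at 4-bit lands right at the 40GB+ figure quoted above.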
Task-Based Model Routing
This is where the real savings happen. Most agents perform a mix of simple and complex tasks. Simple tasks don't need a frontier model. Here's how to route in OpenClaw:
```yaml
agents:
  - name: router
    model: gpt-4o-mini        # cheap, fast, reliable routing
    role: "Route incoming tasks to the correct specialist agent"
  - name: analyst
    model: deepseek-chat      # capable, affordable for analysis
    role: "Perform deep research and multi-step analysis"
  - name: writer
    model: claude-3-5-haiku   # better writing quality
    role: "Draft and edit all output content"
```
The routing agent on GPT-4o mini handles 80% of requests — classification, simple Q&A, formatted extraction. Only tasks that require genuine reasoning escalate to DeepSeek or Claude. This pattern consistently reduces LLM costs by 60–75% compared to running all tasks through a single premium model.
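The escalation logic can be sketched as follows. The keyword heuristic here is a deliberately crude stand-in for illustration; in the config above, the router agent on GPT-4o mini makes this decision itself:

```python
# Crude stand-in for the router's decision: escalate to the pricier
# model only when the task looks reasoning-heavy.
COMPLEX_MARKERS = ("analyze", "research", "compare", "plan", "why")

def pick_model(task: str) -> str:
    """Cheap model by default; escalate on signs of genuine reasoning."""
    if any(marker in task.lower() for marker in COMPLEX_MARKERS):
        return "deepseek-chat"   # escalate: reasoning-heavy task
    return "gpt-4o-mini"         # default: cheap, fast, reliable

print(pick_model("Extract the invoice number from this email"))
print(pick_model("Research our competitors and compare pricing tiers"))
```

The important property is the default direction: everything starts cheap, and only flagged tasks pay premium rates.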
Common Mistakes
Using a frontier model for JSON extraction is the most common waste. Extracting structured data from a document — pulling fields into a schema — does not require Claude 3.5 Sonnet. GPT-4o mini or even a local 8B model does this reliably at a fraction of the cost.
Ignoring output token costs is the second mistake. Pricing tables show input costs prominently, but output tokens often cost 4–6x more per token. An agent that generates verbose outputs is burning most of its budget on the output side. Prompt your agents to be concise. Explicitly instruct them to avoid preamble and unnecessary explanation.
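To see why output tokens dominate, here is the arithmetic at the DeepSeek V3 prices from the table above, comparing a verbose reply with a concise one (the token counts are illustrative):

```python
# DeepSeek V3 prices from the table above: $0.27/1M input, $1.10/1M output.
IN_PRICE, OUT_PRICE = 0.27, 1.10

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single call."""
    return (input_tokens * IN_PRICE + output_tokens * OUT_PRICE) / 1_000_000

verbose = call_cost(500, 1200)   # reply padded with preamble and recap
concise = call_cost(500, 250)    # same content, instructed to be terse

# Share of the verbose call's cost spent on the output side:
output_share = (1200 * OUT_PRICE) / (500 * IN_PRICE + 1200 * OUT_PRICE)
print(f"verbose ${verbose:.6f}, concise ${concise:.6f}, "
      f"output share {output_share:.0%}")
```

At these token counts, over 90% of the verbose call's cost is output, and the concise version costs roughly a third as much. A one-line "be concise" instruction pays for itself immediately.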
Not logging actual token usage means you're flying blind. Enable usage logging from day one. You'll often discover that one particular task type is consuming 60% of your total tokens — and it's almost always a task you could route to a cheaper model.
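A usage log doesn't need to be elaborate. Here is a minimal sketch that tallies tokens by task type and surfaces the heaviest one; the task-type names are hypothetical:

```python
from collections import defaultdict

# Running tally of tokens per task type.
usage = defaultdict(lambda: {"input": 0, "output": 0})

def log_usage(task_type: str, input_tokens: int, output_tokens: int) -> None:
    """Record one call's token counts under its task type."""
    usage[task_type]["input"] += input_tokens
    usage[task_type]["output"] += output_tokens

def heaviest_task() -> str:
    """Task type consuming the most total tokens so far."""
    return max(usage, key=lambda t: usage[t]["input"] + usage[t]["output"])

# Hypothetical task types and counts:
log_usage("classify_ticket", 120_000, 15_000)
log_usage("summarize_thread", 400_000, 90_000)
log_usage("deep_research", 150_000, 60_000)
print(heaviest_task())
```

Whatever `heaviest_task()` returns is the first candidate for routing to a cheaper model.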
Frequently Asked Questions
What is the cheapest LLM that works well with OpenClaw?
Gemini 1.5 Flash is the cheapest cloud model per token at $0.075/million input tokens, but DeepSeek V3 at $0.27/million input tokens is the best value across the full range of agent tasks, including tool use and multi-step reasoning. For zero cost, Ollama running Llama 3.1 8B locally is the cheapest option overall: no per-call fee, hardware required.
Is DeepSeek safe to use with sensitive data?
DeepSeek is a Chinese company and its data handling policies differ from US providers. For sensitive data — customer information, proprietary business content — use Anthropic, OpenAI, or a local model instead. DeepSeek is fine for public content processing, research tasks, and any workload that doesn't include personally identifiable or confidential information.
Can I use different models for different agent tasks?
Yes. OpenClaw's model routing lets you assign different models to different agents or even different task types within a single agent. Route simple classification to a cheap model, complex reasoning to a premium one. This is the single most effective way to reduce costs while maintaining quality where it matters.
How much cheaper is GPT-4o mini vs GPT-4o?
GPT-4o mini costs $0.15 per million input tokens versus $2.50 for GPT-4o — roughly 16x cheaper. For many agent tasks, the quality difference is minimal. GPT-4o mini struggles with very complex multi-step reasoning and tasks requiring deep domain expertise, but handles routing, summarization, and standard Q&A well.
Does using a cheaper model hurt agent reliability?
It depends on the task. Cheaper models are reliable for structured tasks with clear inputs and outputs — classification, extraction, routing, summarization. They are less reliable for open-ended reasoning, ambiguous instructions, and tasks requiring strong world knowledge. Match the model capability to the task complexity rather than applying one model across all tasks.
What's the cheapest model that supports function calling?
Gemini 1.5 Flash is the cheapest at $0.075/million input tokens (and is also available on a free tier), though it is slightly less reliable on complex nested tool calls. GPT-4o mini at $0.15/million input tokens and DeepSeek V3 at $0.27/million input tokens both support function calling reliably. All three work for tool-using agents. Avoid open-weight models under 7B parameters for function calling; reliability drops significantly at that scale.