- No GPU required for cloud API use — OpenClaw's gateway and agent runtime run fine on any modern CPU with 4 GB RAM
- Local model inference via Ollama requires VRAM: 8 GB for 7B models, 16 GB for 13B models, 40 GB+ for 70B models
- CPU fallback works but generates 3–8 tokens/sec on 7B models — compare that to 50–120 tokens/sec on a mid-range GPU
- macOS uses Metal, NVIDIA uses CUDA, AMD on Linux uses ROCm — all are supported through Ollama
- For a $400–$600 budget, the RTX 4060 Ti 16 GB gives the best VRAM-per-dollar ratio for local model work
OpenClaw running against GPT-4 or Claude needs zero GPU. Full stop. The question only becomes relevant the moment you want to run local models — Llama 3, Mistral, Qwen, DeepSeek — through Ollama. And the minute you go local, VRAM becomes the only number that matters.
When a GPU Is Required vs. Optional
The OpenClaw gateway itself is a Node.js process. It handles routing, memory, channel management, and tool execution. It uses system RAM, not VRAM, and its footprint is small — 200–400 MB of RAM under normal load. A Raspberry Pi can technically run the gateway. The GPU question is entirely about the model backend.
Two model backend paths exist in a typical OpenClaw setup:
- Cloud API (OpenAI, Anthropic, Groq, Together) — inference happens on the provider's servers. Your machine sends the prompt over HTTPS and receives the response. Zero local GPU needed. This is the default path for most new OpenClaw users.
- Local model via Ollama — inference runs on your hardware. Ollama manages the model loading, quantization, and hardware acceleration. This is where GPU requirements apply.
You've probably seen rigs advertised as "AI-ready." For cloud API work you don't need one. Save that budget for local model hardware if you actually need it.
The cleanest path is to get OpenClaw running against a cloud API first, then add Ollama for local models once you've validated your agent setup. This separates the hardware problem from the software configuration problem — easier to debug both.
VRAM Requirements by Model Size
VRAM is the hard constraint for local model inference. The entire model — or the active layers of it — must fit in VRAM to get GPU-accelerated speeds. Overflow spills to system RAM and the performance drops off a cliff. Here's the practical VRAM picture by model family, using Q4_K_M quantization (the best quality-to-size tradeoff for most use cases as of early 2025).
| Model Size | VRAM (Q4_K_M) | Minimum Card | Recommended Card |
|---|---|---|---|
| 3B | 2–3 GB | GTX 1060 6 GB | RTX 3060 8 GB |
| 7B | 6–8 GB | RTX 3060 8 GB | RTX 4060 Ti 16 GB |
| 13B | 9–12 GB | RTX 3080 10 GB | RTX 4080 16 GB |
| 34B | 20–24 GB | RTX 3090 24 GB | RTX 4090 24 GB |
| 70B | 40–48 GB | 2× RTX 3090 | A6000 48 GB or M2 Ultra |
The 70B row is where most people hit the wall. Running Llama 3 70B locally requires either a professional GPU with 48 GB VRAM, two 3090s bridged together, or Apple Silicon with unified memory (which handles this case differently — more on that in the Apple Silicon guide). For most teams, 13B models hit the sweet spot of capability versus hardware cost.
A common mistake at this point is buying an 8 GB card on the assumption that it will "just run quantized 13B." The math doesn't work. A 13B model at Q4_K_M needs about 9.5 GB at minimum, so an 8 GB card spills layers to system RAM and you get CPU-like speeds on a GPU you paid $400 for.
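The VRAM math can be sketched in a few lines. This is a rough estimate under stated assumptions, not Ollama's own accounting: `estimate_vram_gb` and `fits_in_vram` are hypothetical helpers, the figure of about 4.85 bits per weight for Q4_K_M is an approximation, and the flat 1.5 GB allowance for KV cache and runtime buffers is a guess that grows with context length.

```python
def estimate_vram_gb(params_billion: float,
                     bits_per_weight: float = 4.85,
                     overhead_gb: float = 1.5) -> float:
    """Rough VRAM need in GB: quantized weights plus a flat allowance.

    bits_per_weight ~4.85 approximates Q4_K_M (assumption).
    overhead_gb covers KV cache and runtime buffers (assumption).
    """
    # billions of params * bits per weight / 8 bits per byte = GB of weights
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb


def fits_in_vram(params_billion: float, card_vram_gb: float) -> bool:
    """True if the estimated footprint fits entirely on the card."""
    return estimate_vram_gb(params_billion) <= card_vram_gb


if __name__ == "__main__":
    for size in (7, 13, 34, 70):
        print(f"{size}B: ~{estimate_vram_gb(size):.1f} GB")
```

With these assumptions a 13B model lands a little above 9 GB, which is why `fits_in_vram(13, 8)` is false while `fits_in_vram(13, 16)` is true.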
CPU Fallback: The Real Performance Numbers
Ollama falls back to CPU inference when no compatible GPU is found or when the model exceeds available VRAM. CPU inference works — the output is correct — but the speed is painful for interactive use cases.
Here's what we've measured consistently on modern hardware in early 2025:
- 7B model on 8-core CPU (AMD Ryzen 7 / Intel Core i7): 3–8 tokens per second
- 7B model on RTX 4060 (8 GB VRAM): 60–80 tokens per second
- 7B model on RTX 4090 (24 GB VRAM): 120–150 tokens per second
- 13B model, CPU-only: 2–4 tokens per second
At 5 tokens per second, a 500-word response takes roughly 2 minutes. For a background batch job running overnight, that's acceptable. For an interactive agent handling Telegram messages, it's not. Know your workload before choosing hardware.
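The arithmetic behind that estimate is easy to reproduce for your own workload. The sketch below assumes roughly 4/3 tokens per English word, a common rule of thumb rather than a measured value for any specific tokenizer; `response_seconds` is a hypothetical helper name.

```python
def response_seconds(words: int,
                     tokens_per_sec: float,
                     tokens_per_word: float = 4 / 3) -> float:
    """Estimated generation time for a response of the given length.

    tokens_per_word is a rule-of-thumb assumption for English text.
    """
    return words * tokens_per_word / tokens_per_sec


if __name__ == "__main__":
    # Compare CPU-class and GPU-class throughput for a 500-word reply.
    for label, tps in (("CPU, 5 tok/s", 5), ("GPU, 70 tok/s", 70)):
        print(f"{label}: {response_seconds(500, tps):.0f} s")
```

At 5 tokens per second the 500-word case comes out a little over two minutes; at 70 tokens per second it is around ten seconds.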
If your model barely fits in VRAM, Ollama may load it partially into VRAM and partially into system RAM. Performance in this state is far below full-GPU speed and can approach CPU-only speeds, because the layers left in system RAM run on the CPU. Confirm full VRAM fit using `ollama ps`, whose PROCESSOR column shows the CPU/GPU split for each loaded model.
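The same check can be scripted against Ollama's local HTTP API instead of reading CLI output by eye. This is a minimal sketch, assuming Ollama is listening on its default address and that the `/api/ps` response lists each loaded model with `size` and `size_vram` fields; `gpu_fraction`, `spilled_models`, and `fetch_ps` are hypothetical helper names.

```python
import json
import urllib.request

# Default local Ollama address (assumption: not overridden on your machine).
OLLAMA_PS_URL = "http://localhost:11434/api/ps"


def gpu_fraction(model: dict) -> float:
    """Share of a loaded model's memory that sits in VRAM (0.0 to 1.0)."""
    size = model.get("size", 0)
    return model.get("size_vram", 0) / size if size else 0.0


def spilled_models(ps_payload: dict, threshold: float = 0.999) -> list[str]:
    """Names of loaded models that are not fully resident in VRAM."""
    return [m["name"] for m in ps_payload.get("models", [])
            if gpu_fraction(m) < threshold]


def fetch_ps(url: str = OLLAMA_PS_URL) -> dict:
    """Fetch the list of currently loaded models from a running Ollama."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)
```

With a model loaded, `spilled_models(fetch_ps())` returning an empty list suggests everything is on the GPU; any name in the list is a model that has partly spilled to system RAM.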
Metal vs CUDA vs ROCm: Which Acceleration Stack
Ollama — which powers local model inference in OpenClaw setups — supports three hardware acceleration backends. Which one applies depends entirely on your hardware and OS.
NVIDIA CUDA (Windows, Linux)
CUDA is the most mature local inference stack. NVIDIA GPUs from the GTX 1000 series through the RTX 4000 series all support CUDA inference through Ollama. Drivers must be current — Ollama requires CUDA 11.8 or newer. The RTX 3000 and 4000 series deliver the best token throughput per dollar for CUDA inference.
# Verify CUDA is active in Ollama
ollama run llama3
# Then in another terminal:
ollama ps
# Check the PROCESSOR column: it should read "100% GPU", not a CPU/GPU split
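Before loading a model it can also help to see how much VRAM is actually free on the card. The sketch below shells out to `nvidia-smi` with its CSV query flags and parses the result; `parse_gpu_rows` and `query_gpus` are hypothetical helper names, and the values are reported in MiB.

```python
import subprocess

# nvidia-smi CSV query: one line per GPU, memory values in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.total,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_rows(csv_text: str) -> list[dict]:
    """Parse 'name, total, used' lines into dicts with free memory added."""
    rows = []
    for line in csv_text.strip().splitlines():
        # Split from the right so commas inside a GPU name cannot break parsing.
        name, total, used = (field.strip() for field in line.rsplit(",", 2))
        rows.append({
            "name": name,
            "total_mib": int(total),
            "free_mib": int(total) - int(used),
        })
    return rows


def query_gpus() -> list[dict]:
    """Run nvidia-smi and return one dict per detected NVIDIA GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_rows(out)
```

If `query_gpus()` raises because `nvidia-smi` is missing or fails, that usually points at a driver problem rather than an Ollama problem.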
Apple Metal (macOS)
On Apple Silicon Macs, Ollama uses the Metal GPU framework to accelerate inference. The unique advantage is unified memory — the GPU and CPU share the same memory pool, so a Mac with 96 GB of unified memory can run a 70B model without any GPU/CPU split penalty. More on this in the Apple Silicon guide.
AMD ROCm (Linux)
ROCm support arrived in Ollama in 2024 and works well on Linux with RX 6000 and RX 7000 series cards. Windows ROCm support is still experimental as of early 2025. If you're building on AMD, use Linux — Ubuntu 22.04 is the best-supported platform for ROCm. Performance on RDNA3 cards (RX 7900 XTX at 24 GB VRAM) is competitive with mid-range NVIDIA options.
Hardware Recommendations by Budget
These picks are based on what we've actually tested with OpenClaw and Ollama in early 2025 — not spec-sheet comparisons.
- Under $300: RTX 3060 12 GB (used). Runs 7B models comfortably with 4–6 GB of headroom. The extra 4 GB vs the 8 GB variant matters more than the architecture generation.
- $400–$600: RTX 4060 Ti 16 GB — the best card for 7B and light 13B work at this price. 16 GB VRAM is the key differentiator.
- $700–$900: RTX 4070 Super 12 GB — faster than the 4060 Ti but less VRAM. Prioritize VRAM over compute for model work.
- $1,000+: RTX 4080 Super 16 GB or RX 7900 XTX 24 GB — solid 13B territory, starts touching 34B with heavy quantization.
- No GPU needed: Mac Mini M4 Pro (48 GB) — handles 70B models with unified memory. Different budget category but different capability entirely.
Common GPU Setup Mistakes
- Buying 8 GB VRAM for 13B models — the VRAM math doesn't work. 13B at Q4 needs 9–10 GB minimum. Either go 16 GB or stay on 7B models.
- Not checking Ollama GPU utilization: run `ollama ps` after loading a model. If the PROCESSOR column shows anything other than 100% GPU, you're spilling to RAM.
- Using ROCm on Windows: AMD GPU support on Windows through Ollama is unreliable in early 2025. Use Linux for AMD inference.
- Running multiple large models simultaneously — each loaded model occupies VRAM. Running two 7B models on an 8 GB card causes constant swapping. Stick to one model loaded at a time.
- Ignoring context window VRAM cost — a 7B model at 8k context uses more VRAM than the same model at 2k context. Factor in your typical context window when calculating VRAM requirements.
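The context window cost in the last item comes mostly from the KV cache, which grows linearly with context length. Here is a back-of-the-envelope sketch, assuming a 16-bit cache and the standard formula of one key and one value vector per layer per token; `kv_cache_gib` is a hypothetical helper, the layer and head counts in the example are architecture assumptions you should check against your model, and a runtime may allocate differently or quantize the cache.

```python
def kv_cache_gib(context_tokens: int,
                 n_layers: int,
                 n_kv_heads: int,
                 head_dim: int,
                 bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GiB for one sequence.

    bytes_per_value=2 assumes a 16-bit cache (assumption).
    """
    # Keys and values: 2 tensors per layer, each n_kv_heads * head_dim wide.
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token_bytes / 2**30


if __name__ == "__main__":
    # Assumed shape for a 7B model with full multi-head attention:
    # 32 layers, 32 KV heads, head dimension 128.
    for ctx in (2048, 8192):
        print(f"{ctx} tokens: {kv_cache_gib(ctx, 32, 32, 128):.2f} GiB")
```

Under those assumptions, going from 2k to 8k context quadruples the cache. Models that use grouped-query attention have fewer KV heads and pay proportionally less.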
Frequently Asked Questions
Do I need a GPU to run OpenClaw?
No GPU is required if you use a cloud API like OpenAI or Anthropic — OpenClaw routes those requests over the network. A GPU is only needed when running local models via Ollama. Without one, local inference falls back to CPU, which works but is significantly slower for models above 7B parameters.
How much VRAM do I need for a 7B model?
A quantized 7B model at Q4_K_M fits in 6–8 GB of VRAM. 8 GB is the practical minimum for comfortable headroom. The RTX 3060 12 GB and RTX 4060 8 GB are common entry-level choices, with the 12 GB variant giving better breathing room for longer context windows.
Can I run a 13B model on 8 GB VRAM?
Only with heavy quantization at Q3 or lower. A 13B model at Q4_K_M needs roughly 9–10 GB, so it won't fit on 8 GB without quality degradation. A 16 GB VRAM card like the RTX 4060 Ti 16 GB or RTX 4080 is the right match for 13B models at reasonable quality settings.
Does OpenClaw support AMD GPUs?
Yes, via ROCm on Linux. Ollama added ROCm support in 2024 and it works well with RX 6000 and RX 7000 series cards. Windows ROCm support is more limited — Linux is strongly recommended for AMD GPU inference. RDNA2 and RDNA3 architectures are the best-supported AMD options.
What is CPU fallback mode and how slow is it?
CPU fallback runs local model inference on your processor. A 7B model on a modern 8-core CPU generates 3–8 tokens per second — usable for overnight batch jobs but frustrating for interactive use. A mid-range GPU running the same model generates 50–120 tokens per second. For interactive agents, GPU inference is necessary.
Does VRAM matter for OpenClaw's gateway and agent runtime?
No. OpenClaw's gateway and agent runtime are lightweight Node.js processes that use system RAM, not VRAM. VRAM only matters for the local model backend via Ollama. The gateway itself runs fine on any machine with 4 GB of system RAM regardless of GPU presence or absence.
T. Chen has benchmarked local model inference across NVIDIA, AMD, and Apple Silicon hardware for production OpenClaw deployments, and has built and tuned agent rigs ranging from single-GPU workstations to multi-node inference clusters, with particular focus on cost-effective VRAM utilization for small teams.