- Firecrawl extracts full page content as clean markdown — not just search snippets — making it essential for deep research agents
- The MCP server requires a Firecrawl API key set via the FIRECRAWL_API_KEY environment variable in your agent config
- Firecrawl handles JavaScript-rendered pages using a headless browser — it sees what a real user would see
- Pair it with DuckDuckGo MCP: use search to find URLs, Firecrawl to extract full content from the most relevant ones
- As of early 2025, the free tier supports a limited number of scrapes per month — monitor usage to avoid hitting limits mid-pipeline
Here's where most research agents fall short: they search, get a snippet, and answer from that snippet. That's fine for surface-level questions. But when your agent needs to analyze a full report, extract data from a documentation page, or read a competitor's pricing table, snippets fail. You need the full page. Firecrawl gives you that — clean, structured, and ready for the LLM.
We've seen research agents transform from "summarizes search results" to "reads and synthesizes primary sources" the moment Firecrawl gets added to the stack. That's the difference between an agent that paraphrases and one that actually researches.
What Firecrawl Adds to Your Agent Pipeline
The distinction between search and scrape is fundamental. Search tools return a list of results with titles, URLs, and short excerpts — enough to know a page exists and roughly what it covers. Scraping tools fetch the actual content of a specific URL and return the full text, structured as clean markdown.
Most tasks need both. The agent searches to discover which pages are relevant, then scrapes the best candidates to extract the details it needs. Without scraping, your agent is always working from 200-word excerpts. With it, it reads the actual source.
Firecrawl specifically solves the JavaScript problem. Standard HTTP scrapers fetch the raw server response — for a React app or SPA, that's mostly empty. Firecrawl renders the page in a headless browser first, then extracts content. News sites, SaaS dashboards, documentation platforms — all of them work.
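To see why plain HTTP fetching fails on JavaScript-heavy sites, here is a minimal illustration. The HTML shell below is a made-up example of a typical SPA server response, not output from any real site; the point is how little visible text survives extraction when no browser executes the scripts.

```python
from html.parser import HTMLParser

# A typical server response for a JavaScript SPA: an empty mount point
# plus script tags. All real content arrives client-side, so a raw
# HTTP fetch never sees it.
SPA_SHELL = """
<!doctype html>
<html>
  <head><title>Pricing</title></head>
  <body>
    <div id="root"></div>
    <script src="/static/js/main.a1b2c3.js"></script>
  </body>
</html>
"""

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> bodies."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip -= 1

    def handle_data(self, data):
        if self._skip == 0 and data.strip():
            self.parts.append(data.strip())

extractor = TextExtractor()
extractor.feed(SPA_SHELL)
visible_text = " ".join(extractor.parts)
print(repr(visible_text))  # only the <title> text survives
```

A headless-browser scraper like Firecrawl runs the scripts first, so it extracts the rendered content instead of this near-empty shell.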
Prerequisites
- OpenClaw installed with at least one working agent configuration
- Node.js 18+ and npm on the agent host machine
- A Firecrawl account and API key (sign up at firecrawl.dev)
- Network access from your agent host to Firecrawl's API endpoints
Here's where most people stop. They see "API key required" and switch to DuckDuckGo. Fair — but Firecrawl's free tier is generous enough for prototyping and light production use. Get the key first, then decide.
Installing the Firecrawl MCP Server
Install the Firecrawl MCP server package globally:
```shell
npm install -g @mcptools/mcp-firecrawl
```
Verify installation:
```shell
mcp-firecrawl --version
```
You'll need your Firecrawl API key ready. Get it from your Firecrawl dashboard at firecrawl.dev after creating an account. The key looks like fc-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.
Your agent config YAML will likely end up in version control. If you hardcode the Firecrawl API key there, it gets committed and potentially exposed. Always pass it via the env block in your MCP server config, which reads from the host environment — not from a value embedded in the file.
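A pre-flight check at agent startup catches a missing or malformed key before the first tool call fails mid-session. This is a sketch, assuming the fc- key prefix described above; the function name is ours, not part of any library.

```python
import os

def require_firecrawl_key(env=os.environ):
    """Fail fast with a clear message if the key is missing or malformed."""
    key = env.get("FIRECRAWL_API_KEY", "")
    if not key:
        raise RuntimeError(
            "FIRECRAWL_API_KEY is not set. Export it on the host before "
            "starting the agent; never hardcode it in the config YAML."
        )
    if not key.startswith("fc-"):
        raise RuntimeError(
            "FIRECRAWL_API_KEY does not look like a Firecrawl key "
            "(expected an 'fc-' prefix)."
        )
    return key

# Demo with a fake key -- real keys come from your Firecrawl dashboard.
print(require_firecrawl_key({"FIRECRAWL_API_KEY": "fc-example-not-a-real-key"}))
```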
Configuring Your Agent to Use Firecrawl
Add the Firecrawl MCP server to your agent config. The key difference from DuckDuckGo is the env block where you pass the API key from the host environment:
```yaml
agent:
  name: research-agent
  model: claude-3-5-sonnet
  system_prompt: |
    You are a deep research assistant. Use duckduckgo_search to find
    relevant pages, then use firecrawl_scrape to read the full content
    of the most important ones. Always cite the URLs you scraped.
  mcp_servers:
    - name: duckduckgo
      command: mcp-duckduckgo
      args: ["--max-results", "5"]
      env: {}
      tools:
        - duckduckgo_search
    - name: firecrawl
      command: mcp-firecrawl
      args: []
      env:
        FIRECRAWL_API_KEY: "${FIRECRAWL_API_KEY}"
      tools:
        - firecrawl_scrape
        - firecrawl_crawl
```
Set the environment variable on your host before starting OpenClaw:
```shell
# Linux / macOS
export FIRECRAWL_API_KEY=fc-your-key-here

# Windows (Command Prompt)
set FIRECRAWL_API_KEY=fc-your-key-here

# Windows (PowerShell)
$env:FIRECRAWL_API_KEY="fc-your-key-here"
```
Understanding the Two Firecrawl Tools
The MCP server exposes two tools. Know the difference before writing your system prompt:
- firecrawl_scrape — fetches a single URL and returns its full content as markdown. Use this when you know exactly which page you need.
- firecrawl_crawl — starts at a URL and follows links to a specified depth, returning content from multiple pages. Use this for documentation sites, wikis, or any multi-page source you need to explore fully.
For most research agents, firecrawl_scrape is the right default. Crawling consumes significantly more API credits and token budget — reserve it for cases where you explicitly need multi-page coverage.
When using firecrawl_crawl, always instruct the agent to pass a maxDepth parameter of 1 or 2. An uncapped crawl on a large documentation site can consume hundreds of API credits in a single tool call. Add "never crawl deeper than 2 levels unless explicitly asked" to your system prompt.
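System-prompt instructions are advisory; a hard cap in code is not. If you proxy tool calls before they reach the MCP server, a small guard like the following enforces the depth ceiling even when the model ignores its instructions. The maxDepth parameter name follows the tool description above; the helper itself is an illustrative sketch, not part of any SDK.

```python
MAX_CRAWL_DEPTH = 2  # hard ceiling, regardless of what the model asks for

def clamp_crawl_params(params, ceiling=MAX_CRAWL_DEPTH):
    """Return a copy of the crawl parameters with maxDepth capped.

    Guards against an uncapped crawl burning hundreds of API credits
    in a single tool call on a large documentation site.
    """
    safe = dict(params)
    depth = safe.get("maxDepth")
    if depth is None or depth > ceiling:
        safe["maxDepth"] = ceiling
    return safe

print(clamp_crawl_params({"url": "https://docs.example.com", "maxDepth": 10}))
```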
The Search-Then-Scrape Pattern
The most effective pattern for research agents combines both tools in sequence. Here's how it works in practice, and what the agent logs look like:
```text
# User message:
"What are the actual pricing tiers for Anthropic's API as of this month?"

# Agent tool call sequence:
[Tool: duckduckgo_search]
  query: "Anthropic API pricing 2025"
  → 5 results returned

[Tool: firecrawl_scrape]
  url: "https://www.anthropic.com/pricing"
  → 2,400 words of markdown content returned

# Agent synthesizes from full page content
# Answer includes actual pricing table data, not a snippet
```
This two-step pattern is what separates agents that answer questions from agents that research them. The search step typically completes in well under a second. The scrape step takes 2–5 seconds depending on page complexity. The result is an answer grounded in primary source content.
We'll get to the system prompt engineering that makes this pattern reliable in a moment — but first understand why most builders get this wrong.
The failure mode is letting the agent decide whether to scrape. Without explicit instruction, most LLMs will answer from the search snippet alone if it looks sufficient. Add explicit rules: "When a question requires specific data, prices, or technical details, always scrape the source page before answering."
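The control flow the system prompt should induce can be sketched in plain code. Both tool functions below are stubs standing in for duckduckgo_search and firecrawl_scrape; in a real OpenClaw session the model issues these calls itself, so this only shows the decision rule, not a working client.

```python
def search(query):
    """Stub standing in for duckduckgo_search: (title, url) pairs."""
    return [
        ("Anthropic API Pricing", "https://www.anthropic.com/pricing"),
        ("Pricing discussion thread", "https://example.com/forum/123"),
    ]

def scrape(url):
    """Stub standing in for firecrawl_scrape: full page markdown."""
    return "# Pricing\n\n(full pricing table would appear here)"

def research(question, needs_specifics=True):
    """Search first; scrape the top hit whenever specifics are required.

    The needs_specifics flag encodes the rule from the text: questions
    about prices, data, or technical details must be answered from a
    scraped page, never from a search snippet alone.
    """
    results = search(question)
    if not needs_specifics:
        return {"source": "snippets", "results": results}
    top_url = results[0][1]
    return {"source": top_url, "content": scrape(top_url)}

answer = research("Anthropic API pricing 2025")
print(answer["source"])
```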
Common Mistakes That Break the Integration
- Hardcoding the API key in YAML — gets committed to git, exposes credentials. Always use environment variable interpolation in the env block.
- Relying on the agent to decide when to scrape — LLMs are optimistic about snippets. Give the agent explicit rules about when scraping is required.
- Using firecrawl_crawl for single-page lookups — crawling costs 5–10x more credits per task than single-page scraping. Match the tool to the task.
- Not handling scrape failures gracefully — some pages block scrapers. Tell the agent in the system prompt: "If firecrawl_scrape fails, note that the page couldn't be accessed and explain what you found from search results only."
- Ignoring token budget — a single scraped page can be 3,000–8,000 tokens. With multiple scrapes per session, context windows fill fast. Instruct the agent to summarize scraped content before storing it in shared memory.
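On the last point, a crude budget guard before scraped content enters shared memory can be sketched as follows. The four-characters-per-token estimate is a rough rule of thumb for English prose, and in practice you would have the model summarize rather than truncate; a hard cap is just the safety net.

```python
def approx_tokens(text):
    """Rough token estimate: ~4 characters per token for English prose."""
    return len(text) // 4

def trim_to_budget(markdown, max_tokens=2000):
    """Truncate scraped content before storing it in shared memory.

    Prevents a single 8,000-token page from crowding out the rest of
    the context window across multiple scrapes in one session.
    """
    limit = max_tokens * 4
    if len(markdown) <= limit:
        return markdown
    return markdown[:limit] + "\n\n[truncated to fit token budget]"

page = "word " * 5000  # ~25,000 characters of scraped content
print(approx_tokens(trim_to_budget(page)))
```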
Frequently Asked Questions
What does Firecrawl do that DuckDuckGo search doesn't?
DuckDuckGo returns search result snippets — titles, URLs, and short excerpts. Firecrawl fetches the full content of a specific URL as clean markdown. Use DuckDuckGo to find relevant pages, then Firecrawl to extract complete content from the most important ones for deep analysis.
Does Firecrawl MCP handle JavaScript-rendered pages?
Yes. Firecrawl uses a headless browser to render pages before extraction, handling React apps, SPAs, and JS-heavy sites that basic HTTP scrapers miss entirely. You get what a real browser sees — not just the raw server response, which is often nearly empty for modern web apps.
Do I need a Firecrawl API key to use it with OpenClaw?
Yes, a Firecrawl API key is required. Sign up at firecrawl.dev to get one. Pass the key via the FIRECRAWL_API_KEY environment variable in your MCP server config — never hardcode it in your agent YAML file where it could be exposed in version control.
Can Firecrawl scrape pages behind a login?
Not directly through the standard MCP integration. Firecrawl supports authentication flows in its cloud offering, but the MCP server exposes the public scraping interface only. For pages requiring login, you need a custom scraper with authenticated sessions or cookies passed explicitly.
How do I prevent the agent from scraping the same URL multiple times?
Add instructions to your system prompt telling the agent to track scraped URLs in the current session. You can also write them to OpenClaw shared memory as a visited list — the agent checks memory before calling Firecrawl, preventing redundant scrapes and unnecessary API credit usage.
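If you would rather enforce deduplication in code than trust the prompt, a small visited-list wrapper works; its state could be persisted to OpenClaw shared memory between turns. The class below is an illustrative sketch, and the normalization rules (lowercase host, trailing slash and fragment stripped) are our assumptions about what counts as "the same page."

```python
from urllib.parse import urlsplit, urlunsplit

class ScrapeCache:
    """Tracks visited URLs so the agent never scrapes a page twice."""

    def __init__(self):
        self._seen = set()

    @staticmethod
    def normalize(url):
        # Treat trailing slashes and #fragments as the same page,
        # and compare hosts case-insensitively.
        parts = urlsplit(url)
        return urlunsplit((parts.scheme, parts.netloc.lower(),
                           parts.path.rstrip("/") or "/", parts.query, ""))

    def should_scrape(self, url):
        key = self.normalize(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True

cache = ScrapeCache()
print(cache.should_scrape("https://Example.com/docs/"))       # first visit
print(cache.should_scrape("https://example.com/docs#intro"))  # duplicate
```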
What output format does Firecrawl return to the agent?
By default, Firecrawl returns page content as clean markdown — headings, paragraphs, lists, and code blocks preserved without HTML noise. This format is ideal for LLM consumption. You can also request raw HTML or structured JSON depending on the MCP server version and your specific configuration needs.
R. Nakamura builds integrations between OpenClaw and the broader developer toolchain. Has connected OpenClaw deployments to Zapier, n8n, custom Python backends, and enterprise workflow platforms. Deep experience with MCP tool chaining and research agent architectures across finance and media verticals.