- Heartbeat pings agents every intervalSeconds (default 30) to verify they're alive — missed pings trigger offline status and hooks
- missesBeforeOffline (default 3) sets how many consecutive missed pings before an agent is marked down — lower values = faster detection
- webhookUrl sends a JSON POST to your monitoring system on every status change — connects to UptimeRobot, PagerDuty, or any webhook receiver
- Heartbeat pings don't involve the LLM — they're internal process checks with no API cost
- Disable heartbeat only in local dev (enabled: false) — always keep it on in production deployments
An agent that's listed as "online" but not responding is worse than an agent that's visibly offline. It silently swallows messages, users get no response, and you find out from a frustrated end user three hours later. The heartbeat system prevents this. Three missed pings and the gateway marks the agent down, stops routing to it, and fires your alert hook — all automatically.
How the Heartbeat Works
The gateway maintains a persistent connection to each registered agent. Every intervalSeconds, the gateway sends a lightweight ping to each agent. The agent must acknowledge the ping within a timeout window (half the interval by default). A successful acknowledgment resets the miss counter. A non-response increments it.
When the miss counter reaches missesBeforeOffline, the gateway transitions the agent to "offline" state. In offline state, incoming messages are queued or dropped depending on your offlinePolicy setting, and the agent is excluded from load balancing if you're running multiple agent instances.
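The ping/ack cycle above boils down to a small state machine. Here's a sketch of that miss-counter logic; the class and method names are my own illustration, not the gateway's internal implementation:

```python
class HeartbeatMonitor:
    """Illustrative sketch of the gateway's per-agent miss counter."""

    def __init__(self, misses_before_offline=3):
        self.misses_before_offline = misses_before_offline
        self.misses = 0
        self.status = "online"

    def on_ack(self):
        # A successful acknowledgment resets the miss counter.
        self.misses = 0
        if self.status == "offline":
            self.status = "online"  # agent recovered

    def on_timeout(self):
        # A non-response increments the counter; at the threshold,
        # the agent transitions to offline.
        self.misses += 1
        if self.misses >= self.misses_before_offline:
            self.status = "offline"

monitor = HeartbeatMonitor(misses_before_offline=3)
monitor.on_timeout()
monitor.on_timeout()
monitor.on_ack()       # counter resets before the third miss lands
print(monitor.status)  # online
```

Note that a single ack anywhere in the sequence resets the count, which is why only *consecutive* misses trigger the offline transition.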
Configuration Reference
Heartbeat is configured in gateway.yaml under the heartbeat block:
```yaml
heartbeat:
  enabled: true
  intervalSeconds: 30
  missesBeforeOffline: 3
  webhookUrl: "https://hooks.yourdomain.com/agent-status"
  offlinePolicy: queue  # queue | drop
```
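After loading the block, it's worth sanity-checking the values before the gateway starts. A sketch of what that could look like; the field names come from the reference below, but the checks themselves are my own assumptions:

```python
DEFAULTS = {"enabled": True, "intervalSeconds": 30,
            "missesBeforeOffline": 3, "offlinePolicy": "queue"}

def validate_heartbeat(cfg: dict) -> dict:
    """Apply documented defaults and reject obviously bad values."""
    merged = {**DEFAULTS, **cfg}
    if merged["intervalSeconds"] < 10:
        raise ValueError("intervalSeconds below the recommended minimum of 10")
    if merged["offlinePolicy"] not in ("queue", "drop"):
        raise ValueError("offlinePolicy must be 'queue' or 'drop'")
    return merged

cfg = validate_heartbeat({"intervalSeconds": 15, "missesBeforeOffline": 2})
print(cfg["offlinePolicy"])  # queue (default applied)
```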
Key fields explained:
- enabled — Set false to disable heartbeat entirely. Default: true. Only disable for local development.
- intervalSeconds — How often the gateway pings each agent. Default: 30. Lower values detect failures faster. Minimum recommended: 10.
- missesBeforeOffline — Consecutive missed pings before offline status. Default: 3. With 30s interval, this gives 90 seconds of detection time.
- webhookUrl — Optional URL that receives a POST when any agent changes status (online → offline or offline → online).
- offlinePolicy — What happens to messages sent to an offline agent. queue holds them for delivery when the agent recovers; drop discards them immediately. Default: queue.
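The two timing fields together determine the worst-case detection window. A quick way to reason about it:

```python
def detection_seconds(interval_seconds: int, misses_before_offline: int) -> int:
    # The gateway needs `missesBeforeOffline` consecutive silent intervals
    # before it flips the agent to offline.
    return interval_seconds * misses_before_offline

print(detection_seconds(30, 3))  # 90 seconds with the defaults
```

An agent can therefore be unresponsive for up to this long while still showing as online, which is the number to keep in mind when tuning intervals later.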
Webhook Alerts on Status Changes
The webhookUrl field triggers a POST every time an agent changes state. The payload looks like this:
```json
{
  "event": "agent.offline",
  "agentName": "my-assistant",
  "timestamp": "2025-02-07T14:23:11Z",
  "consecutiveMisses": 3,
  "lastSeen": "2025-02-07T14:22:41Z"
}
```
For a recovery event, the event field changes to agent.online and consecutiveMisses resets to 0. This is all you need to wire up incident management: if event is agent.offline, open an incident; if agent.online, resolve it.
Here's a minimal Python webhook receiver that forwards to Slack:
```python
from flask import Flask, request
import requests

app = Flask(__name__)
SLACK_WEBHOOK = "https://hooks.slack.com/services/..."

@app.route("/agent-status", methods=["POST"])
def agent_status():
    data = request.json
    if data["event"] == "agent.offline":
        msg = f":red_circle: *{data['agentName']}* went offline"
    else:
        msg = f":large_green_circle: *{data['agentName']}* recovered"
    requests.post(SLACK_WEBHOOK, json={"text": msg})
    return "", 200
```
Connecting to External Monitors
Rather than building a custom receiver, most teams route the webhookUrl to an existing monitoring service. Common integrations as of early 2025:
- Betterstack (Uptime) — Use their incoming webhook URL. The agent.offline event opens an incident; agent.online resolves it. Betterstack handles alerting, escalation, and on-call routing from there.
- PagerDuty — Use PagerDuty's Events API v2 endpoint as webhookUrl. Map agent.offline to trigger and agent.online to resolve. PagerDuty handles deduplication automatically.
- UptimeRobot — UptimeRobot doesn't have a webhook push receiver. Instead, expose the OpenClaw /api/v1/status endpoint and configure UptimeRobot to poll it on your desired interval.
- Healthchecks.io — Configure the heartbeat webhookUrl to ping the healthchecks.io endpoint on agent.online events. If the ping stops arriving (because the agent went offline), healthchecks.io fires an alert after its grace period.
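To make the PagerDuty mapping concrete, here's a sketch that translates the heartbeat payload into an Events API v2 event body. The routing key is a placeholder and the exact field mapping is my own choice; adapt both to your PagerDuty service:

```python
def to_pagerduty(payload: dict, routing_key: str) -> dict:
    """Translate a heartbeat webhook payload into a PagerDuty Events API v2 body."""
    offline = payload["event"] == "agent.offline"
    event = {
        "routing_key": routing_key,
        # Map agent.offline -> trigger, agent.online -> resolve.
        "event_action": "trigger" if offline else "resolve",
        # Using the agent name as dedup_key lets PagerDuty match
        # the later resolve to the original trigger.
        "dedup_key": f"heartbeat-{payload['agentName']}",
    }
    if offline:
        event["payload"] = {
            "summary": (f"{payload['agentName']} missed "
                        f"{payload['consecutiveMisses']} heartbeats"),
            "source": "openclaw-gateway",
            "severity": "critical",
        }
    return event

evt = to_pagerduty({"event": "agent.offline", "agentName": "my-assistant",
                    "consecutiveMisses": 3}, routing_key="YOUR_ROUTING_KEY")
print(evt["event_action"])  # trigger
```

The dedup_key is what makes the automatic deduplication mentioned above work: repeated offline events for the same agent collapse into one open incident.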
Tuning Heartbeat Intervals
The right interval depends on your use case:
- Customer-facing agents — 15s interval, 2 misses (30s detection). Users notice silent failures quickly; fast detection matters more than false positive risk.
- Background automation agents — 60s interval, 3 misses (3 min detection). These process tasks in batches; a few minutes of downtime before detection is acceptable.
- Local development — Disabled. You're stopping and starting agents constantly; heartbeat noise is more annoying than useful.
Here's where most people stop. Don't. The interval also affects your queue behavior. With offlinePolicy: queue and a 90-second detection time, messages sent during a failure are queued for up to 90 seconds before the agent is marked offline. If your agent is hard-crashed, those queued messages sit in memory until the agent recovers. Set a max queue size in the queue block to prevent memory growth:
```yaml
heartbeat:
  intervalSeconds: 15
  missesBeforeOffline: 2
  offlinePolicy: queue
  queue:
    maxMessages: 500
    ttlSeconds: 300  # Drop queued messages after 5 minutes
```
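To pick a maxMessages value, estimate how many messages can pile up while the agent is down. A back-of-the-envelope sketch, where the message rate is your own measurement and the exposure model (detection window plus TTL) is my simplifying assumption:

```python
def worst_case_queued(msgs_per_second: float, interval_seconds: int,
                      misses_before_offline: int, ttl_seconds: int) -> int:
    # Messages queue during the detection window and keep queueing
    # afterwards until the agent recovers or the TTL expires them.
    exposure = interval_seconds * misses_before_offline + ttl_seconds
    return int(msgs_per_second * exposure)

# With the config above (15s * 2 misses + 300s TTL) at 1 msg/s:
print(worst_case_queued(1.0, 15, 2, 300))  # 330, under the 500-message cap
```

If the estimate exceeds your cap, either raise maxMessages (and accept the memory cost) or shorten ttlSeconds.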
Common Mistakes
Leaving heartbeat disabled after moving from local dev to production is the most common mistake I see. The config file gets copied without review and production runs without any liveness detection for months. Here's how to catch this before it bites you: after every deployment, run openclaw status --json and verify the heartbeat.enabled field is true in the output.
Setting missesBeforeOffline too low (1 or 2 with a short interval) causes false positives. A momentary network hiccup on a 10-second interval with 2 misses means any 20-second network blip triggers an offline alert. That trains teams to ignore alerts — which defeats the whole point. Use 3 misses minimum for production; reserve 2 for truly critical, low-tolerance workloads.
Not testing the webhookUrl before going live. The webhook URL is only useful if it fires. Send a test by temporarily lowering missesBeforeOffline to 1 and stopping your agent process for 15 seconds. Verify the webhook fires and your monitoring system receives it. Then restore the config.
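An alternative to bouncing the real agent is to replay a synthetic payload at your receiver. A sketch, assuming the payload shape documented earlier and a receiver listening locally; the helper name and local URL are placeholders:

```python
import datetime

def build_test_payload(agent_name: str, event: str = "agent.offline") -> dict:
    """Synthesize a payload matching the documented webhook shape."""
    now = datetime.datetime.now(datetime.timezone.utc)
    stamp = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "event": event,
        "agentName": agent_name,
        "timestamp": stamp,
        "consecutiveMisses": 3 if event == "agent.offline" else 0,
        "lastSeen": stamp,
    }

payload = build_test_payload("my-assistant")
print(payload["event"])  # agent.offline
# To exercise a local receiver:
#   requests.post("http://localhost:5000/agent-status", json=payload)
```

This only proves your receiver handles the payload; the config-lowering test above is still worth doing once, since it exercises the gateway's actual detection path end to end.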
Frequently Asked Questions
What is the OpenClaw heartbeat?
The OpenClaw heartbeat is a periodic ping the gateway sends to each connected agent to verify it's alive. If an agent misses a configured number of heartbeats, the gateway marks it offline and can trigger alerts. It's the primary liveness signal for production deployments — without it, failed agents appear online until a human notices.
How do I configure the heartbeat interval in OpenClaw?
Set heartbeat.intervalSeconds in gateway.yaml. The default is 30 seconds. For strict uptime requirements, 15 seconds is common. Lower values detect failures faster but require careful tuning of missesBeforeOffline to avoid false positives from transient network events.
What happens when an agent misses a heartbeat?
After missing heartbeat.missesBeforeOffline consecutive heartbeats (default 3), the gateway marks the agent offline, stops routing new messages to it, and logs a WARNING. If webhookUrl is configured, a POST fires immediately. The agent is re-marked online when it reconnects and responds to pings again.
Can I disable the heartbeat for local development?
Yes. Set heartbeat.enabled: false to disable all heartbeat checks. This is the right choice for local development where agents start and stop frequently. Never disable heartbeats in production — silent failures become invisible without it.
How do I send heartbeat status to an external monitor?
Set heartbeat.webhookUrl to any endpoint that accepts HTTP POST. OpenClaw sends JSON with the agent name, event type (agent.offline or agent.online), and timestamp. This integrates directly with PagerDuty, Betterstack, Slack webhooks, or any custom receiver you build.
Does the heartbeat affect LLM API costs?
No. Heartbeat pings are internal gateway-to-agent liveness checks — they don't involve any LLM inference. The gateway checks process responsiveness directly. Heartbeat overhead is negligible: a few bytes per interval per agent, no external API calls.
J. Donovan documents OpenClaw's operational features for production teams. Has written monitoring runbooks for OpenClaw deployments across SaaS products, internal tools, and customer-facing automation platforms with strict uptime requirements.