- Heartbeat pings agents every intervalSeconds (default 30) to verify they're alive — missed pings trigger offline status and hooks
- missesBeforeOffline (default 3) sets how many consecutive missed pings before an agent is marked down — lower values = faster detection
- webhookUrl sends a JSON POST to your monitoring system on every status change — connects to UptimeRobot, PagerDuty, or any webhook receiver
- Heartbeat pings don't involve the LLM — they're internal process checks with no API cost
- Disable heartbeat only in local dev (enabled: false) — always keep it on in production deployments
An agent that's listed as "online" but not responding is worse than an agent that's visibly offline. It silently swallows messages, users get no response, and you find out from a frustrated end user three hours later. The heartbeat system prevents this. Three missed pings and the gateway marks the agent down, stops routing to it, and fires your alert hook — all automatically.
How the Heartbeat Works
The gateway maintains a persistent connection to each registered agent. Every intervalSeconds, the gateway sends a lightweight ping to each agent. The agent must acknowledge the ping within a timeout window (half the interval by default). A successful acknowledgment resets the miss counter. A non-response increments it.
When the miss counter reaches missesBeforeOffline, the gateway transitions the agent to "offline" state. In offline state, incoming messages are queued or dropped depending on your offlinePolicy setting, and the agent is excluded from load balancing if you're running multiple agent instances.
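The ping/ack cycle above boils down to a small state machine. Here's a sketch of that miss-counter logic; the class and method names are my own illustration, not the gateway's internal implementation:

```python
class HeartbeatMonitor:
    """Illustrative sketch of the gateway's per-agent miss counter."""

    def __init__(self, misses_before_offline=3):
        self.misses_before_offline = misses_before_offline
        self.misses = 0
        self.status = "online"

    def on_ack(self):
        # A successful acknowledgment resets the miss counter.
        self.misses = 0
        if self.status == "offline":
            self.status = "online"  # agent recovered

    def on_timeout(self):
        # A non-response increments the counter; at the threshold,
        # the agent transitions to offline.
        self.misses += 1
        if self.misses >= self.misses_before_offline:
            self.status = "offline"

monitor = HeartbeatMonitor(misses_before_offline=3)
monitor.on_timeout()
monitor.on_timeout()
monitor.on_ack()       # counter resets before the third miss lands
print(monitor.status)  # online
```

Note that a single ack anywhere in the sequence resets the count, which is why only *consecutive* misses trigger the offline transition.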
Configuration Reference
Heartbeat is configured in gateway.yaml under the heartbeat block:
```yaml
heartbeat:
  enabled: true
  intervalSeconds: 30
  missesBeforeOffline: 3
  webhookUrl: "https://hooks.yourdomain.com/agent-status"
  offlinePolicy: queue  # queue | drop
```
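After loading the block, it's worth sanity-checking the values before the gateway starts. A sketch of what that could look like; the field names come from the reference below, but the checks themselves are my own assumptions:

```python
DEFAULTS = {"enabled": True, "intervalSeconds": 30,
            "missesBeforeOffline": 3, "offlinePolicy": "queue"}

def validate_heartbeat(cfg: dict) -> dict:
    """Apply documented defaults and reject obviously bad values."""
    merged = {**DEFAULTS, **cfg}
    if merged["intervalSeconds"] < 10:
        raise ValueError("intervalSeconds below the recommended minimum of 10")
    if merged["offlinePolicy"] not in ("queue", "drop"):
        raise ValueError("offlinePolicy must be 'queue' or 'drop'")
    return merged

cfg = validate_heartbeat({"intervalSeconds": 15, "missesBeforeOffline": 2})
print(cfg["offlinePolicy"])  # queue (default applied)
```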
Key fields explained:
- enabled — Set false to disable heartbeat entirely. Default: true. Only disable for local development.
- intervalSeconds — How often the gateway pings each agent. Default: 30. Lower values detect failures faster. Minimum recommended: 10.
- missesBeforeOffline — Consecutive missed pings before offline status. Default: 3. With 30s interval, this gives 90 seconds of detection time.
- webhookUrl — Optional URL that receives a POST when any agent changes status (online → offline or offline → online).
- offlinePolicy — What happens to messages sent to an offline agent. queue holds them for delivery when the agent recovers; drop discards them immediately. Default: queue.
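The two timing fields together determine the worst-case detection window. A quick way to reason about it:

```python
def detection_seconds(interval_seconds: int, misses_before_offline: int) -> int:
    # The gateway needs `missesBeforeOffline` consecutive silent intervals
    # before it flips the agent to offline.
    return interval_seconds * misses_before_offline

print(detection_seconds(30, 3))  # 90 seconds with the defaults
```

An agent can therefore be unresponsive for up to this long while still showing as online, which is the number to keep in mind when tuning intervals later.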
Webhook Alerts on Status Changes
The webhookUrl field triggers a POST every time an agent changes state. The payload looks like this:
```json
{
  "event": "agent.offline",
  "agentName": "my-assistant",
  "timestamp": "2025-02-07T14:23:11Z",
  "consecutiveMisses": 3,
  "lastSeen": "2025-02-07T14:22:41Z"
}
```
For a recovery event, the event field changes to agent.online and consecutiveMisses resets to 0. This is all you need to wire up incident management: if event is agent.offline, open an incident; if agent.online, resolve it.
Here's a minimal Python webhook receiver that forwards to Slack:
```python
from flask import Flask, request
import requests

app = Flask(__name__)
SLACK_WEBHOOK = "https://hooks.slack.com/services/..."

@app.route("/agent-status", methods=["POST"])
def agent_status():
    data = request.json
    if data["event"] == "agent.offline":
        msg = f":red_circle: *{data['agentName']}* went offline"
    else:
        msg = f":large_green_circle: *{data['agentName']}* recovered"
    requests.post(SLACK_WEBHOOK, json={"text": msg})
    return "", 200
```
Connecting to External Monitors
Rather than building a custom receiver, most teams route the webhookUrl to an existing monitoring service. Common integrations as of early 2025:
- Betterstack (Uptime) — Use their incoming webhook URL. The agent.offline event opens an incident; agent.online resolves it. Betterstack handles alerting, escalation, and on-call routing from there.
- PagerDuty — Use PagerDuty's Events API v2 endpoint as webhookUrl. Map agent.offline to trigger and agent.online to resolve. PagerDuty handles deduplication automatically.
- UptimeRobot — UptimeRobot doesn't have a webhook push receiver. Instead, expose the OpenClaw /api/v1/status endpoint and configure UptimeRobot to poll it on your desired interval.
- Healthchecks.io — Configure the heartbeat webhookUrl to ping the healthchecks.io endpoint on agent.online events. If the ping stops arriving (because the agent went offline), healthchecks.io fires an alert after its grace period.
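To make the PagerDuty mapping concrete, here's a sketch that translates the heartbeat payload into an Events API v2 event body. The routing key is a placeholder and the exact field mapping is my own choice; adapt both to your PagerDuty service:

```python
def to_pagerduty(payload: dict, routing_key: str) -> dict:
    """Translate a heartbeat webhook payload into a PagerDuty Events API v2 body."""
    offline = payload["event"] == "agent.offline"
    event = {
        "routing_key": routing_key,
        # Map agent.offline -> trigger, agent.online -> resolve.
        "event_action": "trigger" if offline else "resolve",
        # Using the agent name as dedup_key lets PagerDuty match
        # the later resolve to the original trigger.
        "dedup_key": f"heartbeat-{payload['agentName']}",
    }
    if offline:
        event["payload"] = {
            "summary": (f"{payload['agentName']} missed "
                        f"{payload['consecutiveMisses']} heartbeats"),
            "source": "openclaw-gateway",
            "severity": "critical",
        }
    return event

evt = to_pagerduty({"event": "agent.offline", "agentName": "my-assistant",
                    "consecutiveMisses": 3}, routing_key="YOUR_ROUTING_KEY")
print(evt["event_action"])  # trigger
```

The dedup_key is what makes the automatic deduplication mentioned above work: repeated offline events for the same agent collapse into one open incident.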
Tuning Heartbeat Intervals
The right interval depends on your use case:
- Customer-facing agents — 15s interval, 2 misses (30s detection). Users notice silent failures quickly; fast detection matters more than false positive risk.
- Background automation agents — 60s interval, 3 misses (3 min detection). These process tasks in batches; a few minutes of downtime before detection is acceptable.
- Local development — Disabled. You're stopping and starting agents constantly; heartbeat noise is more annoying than useful.
Here's where most people stop. Don't. The interval also affects your queue behavior. With offlinePolicy: queue and a 90-second detection time, messages sent during a failure are queued for up to 90 seconds before the agent is marked offline. If your agent is hard-crashed, those queued messages sit in memory until the agent recovers. Set a max queue size in the queue block to prevent memory growth:
```yaml
heartbeat:
  intervalSeconds: 15
  missesBeforeOffline: 2
  offlinePolicy: queue
  queue:
    maxMessages: 500
    ttlSeconds: 300  # Drop queued messages after 5 minutes
```
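To pick a maxMessages value, estimate how many messages can pile up while the agent is down. A back-of-the-envelope sketch, where the message rate is your own measurement and the exposure model (detection window plus TTL) is my simplifying assumption:

```python
def worst_case_queued(msgs_per_second: float, interval_seconds: int,
                      misses_before_offline: int, ttl_seconds: int) -> int:
    # Messages queue during the detection window and keep queueing
    # afterwards until the agent recovers or the TTL expires them.
    exposure = interval_seconds * misses_before_offline + ttl_seconds
    return int(msgs_per_second * exposure)

# With the config above (15s * 2 misses + 300s TTL) at 1 msg/s:
print(worst_case_queued(1.0, 15, 2, 300))  # 330, under the 500-message cap
```

If the estimate exceeds your cap, either raise maxMessages (and accept the memory cost) or shorten ttlSeconds.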
Common Mistakes
Leaving heartbeat disabled after moving from local dev to production is the most common mistake I see. The config file gets copied without review and production runs without any liveness detection for months. Here's how to catch this before it bites you: after every deployment, run openclaw status --json and verify the heartbeat.enabled field is true in the output.
Setting missesBeforeOffline too low (1 or 2 with a short interval) causes false positives. A momentary network hiccup on a 10-second interval with 2 misses means any 20-second network blip triggers an offline alert. That trains teams to ignore alerts — which defeats the whole point. Use 3 misses minimum for production; reserve 2 for truly critical, low-tolerance workloads.
Not testing the webhookUrl before going live. The webhook URL is only useful if it fires. Send a test by temporarily lowering missesBeforeOffline to 1 and stopping your agent process for 15 seconds. Verify the webhook fires and your monitoring system receives it. Then restore the config.
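An alternative to bouncing the real agent is to replay a synthetic payload at your receiver. A sketch, assuming the payload shape documented earlier and a receiver listening locally; the helper name and local URL are placeholders:

```python
import datetime

def build_test_payload(agent_name: str, event: str = "agent.offline") -> dict:
    """Synthesize a payload matching the documented webhook shape."""
    now = datetime.datetime.now(datetime.timezone.utc)
    stamp = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "event": event,
        "agentName": agent_name,
        "timestamp": stamp,
        "consecutiveMisses": 3 if event == "agent.offline" else 0,
        "lastSeen": stamp,
    }

payload = build_test_payload("my-assistant")
print(payload["event"])  # agent.offline
# To exercise a local receiver:
#   requests.post("http://localhost:5000/agent-status", json=payload)
```

This only proves your receiver handles the payload; the config-lowering test above is still worth doing once, since it exercises the gateway's actual detection path end to end.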
Frequently Asked Questions
What is the OpenClaw heartbeat?
The OpenClaw heartbeat is a periodic ping the gateway sends to each connected agent to verify it's alive. If an agent misses a configured number of heartbeats, the gateway marks it offline and can trigger alerts. It's the primary liveness signal for production deployments — without it, failed agents appear online until a human notices.
How do I configure the heartbeat interval in OpenClaw?
Set heartbeat.intervalSeconds in gateway.yaml. The default is 30 seconds. For strict uptime requirements, 15 seconds is common. Lower values detect failures faster but require careful tuning of missesBeforeOffline to avoid false positives from transient network events.
What happens when an agent misses a heartbeat?
After missing heartbeat.missesBeforeOffline consecutive heartbeats (default 3), the gateway marks the agent offline, stops routing new messages to it, and logs a WARNING. If webhookUrl is configured, a POST fires immediately. The agent is re-marked online when it reconnects and responds to pings again.
Can I disable the heartbeat for local development?
Yes. Set heartbeat.enabled: false to disable all heartbeat checks. This is the right choice for local development where agents start and stop frequently. Never disable heartbeats in production — silent failures become invisible without it.
How do I send heartbeat status to an external monitor?
Set heartbeat.webhookUrl to any endpoint that accepts HTTP POST. OpenClaw sends JSON with the agent name, event type (agent.offline or agent.online), and timestamp. This integrates directly with PagerDuty, Betterstack, Slack webhooks, or any custom receiver you build.
Does the heartbeat affect LLM API costs?
No. Heartbeat pings are internal gateway-to-agent liveness checks — they don't involve any LLM inference. The gateway checks process responsiveness directly. Heartbeat overhead is negligible: a few bytes per interval per agent, no external API calls.
J. Donovan documents OpenClaw's operational features for production teams. Has written monitoring runbooks for OpenClaw deployments across SaaS products, internal tools, and customer-facing automation platforms with strict uptime requirements.