Health Monitoring
Define health probes for your services and get alerted when something breaks.
Probe Types
| Type | Target Format | What It Checks |
|---|---|---|
| http | URL | HTTP status code, response body, latency |
| port | host:port | TCP connectivity |
| command | Shell command | Exit code matches expected (default: 0) |
| file | File path | File exists and is not older than max_age_secs |
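The command and file types can be declared in config.toml in the same way as the other probes. The following is a minimal sketch rather than a confirmed schema: the probe names, the target values, and the "every 10m" and "every 1h" schedules are illustrative, and it assumes max_age_secs is set directly on the probe entry.

```toml
# Command probe: passes when the shell command exits with the expected code (default 0).
[[health.probes]]
name = "Nginx Service"                          # hypothetical name
probe_type = "command"
target = "systemctl is-active --quiet nginx"    # illustrative command, exits 0 when the unit is active
schedule = "every 10m"                          # assumed to follow the same "every <N><unit>" form

# File probe: passes when the file exists and is recent enough.
[[health.probes]]
name = "Nightly Backup"                         # hypothetical name
probe_type = "file"
target = "/var/backups/nightly.tar.gz"          # hypothetical path
schedule = "every 1h"
max_age_secs = 90000                            # 25 hours; assumed to be a per-probe key
```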
Configuration
config.toml
[health]
enabled = true
tick_interval_secs = 30
result_retention_days = 7
[[health.probes]]
name = "API Server"
probe_type = "http"
target = "https://api.example.com/health"
schedule = "every 5m"
consecutive_failures_alert = 3
latency_threshold_ms = 2000
alert_session_ids = ["123456789"]
[[health.probes]]
name = "Database"
probe_type = "port"
target = "localhost:5432"
schedule = "every 1m"
HTTP Probe Options
| Key | Type | Default | Description |
|---|---|---|---|
| timeout_secs | integer | 10 | Request timeout in seconds |
| expected_status | integer | 200 | Expected HTTP status code |
| expected_body | string | null | Expected substring in response body |
| method | string | "GET" | HTTP method |
| headers | object | {} | Custom HTTP headers |
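As an illustration, an HTTP probe that overrides several of these defaults might look like the sketch below. It assumes the options are set directly on the probe entry alongside name and target, and that headers accepts a TOML inline table. The probe name, URL, and header values are placeholders.

```toml
[[health.probes]]
name = "Status Endpoint"                        # hypothetical name
probe_type = "http"
target = "https://api.example.com/status"       # placeholder URL
schedule = "every 5m"
timeout_secs = 5                                # fail the request after 5 seconds instead of 10
expected_status = 200
expected_body = "ok"                            # response body must contain this substring
method = "GET"
headers = { Accept = "application/json" }       # assumed inline-table form for the headers object
```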
Alerting
When a probe fails consecutive_failures_alert times in a row, an alert is sent to all session IDs in alert_session_ids.
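For example, a probe that should alert two sessions after two failed checks in a row could be sketched as follows. The probe name, target, and session IDs are placeholders. With a one-minute schedule, the alert would be expected roughly a minute after the first failure, assuming checks run on schedule.

```toml
[[health.probes]]
name = "Cache"                                    # hypothetical name
probe_type = "port"
target = "localhost:6379"                         # placeholder host:port
schedule = "every 1m"
consecutive_failures_alert = 2                    # alert on the second consecutive failure
alert_session_ids = ["123456789", "987654321"]    # placeholder session IDs
```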
Background Tasks
- Tick loop: runs every tick_interval_secs (default 30), executes due probes
- Cleanup: runs at 3:40 AM UTC, removes old results
Dynamic Probes
Probes can also be created at runtime by the agent via the health_probe tool.