Pre-incident monitoring checklist
Topic: Monitoring basics
Summary
A checklist to complete before going live, covering metrics, alerts, runbooks, on-call, and dashboards. Use it when preparing a new service or before a launch.
Intent: Checklist
Quick answer
- Emit metrics for latency, errors, and saturation. Define alerts with explicit thresholds and linked runbooks. Set up an on-call rotation and an escalation path.
- Build a dashboard with the key panels. Run an uptime or health check from outside the service. Aggregate logs and set a retention period.
- Test the alert flow end to end. Verify each runbook step. Document ownership and escalation.
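To make the idea of thresholds on latency, errors, and saturation concrete, here is a minimal Python sketch. The signal names, the use of a p95 latency, and the threshold values are all assumptions chosen for illustration; a real service would normally express these as rules in its alerting system.

```python
import math
from typing import Dict, List


def percentile(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list (pct is an integer 1-100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def breached(
    latencies_ms: List[float],
    errors: int,
    requests: int,
    cpu_utilization: float,
    thresholds: Dict[str, float],
) -> List[str]:
    """Return the names of the signals whose value exceeds its threshold."""
    # Hypothetical signal names; one each for latency, errors, and saturation.
    signals = {
        "latency_p95_ms": percentile(latencies_ms, 95),
        "error_rate": errors / requests if requests else 0.0,
        "cpu_utilization": cpu_utilization,
    }
    return [name for name, value in signals.items() if value > thresholds[name]]
```

Calling `breached(...)` with recent samples and a thresholds dictionary returns the list of signals that would page, which is a quick way to sanity-check threshold values before launch.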
Steps
Step 1: Metrics and alerts
Confirm that metrics are being scraped or emitted. Define each alert with a threshold and a runbook link, then send a test alert to verify delivery.
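One way to automate part of this step is to lint the alert definitions before launch. The sketch below is illustrative only: the `Alert` fields and the `lint_alerts` helper are hypothetical and do not correspond to any particular alerting tool's format.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Alert:
    # Hypothetical shape of an alert definition; adapt to your alerting tool.
    name: str
    metric: str
    threshold: Optional[float] = None
    runbook_url: Optional[str] = None


def lint_alerts(alerts: List[Alert], emitted_metrics: Set[str]) -> List[str]:
    """Return human-readable problems; an empty list means every check passed."""
    problems = []
    for alert in alerts:
        if alert.metric not in emitted_metrics:
            problems.append(f"{alert.name}: metric '{alert.metric}' is not emitted")
        if alert.threshold is None:
            problems.append(f"{alert.name}: no threshold set")
        if not alert.runbook_url:
            problems.append(f"{alert.name}: no runbook link")
    return problems
```

Running `lint_alerts` against the set of metric names the service actually emits catches alerts that can never fire and alerts that would page someone without a runbook.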
Step 2: Dashboards and checks
Build a dashboard with the key panels. Set up a health or uptime check that runs from outside the service. Confirm that logs are available and retained.
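An external check can be as simple as a probe that requests a health endpoint and treats anything other than a 2xx response as a failure. The following is a minimal sketch using only the Python standard library; the `check_health` name and the `/healthz` path in the usage note are assumptions, and a hosted uptime service would normally do this job in production.

```python
import urllib.error
import urllib.request


def check_health(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers refused connections, DNS failures, and timeouts. urlopen also
        # raises HTTPError (a URLError subclass) for 4xx and 5xx responses.
        return False
```

For example, `check_health("https://service.example.com/healthz")` could be run on a schedule from a machine outside the service's own network, so that the check fails when the service is unreachable to real users.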
Step 3: On-call and escalation
Set the on-call rotation. Document the escalation path and who owns the service. Do a dry run if possible.
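For a dry run, it helps to be able to answer "who is on call on a given day, and who is next in line?" The sketch below assumes a fixed-length, round-robin rotation and a three-level escalation (primary, next person in the rotation, manager). Both are illustrative assumptions; most teams will read this from their paging tool instead.

```python
from datetime import date
from typing import List


def on_call_for(day: date, rotation: List[str], start: date, shift_days: int = 7) -> str:
    """Return who is on call on `day` for a fixed-length, round-robin rotation."""
    if day < start:
        raise ValueError("day is before the rotation start")
    shift_index = (day - start).days // shift_days
    return rotation[shift_index % len(rotation)]


def escalation_chain(primary: str, rotation: List[str], manager: str) -> List[str]:
    """Primary first, then the next person in the rotation, then the manager."""
    # With a one-person rotation the secondary is the same as the primary.
    secondary = rotation[(rotation.index(primary) + 1) % len(rotation)]
    return [primary, secondary, manager]
```

Checking the launch date with `on_call_for` and printing the resulting `escalation_chain` is a cheap way to confirm that the documented escalation path names real, reachable people.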
Verification
- Every checklist item is checked off, a test alert was delivered successfully, and each runbook was validated.
Troubleshooting
- Missing metrics: add a scrape target or instrument the service.
- Alert not received: check the alert routing and the notification integration.
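When an alert is not received, a common cause is that no route matches its labels. The sketch below models routing as an ordered list of label matchers, which is an assumption made for illustration and not the configuration format of any specific alerting system; `find_receiver` and the receiver names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# (labels that must all match, receiver name)
Route = Tuple[Dict[str, str], str]


def find_receiver(
    alert_labels: Dict[str, str],
    routes: List[Route],
    default: Optional[str] = None,
) -> Optional[str]:
    """Return the receiver of the first route whose labels all match the alert."""
    for required, receiver in routes:
        if all(alert_labels.get(key) == value for key, value in required.items()):
            return receiver
    # No route matched: without a default receiver the alert goes nowhere.
    return default
```

If `find_receiver` returns `None` for the labels on a real alert, the routing has a gap, and adding a catch-all default receiver is usually the safest fix.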