Incident triage (when an alert fires)
Topic: Monitoring basics
Summary
When an alert fires, triage quickly: confirm the alert is real, identify scope and impact, and start the right runbook or escalation. Use this as the standard process for handling monitoring alerts and reducing MTTR.
Intent: How-to
Quick answer
- Confirm: check the current state (dashboard, or run a command). Is the metric still in an alert state? Could the alert be flapping or stale? If confirmed, acknowledge it and start triage.
- Scope: which service, which host, which region? Is it one instance or many? Check dependent alerts (e.g. a full disk may cause multiple service failures). Use the runbook for this alert type.
- Act: follow the runbook (restart, scale, clear disk, fail over). If there is no runbook, contain the impact (e.g. rollback, disable the feature) and escalate if needed. Document what you did and what you found for the post-incident review.
Prerequisites
- Access to the monitoring dashboard and alerting tool, ssh access to the affected hosts, and the runbook index for your services.
Steps
Step 1: Confirm alert
Open the dashboard or run a quick check (e.g. ssh host 'df -h'). Is the condition still true? If not, the alert may be flapping; tune it or add hysteresis. If it is, acknowledge the alert and proceed.
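The confirmation check can be scripted so it is consistent across responders. A minimal sketch, assuming the alert is a disk-usage alert and that the 90% threshold and root mount point match your alert rule (both are assumptions):

```shell
#!/bin/sh
# Confirm a disk-usage alert before acting.
# THRESHOLD and MOUNT are assumptions; copy them from the actual alert rule.
THRESHOLD=90
MOUNT=/

# -P forces POSIX single-line output so awk column positions are stable.
usage=$(df -P "$MOUNT" | awk 'NR==2 {gsub(/%/, "", $5); print $5}')

if [ "$usage" -ge "$THRESHOLD" ]; then
  echo "CONFIRMED: $MOUNT at ${usage}% (>= ${THRESHOLD}%)"
else
  echo "NOT CONFIRMED: $MOUNT at ${usage}% - possible stale or flapping alert"
fi
```

Run it on the alerting host (or via ssh); the "NOT CONFIRMED" branch is the flapping/stale case from Step 1.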
Step 2: Determine scope
Which hosts and which services are affected? Check related alerts (same host, same cluster). Identify user impact (e.g. all users vs. one region). Use the runbook that matches the alert type.
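To tell one instance from many, run the same check across the cluster. A sketch under assumptions: the host names are hypothetical placeholders for your inventory, and the check is the disk-usage example from Step 1.

```shell
#!/bin/sh
# Scope check: run the alert condition on every host in the cluster.
# The host list is a placeholder; pull it from your inventory tool.

check_host() {
  # BatchMode avoids hanging on a password prompt; the fallback makes sure
  # one unreachable host does not abort triage.
  ssh -o ConnectTimeout=5 -o BatchMode=yes "$1" \
    "df -P / | awk 'NR==2 {print \$5}'" 2>/dev/null || echo "unreachable"
}

for h in app-01 app-02 app-03; do
  printf '%s: %s\n' "$h" "$(check_host "$h")"
done
```

If only one host reports the condition, treat it as an instance problem; if many do, look for a shared cause (deploy, dependency, region) before acting host by host.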
Step 3: Execute runbook
Follow the runbook for this alert: e.g. clear disk, restart the service, fail over. If the runbook is missing or does not fit, contain the impact (rollback, disable the feature) and escalate. Do not guess at fixes without a quick sanity check.
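The "quick sanity check" above applies to runbook actions themselves: list what a destructive step would touch before running it. A sketch of a "clear disk" action; the log path and retention window are assumptions, not values from any specific runbook:

```shell
#!/bin/sh
# Sketch of a "clear disk" runbook action: remove old rotated logs.
# LOG_DIR and RETENTION_DAYS are assumptions; use your runbook's values.
LOG_DIR=/var/log
RETENTION_DAYS=7

# Sanity check first: print the candidates without deleting anything.
find "$LOG_DIR" -name '*.gz' -mtime +"$RETENTION_DAYS" -print

# Only after reviewing the list, run the destructive form:
# find "$LOG_DIR" -name '*.gz' -mtime +"$RETENTION_DAYS" -delete
```

The dry-run/delete split is the point: it turns "guessing at a fix" into a reviewable action, at the cost of a few seconds.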
Step 4: Document and follow up
Log what you did and what you found. After resolution, update the runbook if needed; create a post-incident item if the cause was unknown or the fix was one-off. If the alert was a false positive, tune it to reduce noise.
Summary
Confirm the alert, determine scope, run the runbook (or contain and escalate), and document. Use this to handle alerts consistently and to reduce MTTR.
Verification
- Alerts are confirmed before action; runbooks are used; incidents are documented and improved.
Troubleshooting
No runbook — Create one during or after the incident; use a generic triage template.
Too many alerts — Group or deduplicate; tune thresholds and the for-clause; fix the root cause so the condition stops recurring.
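Tuning "thresholds and the for-clause" maps to, e.g., a Prometheus alerting rule, where `for:` requires the condition to hold for a duration before firing, which suppresses flapping. A hedged sketch: the metric names are from node_exporter, and the threshold and duration are assumptions to adapt:

```yaml
# Sketch of a Prometheus alert rule with a for-clause to suppress flapping.
# Threshold (10% free) and duration (10m) are assumptions; tune to your SLOs.
groups:
  - name: disk
    rules:
      - alert: DiskAlmostFull
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.10
        for: 10m  # condition must hold for 10 minutes before the alert fires
        labels:
          severity: page
        annotations:
          summary: "Disk on {{ $labels.instance }} below 10% free"
```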