Incident triage (when an alert fires)

Topic: Monitoring basics

Summary

When an alert fires, triage quickly: confirm the alert is real, identify scope and impact, and start the right runbook or escalation. Use this as the standard process for handling monitoring alerts and reducing MTTR.

Intent: How-to

Quick answer

  • Confirm: check the current state (dashboard, or run a command). Is the metric still in the alert state, or could the alert be flapping or stale? If confirmed, acknowledge it and start triage.
  • Scope: which service, which host, which region? Is it one instance or many? Check dependent alerts (e.g. a full disk may cause multiple service failures). Use the runbook for this alert type.
  • Act: follow the runbook (restart, scale, clear disk, fail over). If there is no runbook, contain the impact (e.g. roll back, disable the feature) and escalate if needed. Document what you did and what you found for the post-incident review.
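The confirm step above can be sketched as a small shell check; the mount point and 90% threshold are placeholders standing in for the values named by the real alert:

```shell
#!/bin/sh
# Minimal sketch of the confirm step for a disk-usage alert.
# MOUNT and THRESHOLD are placeholders; take the real values from the alert.
MOUNT=/
THRESHOLD=90

# Re-check the condition now instead of trusting the (possibly stale) alert.
usage=$(df -P "$MOUNT" | awk 'NR==2 {gsub("%","",$5); print $5}')

if [ "$usage" -ge "$THRESHOLD" ]; then
    echo "CONFIRMED: $MOUNT at ${usage}% used - acknowledge and start triage"
else
    echo "CLEARED: $MOUNT at ${usage}% used - possible flapping or stale alert"
fi
```

`df -P` forces POSIX single-line output, which keeps the `awk` field positions stable across platforms.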

Prerequisites

  • Access to the monitoring dashboard and alerting system (to check state and acknowledge alerts)
  • Shell access to the affected hosts (e.g. via ssh)
  • Runbooks for common alert types, or a generic triage template

Steps

  1. Confirm alert

    Open the dashboard or run a quick check (e.g. ssh host 'df -h'). Is the condition still true? If not, the alert may be flapping or stale; tune the threshold or add hysteresis. If it is, acknowledge the alert and proceed.
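One way to guard against acting on a flapping alert is to sample the metric twice a few seconds apart; the disk-usage check, 90% threshold, and 5-second interval here are all placeholder assumptions:

```shell
#!/bin/sh
# Flapping guard: the condition must hold across two samples before we act.
# The metric (disk usage of /), threshold, and interval are placeholders.
check() { df -P / | awk 'NR==2 {gsub("%","",$5); print $5}'; }

first=$(check)
sleep 5
second=$(check)

if [ "$first" -ge 90 ] && [ "$second" -ge 90 ]; then
    echo "stable: condition held across both samples - proceed with triage"
else
    echo "not stable: treat as flapping; tune the alert rather than acting"
fi
```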

  2. Determine scope

    Which hosts and which services are affected? Check related alerts (same host, same cluster). Identify user impact (e.g. all users vs. one region). Use the runbook that matches the alert type.
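For the "one instance or many?" question, the same condition can be checked across every filesystem at once (and, in practice, repeated per host); the threshold is a placeholder:

```shell
#!/bin/sh
# Scope check for a disk alert: list every mount over the threshold,
# to tell one full filesystem apart from a host-wide problem.
THRESHOLD=90   # placeholder; use the alert's real threshold

df -P | awk -v t="$THRESHOLD" '
    NR > 1 {
        gsub("%", "", $5)
        if ($5 + 0 >= t) print $6 " at " $5 "% used"
    }'
```

No output means no filesystem on this host is over the threshold, which points the scope question at other hosts instead.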

  3. Execute runbook

    Follow the runbook for this alert type: e.g. clear disk space, restart the service, fail over. If the runbook is missing or does not fit, contain the impact (roll back, disable the feature) and escalate. Do not guess at fixes without a quick sanity check.
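A typical "clear disk" runbook action looks like the sketch below; it runs against a throwaway directory so it is safe to execute anywhere, whereas a real runbook names the exact paths that are safe to remove:

```shell
#!/bin/sh
# Sketch of a "clear disk" runbook action, against a throwaway directory.
# A real runbook names the actual log path and the exact files to remove.
LOGDIR=$(mktemp -d)
echo "recent entries" > "$LOGDIR/app.log"
touch "$LOGDIR/app.log.1.gz"          # stands in for an old rotated log

# Largest files first, so the fix targets what actually fills the disk:
du -a "$LOGDIR" | sort -rn | head -5

# Target only rotated logs, never the live file (add -delete once verified):
find "$LOGDIR" -name '*.gz' -print

rm -rf "$LOGDIR"
```

Listing before deleting is the quick sanity check the step calls for: it confirms the fix targets what is actually consuming space.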

  4. Document and follow up

    Log what you did and what you found. After resolution, update the runbook if needed; create a post-incident item if the cause was unknown or the fix was one-off. If the alert was a false positive, reduce noise by tuning it.
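Documentation can be as light as one timestamped line per action; the log file here is a placeholder (many teams record the same lines in a ticket or chat thread instead), and the alert name and messages are illustrative:

```shell
#!/bin/sh
# Minimal incident documentation: one timestamped line per action taken.
# The log path, alert name, and messages are placeholders.
LOG=$(mktemp)
note() {
    printf '%s\t%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" >> "$LOG"
}

note "disk-full app-01" "confirmed: / at 95% used"
note "disk-full app-01" "cleared rotated logs; usage 95% -> 60%"

cat "$LOG"
rm -f "$LOG"
```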

Summary

Confirm the alert, determine scope, run the runbook (or contain and escalate), and document. Use this to handle alerts consistently and to reduce MTTR.

Verification

  • Alerts are confirmed before action; runbooks are used; incidents are documented and improved.

Troubleshooting

  • No runbook — Create one during or after the incident; use a generic triage template in the meantime.
  • Too many alerts — Group or deduplicate them; tune thresholds and the for-clause; fix the root cause so the condition stops recurring.
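The "for-clause" refers to the hold duration in alerting rules: the condition must stay true for the whole window before the alert fires, which suppresses flapping. A sketch, assuming Prometheus-style rules and node_exporter filesystem metrics (the 90% threshold and 10m window are placeholders):

```yaml
# Example Prometheus-style rule; "for: 10m" means the condition must hold
# for 10 minutes before the alert fires, filtering out brief spikes.
groups:
  - name: disk
    rules:
      - alert: DiskAlmostFull
        expr: (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk over 90% on {{ $labels.instance }}"
```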

Next steps

Continue to