How to set up disk, CPU, and memory alerts
Topic: Monitoring basics
Summary
Define alert rules for disk space, CPU usage, and memory (or swap) so you are notified before outages. Use thresholds and hysteresis to avoid flapping. Use this when configuring a monitoring system (e.g. Prometheus and Alertmanager, or cloud monitoring).
Intent: How-to
Quick answer
- Disk: alert when filesystem usage is >85% (warning) and >95% (critical). Compute usage as 1 - node_filesystem_avail_bytes / node_filesystem_size_bytes, or the equivalent in your monitoring system. Alert on the mount points that matter (e.g. /, /var).
- CPU: alert when usage stays high for a sustained period (e.g. 5m average >80%) so brief spikes do not alert.
- Memory: alert when available memory is low (e.g. <10% of total, based on node_memory_MemAvailable_bytes) or when swap usage is growing.
- Use a for clause (e.g. for: 5m) so the alert fires only after the condition has held for that long; add hysteresis (a different threshold for resolving) if needed. Route alerts to email, Slack, or PagerDuty, and document runbooks.
Prerequisites
Steps
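- Disk alerts
Warning rule: (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes > 0.85, held for 5m. Use a second rule with > 0.95 for critical. Filter by the mountpoint label (e.g. /, /var) so that only the filesystems that matter alert. The alert resolves once usage drops back below the threshold.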
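The disk rule above could be written as a Prometheus rule file along the following lines. This is a minimal sketch, assuming node_exporter metrics; the group name, alert names, and the mount point regex are illustrative choices, not fixed conventions.

```yaml
# Sketch of a Prometheus rule file for disk usage.
# Group and alert names are assumptions made for this example.
groups:
  - name: disk-alerts
    rules:
      - alert: DiskUsageWarning
        # Fraction of the filesystem in use, limited to / and /var.
        expr: |
          (
            node_filesystem_size_bytes{mountpoint=~"/|/var"}
            - node_filesystem_avail_bytes{mountpoint=~"/|/var"}
          ) / node_filesystem_size_bytes{mountpoint=~"/|/var"} > 0.85
        for: 5m                 # condition must hold for 5 minutes before firing
        labels:
          severity: warning
        annotations:
          summary: "Disk usage above 85% on {{ $labels.instance }} ({{ $labels.mountpoint }})"

      - alert: DiskUsageCritical
        # Same expression with the critical threshold.
        expr: |
          (
            node_filesystem_size_bytes{mountpoint=~"/|/var"}
            - node_filesystem_avail_bytes{mountpoint=~"/|/var"}
          ) / node_filesystem_size_bytes{mountpoint=~"/|/var"} > 0.95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk usage above 95% on {{ $labels.instance }} ({{ $labels.mountpoint }})"
```

With this layout the warning alert keeps firing while the critical one is active; whether to suppress the warning in that case is a routing decision.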
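- CPU and memory alerts
CPU: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80, held for 5m. The by (instance) grouping yields one value per host; without it, avg collapses all hosts into a single series. Memory: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1, held for 5m. Swap: alert when node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes keeps growing or exceeds a threshold you choose.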
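As a rule file, the CPU, memory, and swap conditions above might look like the sketch below. It assumes node_exporter metrics; the alert names, the 50% swap threshold, and the 100 MiB growth figure are assumptions chosen for illustration and should be tuned per environment.

```yaml
# Sketch of CPU, memory, and swap alert rules.
# Alert names and the swap thresholds are assumptions for this example.
groups:
  - name: cpu-memory-alerts
    rules:
      - alert: HighCpuUsage
        # Per-host CPU usage in percent: 100 minus the average idle share.
        # "by (instance)" keeps one series per host instead of one global average.
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning

      - alert: LowMemoryAvailable
        # Less than 10% of total memory is available.
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
        for: 5m
        labels:
          severity: warning

      - alert: HighSwapUsage
        # More than half of swap is in use (assumed threshold).
        # Hosts without swap produce NaN here and therefore never match.
        expr: (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes > 0.5
        for: 10m
        labels:
          severity: warning

      - alert: SwapUsageGrowing
        # Free swap shrank by more than 100 MiB over 30 minutes (assumed figure).
        expr: delta(node_memory_SwapFree_bytes[30m]) < -104857600
        for: 10m
        labels:
          severity: warning
```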
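- For and hysteresis
Use for: 5m so transient spikes do not alert. Optionally, layer the thresholds: warning at 80%, critical at 95%, and resolve only when the value falls below 75%, so that an alert hovering around its threshold does not flap. A Prometheus alerting rule has a single expression, so a separate resolve threshold has to be encoded in that expression.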
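One way to approximate a separate resolve threshold in Prometheus is to let the rule refer to its own firing state through the built-in ALERTS series. The sketch below uses the 80% and 75% figures from the text; the alert name and the label list in the on (...) clause are assumptions, and the pattern is worth checking against your own data before relying on it.

```yaml
# Sketch of threshold hysteresis: fire above 80%, keep firing until below 75%.
# The alert name and the matching labels are assumptions for this example.
groups:
  - name: hysteresis-example
    rules:
      - alert: DiskUsageWarning
        expr: |
          (
            # Trigger condition: usage above 80%.
            (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) > 0.80
          )
          or
          (
            # Hold condition: usage still above 75% AND this alert is already firing.
            (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) > 0.75
            and on (instance, device, mountpoint)
            ALERTS{alertname="DiskUsageWarning", alertstate="firing"}
          )
        for: 5m
        labels:
          severity: warning
```

The first branch starts the alert; the second keeps it active while usage sits between the two thresholds, so the alert resolves only after usage drops below 75%.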
- Route and runbook
Send alerts to Alertmanager or your cloud alerting service, and route them to a team channel or PagerDuty. Add a runbook link or a short summary to each alert (e.g. 'High disk: see runbook disk-full') and document how to fix the problem.
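The routing described above might be expressed in an Alertmanager configuration like the sketch below. Receiver names, the Slack channel, the webhook URL, and the PagerDuty key are placeholders, and the severity label is assumed to be set by the alert rules.

```yaml
# Sketch of an Alertmanager configuration.
# All names, the channel, the URL, and the key are placeholders.
route:
  receiver: team-slack              # default: everything goes to the team channel
  group_by: ['alertname', 'instance']
  routes:
    - matchers:
        - severity="critical"       # critical alerts page the on-call person
      receiver: oncall-pagerduty

receivers:
  - name: team-slack
    slack_configs:
      - channel: '#infra-alerts'
        api_url: 'https://hooks.slack.com/services/REPLACE/ME'

  - name: oncall-pagerduty
    pagerduty_configs:
      - routing_key: 'REPLACE_WITH_INTEGRATION_KEY'
```

The runbook link itself usually lives on the rule side, as an annotation next to the summary; a name such as runbook_url is a common convention rather than a requirement.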
Summary
Define disk, CPU, and memory alert rules with thresholds and a for clause; route alerts and add runbooks. Use this to get notified before resources are exhausted.
Prerequisites
Steps
Step 1: Disk alerts
Alert when filesystem usage is above 85% (warning) and 95% (critical) for a sustained period.
Step 2: CPU and memory alerts
Alert on sustained high CPU and low available memory; optionally on swap growth.
Step 3: For and hysteresis
Use a for clause to ignore transient spikes; use a separate, lower resolve threshold if alerts flap.
Step 4: Route and runbook
Route alerts to the right channel; link or describe runbooks for each alert.
Verification
- Alerts fire once the condition has held for the for duration, resolve when the condition clears, and each alert has an available runbook.
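For Prometheus rules, the firing behaviour can be checked offline with a promtool unit test. The sketch below assumes a hypothetical alerts.yml that defines an alert named LowMemoryAvailable with the expression node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1, for: 5m, a severity: warning label, and no annotations; the host name and sample values are made up for the test.

```yaml
# alerts_test.yml: unit test sketch for a hypothetical alerts.yml.
# If the rule defines annotations, list them under exp_annotations as well.
rule_files:
  - alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # 5% of memory available for the whole test window.
      - series: 'node_memory_MemAvailable_bytes{instance="host1"}'
        values: '50+0x15'
      - series: 'node_memory_MemTotal_bytes{instance="host1"}'
        values: '1000+0x15'
    alert_rule_test:
      # Before the 5m hold time has passed, nothing should be firing.
      - eval_time: 2m
        alertname: LowMemoryAvailable
        exp_alerts: []
      # After the hold time, the alert should be firing with these labels.
      - eval_time: 10m
        alertname: LowMemoryAvailable
        exp_alerts:
          - exp_labels:
              severity: warning
              instance: host1
```

Running promtool check rules alerts.yml validates the rule syntax, and promtool test rules alerts_test.yml runs the test.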
Troubleshooting
- Alert storm: increase the for duration or the thresholds; add hysteresis.
- Missing alerts: check metric names and labels; verify that targets are being scraped and that alert rules are being evaluated.