Incident response checklist for Linux servers
Topic: Linux servers
Summary
When something goes wrong: preserve logs and state, identify the scope (one host or many), restore service or isolate the host, then find and fix the root cause. Following the same checklist each time keeps incident handling consistent and preserves evidence for the post-mortem.
Intent: Checklist
Quick answer
- Preserve: copy logs (journalctl, /var/log), process list (ps), network (ss); do not reboot or wipe until you have what you need for diagnosis.
- Scope: one server or many? one service or whole system? check monitoring and dependent systems; communicate to stakeholders.
- Restore: fix config, restart service, free disk, or fail over; then root-cause (logs, changes, timeline) and document post-mortem and prevention.
Steps
- Preserve evidence
journalctl -u UNIT -b > unit.log; cp -a /var/log /backup/logs-$(date +%F); ps aux > ps.txt; ss -tulnp > ss.txt. Replace UNIT with the affected systemd unit. Note the current time and the most recent changes.
- Determine scope
Is the app down or the whole server? Check other hosts and dependencies; check monitoring and recent deploys or config changes.
- Restore or isolate
Restart the service, free disk space, revert the config, or fail over to a standby; if a compromise is suspected, isolate the host from the network and escalate.
- Root-cause and document
Review logs and timeline; identify cause (config, resource, bug, external); write post-mortem: what happened, cause, fix, prevention; update runbooks.
Summary
You will follow an incident checklist: preserve logs and state, determine scope, restore or isolate, then root-cause and document. Use this for any outage or suspected compromise so response is consistent and repeatable.
Prerequisites
- Access to the server (SSH or console); backup and monitoring if available.
Steps
Step 1: Preserve evidence
Save logs, process list, and network state before rebooting or reconfiguring.
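A minimal sketch of this step, assuming a POSIX shell and the same tools the checklist names (journalctl, ps, ss, cp). The function name collect_evidence, its arguments, and the output file names are illustrative choices, not part of the checklist. Tools that are not installed are skipped rather than treated as errors.

```shell
# Sketch: gather logs, process list, and socket state into one
# timestamped directory. All names here are hypothetical.
collect_evidence() {
    unit="$1"                    # systemd unit to capture; may be empty
    out_base="${2:-/backup}"     # where the evidence directory is created
    log_dir="${3:-/var/log}"     # log directory to copy
    out="$out_base/incident-$(date +%Y%m%dT%H%M%S)"
    mkdir -p "$out" || return 1

    # Record when the capture was taken (UTC) for the later timeline.
    date -u +%Y-%m-%dT%H:%M:%SZ > "$out/collected-at.txt"

    # Process list and listening sockets.
    ps aux > "$out/ps.txt" 2>&1
    if command -v ss >/dev/null 2>&1; then
        ss -tulnp > "$out/ss.txt" 2>&1
    fi

    # Unit log for the current boot, only if a unit was given.
    if [ -n "$unit" ] && command -v journalctl >/dev/null 2>&1; then
        journalctl -u "$unit" -b --no-pager > "$out/unit.log" 2>&1
    fi

    # Copy the log directory, keeping ownership and timestamps.
    if [ -d "$log_dir" ]; then
        cp -a "$log_dir" "$out/var-log" 2> "$out/cp-errors.txt"
    fi

    # Print the evidence directory so the caller can record it.
    echo "$out"
}
```

It could be called as `collect_evidence UNIT /backup`, run as root so that ss can show process names and every log file is readable. Writing to a different disk than the one under suspicion avoids making a full filesystem worse.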
Step 2: Determine scope
One host or many? One service or system-wide? Check dependencies and recent changes.
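One quick way to tell a single-service problem from a system-wide one is to look at shared resources such as disk space. The sketch below is an assumption about how that check might look, not part of the checklist: the helper name full_filesystems and the default 90 percent threshold are hypothetical. It reads `df -P` output on stdin, so it also works on output saved during evidence collection.

```shell
# Sketch: list mount points at or above a usage threshold.
# Input is `df -P` output; mount points containing spaces are not handled.
full_filesystems() {
    threshold="${1:-90}"
    awk -v t="$threshold" '
        NR > 1 {
            use = $5
            sub(/%/, "", use)            # "95%" -> "95"
            if (use + 0 >= t) print $6, $5
        }'
}

# Typical use on a live host:
#   df -P | full_filesystems 90
#   systemctl is-active UNIT     # is the one service down?
#   systemctl --failed           # or are several units failing?
```

A full root or log filesystem usually points to a host-wide problem, while a single failed unit on an otherwise healthy host suggests the scope is one service.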
Step 3: Restore or isolate
Fix and restart, or fail over; if a compromise is suspected, isolate the host and escalate.
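Reverting a config is only easy if a known-good copy exists. The sketch below shows one way to keep such a copy before editing. The helper names backup_config and revert_config and the `.bak-` suffix are hypothetical, not from the checklist.

```shell
# Sketch: save a timestamped copy of a config file before editing it,
# and print the path of the copy.
backup_config() {
    src="$1"
    [ -f "$src" ] || { echo "no such file: $src" >&2; return 1; }
    dst="$src.bak-$(date +%Y%m%dT%H%M%S)"
    cp -p "$src" "$dst" && echo "$dst"
}

# Sketch: restore a saved copy over the live file.
revert_config() {
    saved="$1"
    live="$2"
    cp -p "$saved" "$live"
}

# After reverting, the service still has to pick up the file, e.g.:
#   systemctl restart UNIT
```

Keeping the broken version as well (for example by calling backup_config again before reverting) preserves it as evidence for the root-cause step.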
Step 4: Root-cause and document
Analyze logs and timeline; write post-mortem and update runbooks and monitoring.
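A post-mortem is easier to write consistently from a fixed skeleton. The sketch below generates one with the sections the checklist names (what happened, cause, fix, prevention). The function name new_postmortem and the exact layout are illustrative assumptions.

```shell
# Sketch: write an empty post-mortem skeleton to a file.
new_postmortem() {
    title="$1"
    out="$2"
    cat > "$out" <<EOF
Post-mortem: $title
Written: $(date -u +%Y-%m-%d)

What happened:
Timeline:
Cause (config, resource, bug, external):
Fix:
Prevention:
Runbook and monitoring updates:
EOF
}
```

For the timeline section, `journalctl --since` and `--until` with `-o short-iso` can narrow the journal to the incident window and print unambiguous timestamps.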
Verification
- Service restored or host isolated; evidence saved; post-mortem written and shared.
Troubleshooting
Need to reboot to restore: take a snapshot or copy the logs first, then reboot and continue the diagnosis from the saved data.
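Whether the systemd journal survives a reboot depends on how journald is configured. As a rough heuristic, and only as a sketch, the check below looks for a non-empty /var/log/journal directory, which is where journald keeps persistent logs. The function name journal_is_persistent is hypothetical, and the Storage= setting in journald.conf can override this behaviour, so treat the result as a hint rather than a guarantee.

```shell
# Sketch: succeed if the journal directory exists and is not empty.
journal_is_persistent() {
    dir="${1:-/var/log/journal}"
    [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]
}

# Typical use before a forced reboot:
#   if journal_is_persistent; then
#       echo "journal should survive the reboot"
#   else
#       echo "copy the journal output before rebooting"
#   fi
```

If the check passes, `journalctl --list-boots` shows the boots on record and `journalctl -b -1` reads the previous boot's log after the restart. The process list and socket state never survive a reboot, so the ps and ss output still has to be saved beforehand.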