Incident Response for Websites: A Lightweight Runbook You Can Actually Use
Incidents Happen — Preparedness Is What Separates Recovery from Chaos
No website is immune to incidents. Servers crash, deployments go wrong, databases become corrupted, and attacks succeed despite your best defenses. The difference between a twenty-minute recovery and a twelve-hour disaster usually comes down to one thing: whether the team had a plan before the incident started. A runbook — a documented set of procedures for detecting, triaging, communicating, and recovering from incidents — turns panic into process.
This guide provides a lightweight, practical incident response runbook designed for hosting customers and small teams. It is not a fifty-page enterprise playbook. It is the minimum effective set of procedures that help you respond quickly, communicate clearly, and recover fully.
Phase 1: Detection
You cannot respond to an incident you do not know about. Detection is the first phase, and it depends on two things: automated monitoring and clear escalation paths.
Automated Monitoring
Your monitoring system should alert you to problems before your customers notice them. At minimum, monitor HTTP availability, response time (TTFB and latency percentiles), error rates (5xx responses), SSL certificate validity, and DNS resolution. Alerts should fire through channels your team actively monitors — messaging platforms, email, SMS, or push notifications. Configure alerts with sensible thresholds to minimize false positives while catching real issues promptly.
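A minimal sketch of these checks in Python, assuming a single endpoint and illustrative thresholds: the probe fetches a URL once, and a separate pure function judges status and latency so the decision logic is easy to test and tune. Wire the result into whatever alert channel your team actually watches.

```python
import time
import urllib.request

def evaluate_probe(status, latency, max_latency=2.0):
    """Pure decision logic: a probe passes if the response is not a 5xx
    and arrived within the latency budget (seconds)."""
    ok = status < 500 and latency <= max_latency
    return {"ok": ok, "status": status, "latency": round(latency, 3)}

def probe(url, timeout=10, max_latency=2.0):
    """Fetch the URL once and evaluate the result."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return evaluate_probe(resp.status, time.monotonic() - start,
                                  max_latency)
    except Exception as exc:  # DNS failure, timeout, TLS error, etc.
        return {"ok": False, "error": str(exc)}
```

In practice you would run this on a schedule from a location outside your own network, and alert only after consecutive failures to keep false positives down.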
Customer Reports
Sometimes customers detect problems before monitoring does — especially for issues that affect specific user flows or regions. Have a clear channel for customers to report problems (support tickets, status page, email) and ensure those reports reach the team responsible for incident response quickly.
Declaring an Incident
Not every alert is an incident. An incident is an unplanned event that causes or risks causing significant impact to users. Define what "significant" means for your business: full site outage, degraded performance affecting more than a certain percentage of users, security breach, or data loss. When the criteria are met, declare an incident and move to triage.
Phase 2: Triage
Triage is about understanding the scope and severity of the incident so you can prioritize the response correctly.
Severity Levels
Define severity levels in advance. A common framework:
- Critical: Full site outage, data breach, or complete loss of a core business function. All hands on deck, immediate response required.
- Major: Significant degradation affecting many users, but the site is partially functional. Urgent response, but not every team member needs to drop everything.
- Minor: Limited impact affecting a small number of users or a non-critical function. Respond within business hours.
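The framework above can be encoded as a small triage helper so severity is assigned consistently rather than argued about mid-incident. The input fields and the 25% threshold are assumptions; map them to whatever signals your monitoring actually provides.

```python
def classify_severity(full_outage=False, data_breach=False,
                      core_function_down=False, pct_users_affected=0.0):
    """Return 'critical', 'major', or 'minor' per the framework above.
    Thresholds are illustrative -- define your own in advance."""
    if full_outage or data_breach or core_function_down:
        return "critical"
    if pct_users_affected >= 0.25:  # "many users" -- tune for your business
        return "major"
    return "minor"
```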
Initial Assessment
Ask these questions quickly:
- What is broken? (Website down, specific feature broken, performance degraded, security breach?)
- How many users are affected? (All users, specific region, specific browser/device?)
- When did it start? (Check monitoring timestamps, deployment logs, change history.)
- What changed recently? (Deployment, configuration change, DNS update, hosting provider maintenance?)
The last question — "what changed?" — resolves the majority of incidents. Most problems are caused by recent changes, and identifying the change often points directly to the fix.
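"What changed?" can be partly automated: if you keep a change log of deploys, configuration edits, and DNS updates, a few lines of Python can list everything that landed in the window just before the incident, most recent first. The log format here is an assumption; feed it from your deployment tool or change history.

```python
from datetime import timedelta

def recent_changes(change_log, incident_start, window_hours=24):
    """change_log: list of dicts with a 'time' datetime. Returns changes
    made within `window_hours` before the incident, most recent first --
    the most likely culprits."""
    cutoff = incident_start - timedelta(hours=window_hours)
    candidates = [c for c in change_log
                  if cutoff <= c["time"] <= incident_start]
    return sorted(candidates, key=lambda c: c["time"], reverse=True)
```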
Phase 3: Communication
During an incident, clear communication is as important as the technical fix. Three audiences need information: your team, your customers, and your stakeholders.
Internal Communication
Designate a communication channel (a dedicated messaging channel, a war room, a conference call) where everyone involved in the incident shares updates. Keep the channel focused — troubleshooting happens here, not casual conversation. Assign roles: who is leading the technical investigation, who is handling customer communication, who is making decisions about escalation.
Customer Communication
Post a status update as soon as you know there is an incident, even if you do not yet know the cause. Customers prefer "We are aware of an issue and investigating" over silence. Update the status page at regular intervals (every 15-30 minutes for critical incidents) with what you know, what you are doing, and when you expect the next update.
Be honest and factual. Avoid technical jargon in customer-facing messages. Do not make promises you cannot keep — "we expect resolution within the hour" is only appropriate if you have genuine confidence in that timeline.
Stakeholder Communication
Leadership and business stakeholders need a summary: what is happening, what is the business impact, and what is the team doing about it. Keep these updates concise and focused on impact and timeline rather than technical details.
Phase 4: Resolution
Resolution is the technical work of fixing the incident. The approach depends on the cause:
Rollback
If the incident was caused by a deployment, rollback is usually the fastest resolution. Revert to the previous version of the application code, database migration, or configuration change. This is why having a tested rollback procedure is essential — during an incident is not the time to figure out how to revert a deployment.
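One common rollback pattern, sketched here under assumptions: deployments live in timestamped release directories and a `current` symlink points at the live one, so rolling back means atomically repointing the symlink at the previous release. Your deployment tooling may differ; the point is that the procedure is scripted and tested before you need it.

```python
import os

def rollback(releases_dir, current_link):
    """Repoint `current_link` at the release before the live one.
    Assumes release directory names sort chronologically (e.g. timestamps)."""
    releases = sorted(os.listdir(releases_dir))
    live = os.path.basename(os.path.realpath(current_link))
    idx = releases.index(live)
    if idx == 0:
        raise RuntimeError("no earlier release to roll back to")
    previous = os.path.join(releases_dir, releases[idx - 1])
    tmp = current_link + ".tmp"
    os.symlink(previous, tmp)
    os.replace(tmp, current_link)  # atomic swap: no moment with no symlink
    return previous
```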
Restart
For issues caused by resource exhaustion (memory leaks, connection pool depletion, disk space), restarting the affected service often provides immediate relief. This is a temporary fix — the underlying cause still needs investigation — but it restores service quickly while the team investigates.
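Before reaching for a restart, it helps to confirm resource exhaustion is actually the cause: if the disk is full, restarting alone will not help. A minimal sketch, with an illustrative threshold and a systemd-based restart helper as an assumption about your platform:

```python
import shutil
import subprocess

def disk_pressure(path="/", threshold=0.90):
    """True if the filesystem holding `path` is above `threshold` full."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total >= threshold

def restart_service(name):
    """Restart a service. Assumes a systemd host and sufficient
    privileges -- adapt to your platform and process manager."""
    subprocess.run(["systemctl", "restart", name], check=True)
```

Record what you observed (memory, connections, disk) before restarting; that evidence is what the post-incident investigation of the underlying cause will need.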
Failover
If you have redundant infrastructure (a standby server, a read replica that can be promoted, a CDN that can serve cached content while the origin is down), failover is the fastest path to restoring service. The failed component can be investigated and repaired offline.
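The failover decision itself can be reduced to a small, testable function: given an ordered list of backends (primary first), serve from the first one whose health check passes. The backend names and health checks here are placeholders for your real infrastructure.

```python
def pick_backend(backends):
    """backends: ordered list of (name, health_check) pairs, preferred
    first. Returns the first healthy backend's name, or None if all fail."""
    for name, health_check in backends:
        try:
            if health_check():
                return name
        except Exception:
            continue  # a probe that errors out counts as unhealthy
    return None
```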
Hotfix
When the cause is a specific code bug or configuration error, a targeted fix deployed directly to production may be appropriate. Hotfixes should be minimal, focused changes — not an opportunity to bundle other improvements. Test the hotfix on staging (or at least a local environment) before deploying to production.
Phase 5: Recovery and Verification
After the fix is in place, verify that the incident is genuinely resolved:
- Check monitoring: are uptime, response times, and error rates back to normal?
- Test critical user flows: can users log in, view content, make purchases, submit forms?
- Verify from multiple locations: is the fix effective globally, or only in your local cache?
- Monitor closely for the next few hours: some fixes mask the root cause, and the problem may recur.
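The checklist above can be run as a script so "resolved" means every check passed, not just the one that was failing. The individual checks are placeholders: in practice each would be a real probe of a critical flow (login, checkout, form submission).

```python
def verify_recovery(checks):
    """checks: dict mapping a flow name to a zero-argument callable that
    returns True on success. Returns (all_passed, list_of_failures)."""
    failures = [name for name, check in checks.items() if not check()]
    return (not failures, failures)
```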
Once you are confident that service is fully restored, update the status page, notify customers and stakeholders, and close the incident.
Phase 6: Postmortem
The postmortem is where learning happens. Within 48 hours of resolving the incident, write a brief document that covers:
- Timeline: What happened, when, and in what sequence.
- Root cause: Why did the incident happen? Trace it to the underlying cause, not just the immediate trigger.
- Impact: How many users were affected? What was the duration? What was the business impact?
- What went well: What helped during the response? Effective monitoring, quick communication, reliable backups?
- What could improve: What slowed the response? Missing runbooks, unclear ownership, inadequate monitoring?
- Action items: Specific, assigned tasks that reduce the likelihood or impact of similar incidents. These are not vague intentions — they are concrete tasks with owners and deadlines.
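A simple way to make the structure above stick is to generate every postmortem from the same skeleton, so no section gets skipped under time pressure. A sketch that emits a Markdown document with the sections listed above:

```python
def postmortem_template(title, date):
    """Return a Markdown skeleton with the standard postmortem sections."""
    sections = ["Timeline", "Root cause", "Impact", "What went well",
                "What could improve", "Action items (owner + deadline)"]
    lines = [f"# Postmortem: {title} ({date})", ""]
    for section in sections:
        lines += [f"## {section}", "", "TODO", ""]
    return "\n".join(lines)
```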
Keep postmortems blameless. The goal is to improve systems and processes, not to assign fault to individuals. Blame drives hiding; learning drives improvement.
Maintaining the Runbook
A runbook is only useful if it is current, accessible, and practiced. Store it where your team can find it during an emergency — a shared document, a wiki, a pinned message in your incident channel. Review it quarterly. Update it after every incident that reveals a gap. Practice the procedures periodically — a tabletop exercise where you simulate an incident and walk through the runbook costs very little time and reveals weaknesses before a real incident exposes them.
Getting Started
You do not need a perfect runbook to start. Document the basics: who to contact, how to check monitoring, how to roll back a deployment, how to restart services, and where to post status updates. That alone puts you ahead of most teams. Refine it after each incident, and over time you will have a robust, battle-tested playbook that makes incidents manageable instead of catastrophic.