Disaster Recovery for Hosting: RPO/RTO Targets and Point-in-Time Recovery

System Admin · April 13, 2023 · 254 views · 6 min read

Disaster Recovery Is Not a Backup Strategy — It Is a Business Continuity Strategy

Backups protect your data. Disaster recovery protects your business. The distinction matters. A backup tells you that your data exists somewhere safe. Disaster recovery tells you exactly how long it takes to restore that data, bring your services back online, and resume normal operations. Without a tested disaster recovery plan, a backup is just a file on a disk — potentially useful, but with no guaranteed outcome.

This guide covers disaster recovery for hosting customers: defining RPO and RTO targets based on business impact, implementing Point-in-Time Recovery (PITR), building recovery runbooks, and testing your plan so it works when it matters most.

Defining RPO and RTO: The Business Conversation

Recovery Point Objective (RPO)

RPO defines the maximum acceptable data loss measured in time. An RPO of one hour means you can afford to lose up to one hour of data. An RPO of zero means no data loss is acceptable — every transaction must be recoverable.

RPO is a business decision, not a technical one. Ask stakeholders: if we lose the last hour of data, what is the business impact? The last day? The answer determines your backup frequency and replication strategy. An e-commerce store processing continuous orders has a very different RPO requirement than a documentation site updated weekly.

Recovery Time Objective (RTO)

RTO defines the maximum acceptable downtime — from the moment of failure to full service restoration. An RTO of four hours means you commit to being back online within four hours. An RTO of fifteen minutes demands sophisticated automation and pre-staged infrastructure.

Like RPO, RTO is a business decision. Calculate the cost of downtime: lost revenue, damage to customer trust, SLA penalties, and operational disruption. Compare that cost against the investment needed to achieve a shorter RTO. A fifteen-minute RTO costs significantly more than a four-hour RTO in infrastructure and automation.
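As a hypothetical illustration of that trade-off (the hourly figures below are assumptions, not benchmarks -- substitute your own numbers), the comparison is simple arithmetic:

```shell
# Hypothetical figures -- replace with your real business numbers.
revenue_per_hour=2000          # lost revenue per hour of downtime (USD)
sla_penalty_per_hour=500       # contractual SLA penalties per hour (USD)

# Exposure for a 4-hour RTO:
cost_4h=$(( (revenue_per_hour + sla_penalty_per_hour) * 4 ))
echo "4-hour RTO exposure: \$${cost_4h}"

# Exposure for a 15-minute RTO (one quarter-hour, integer arithmetic):
cost_15m=$(( (revenue_per_hour + sla_penalty_per_hour) / 4 ))
echo "15-minute RTO exposure: \$${cost_15m}"
```

If the infrastructure needed to reach the fifteen-minute RTO costs more per year than the exposure it eliminates, the four-hour target may be the rational choice.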

Disaster Scenarios for Hosting Customers

Design your DR plan around realistic scenarios:

  • Server failure: Hardware crash, hypervisor failure, or data center outage. Your primary server becomes unreachable.
  • Data corruption: A bad deployment, a rogue query, or a software bug corrupts database data. The server is running fine, but the data is wrong.
  • Security breach: An attacker gains access and encrypts data (ransomware), deletes records, or modifies content.
  • Human error: Accidental deletion of a database table, misconfigured DNS, or an erroneous deployment that cannot be rolled back.
  • Provider failure: Your hosting provider experiences a prolonged outage affecting your servers.

Each scenario may require a different recovery approach. Server failure requires failover to a standby. Data corruption requires restoring to a point before the corruption. Human error requires either point-in-time recovery or targeted data restoration.

Point-in-Time Recovery (PITR)

PITR is the most powerful recovery capability for database-driven applications. It allows you to restore your database to any specific moment in time, not just the time of the last backup. This is critical for data corruption scenarios where you need to recover to the moment before the corruption occurred.

How PITR Works

PITR relies on two components: a base backup (a full copy of the database) and a continuous archive of the transaction log (in PostgreSQL, Write-Ahead Log (WAL) segments; in MySQL, the binary log). To restore, you start from the base backup and replay the log up to the target point in time. The result is the database state at that exact moment.

Setting Up PITR for PostgreSQL

  1. Enable WAL archiving by setting archive_mode = on and configuring archive_command to copy WAL segments to archive storage.
  2. Take regular base backups using pg_basebackup.
  3. Store both base backups and WAL archives in durable, offsite storage.
  4. Document the restore procedure and test it.
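The steps above might look like the following sketch for PostgreSQL 12 or later. The archive path, backup user, and recovery target time are illustrative assumptions, not prescriptions:

```shell
# --- postgresql.conf: enable continuous WAL archiving (step 1) ---
# archive_mode = on
# archive_command = 'test ! -f /var/lib/pgarchive/%f && cp %p /var/lib/pgarchive/%f'

# Take a base backup with a replication-capable role (step 2):
pg_basebackup -h localhost -U replicator -D /var/backups/pg_base \
              -Ft -z -X stream -P

# --- Restore to a point in time ---
# a) Stop PostgreSQL, clear the data directory, unpack the base backup.
# b) Tell the server where archived WAL lives and when to stop replaying,
#    in postgresql.conf:
# restore_command = 'cp /var/lib/pgarchive/%f %p'
# recovery_target_time = '2023-04-13 09:45:00'
# c) Signal recovery mode and start the server:
touch /var/lib/postgresql/data/recovery.signal
# The server replays WAL up to the target time; recovery_target_action
# controls whether it pauses, promotes, or shuts down at that point.
```

Store both the base backup and the WAL archive offsite (step 3), and rehearse this exact sequence so the restore procedure is documented from experience, not theory (step 4).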

PITR for MySQL

MySQL provides similar capabilities through binary logging. Enable binary logs, take regular full backups with mysqldump or a physical backup tool, and retain binary logs between backups. To restore, import the full backup and replay binary logs up to the target point using mysqlbinlog.
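As a sketch of the MySQL workflow (file names, the binlog position, and the stop time are assumptions for illustration):

```shell
# --- my.cnf: enable binary logging ---
# [mysqld]
# log_bin   = /var/log/mysql/mysql-bin
# server_id = 1

# Full logical backup; --source-data records the binlog position in the
# dump (older MySQL releases call this option --master-data):
mysqldump --single-transaction --source-data=2 --all-databases > full_backup.sql

# --- Restore to a point in time ---
mysql < full_backup.sql
# Replay binary logs from the position recorded in the dump, stopping
# just before the corrupting statement:
mysqlbinlog --start-position=154 \
            --stop-datetime="2023-04-13 09:45:00" \
            /var/log/mysql/mysql-bin.000042 | mysql
```

The `--stop-datetime` boundary is what makes this point-in-time recovery: everything committed before that moment is replayed, and the bad transaction is not.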

Building the Recovery Runbook

A disaster recovery runbook is a step-by-step guide that anyone on your team can follow to restore service. It should be detailed enough that someone who has never performed the recovery can execute it under pressure.

Contents of a Good Runbook

  • Decision tree: What type of disaster is this? (Server failure, data corruption, security breach, human error?) The answer determines which recovery procedure to follow.
  • Contact list: Who needs to be notified? Hosting provider support, team leads, on-call engineers, stakeholders.
  • Step-by-step procedures: For each disaster type, detailed steps for recovery. Include the exact commands, file paths, credentials locations, and verification steps.
  • Verification checklist: After recovery, how do you confirm that everything is working? Check pages, test critical functions, verify data integrity, confirm monitoring is active.
  • Communication templates: Pre-written status page updates, customer notifications, and stakeholder briefings for different scenarios.
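The verification checklist lends itself to a small smoke-test script run after every recovery. The URL, endpoints, and the data-integrity query below are assumptions to adapt to your application:

```shell
#!/bin/sh
# Post-recovery smoke test -- exits non-zero on the first failure.
set -e

BASE_URL="https://example.com"   # assumption: your site's public URL

# 1. Homepage responds with HTTP 200.
code=$(curl -s -o /dev/null -w '%{http_code}' "$BASE_URL/")
[ "$code" = "200" ] || { echo "FAIL: homepage returned $code"; exit 1; }

# 2. A critical function works (here: the login page renders).
curl -sf "$BASE_URL/login" > /dev/null || { echo "FAIL: /login unreachable"; exit 1; }

# 3. Data integrity spot check (hypothetical table and column):
#    the newest record should be recent, not hours old.
psql -h db -U app -Atc \
  "SELECT now() - max(created_at) < interval '1 hour' FROM orders;"

echo "Smoke test passed"
```

Keeping this script in the runbook turns "verify everything works" from a judgment call into a repeatable command.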

Infrastructure for Fast Recovery

Warm Standby

A warm standby is a secondary server that receives replicated data from the primary but does not serve traffic. When the primary fails, the standby is promoted to primary, and DNS is updated to point to it. This provides an RTO of minutes to tens of minutes, depending on the promotion process and DNS TTL.

Hot Standby

A hot standby receives replicated data and can serve read traffic (read replicas). When the primary fails, the hot standby is promoted to read-write and takes over immediately. Combined with automated failover detection and DNS switching, this achieves RTO of under a minute.

Cold Recovery

Cold recovery restores from a backup onto a freshly provisioned server. It is the slowest approach, but the simplest and cheapest. RTO depends on backup size, restore speed, and the time to provision and configure the new server. For small to mid-size databases, cold recovery typically takes one to four hours.

Testing Your DR Plan

An untested DR plan is a collection of assumptions. Test regularly:

Tabletop Exercises

Walk through a disaster scenario verbally as a team. Discuss who does what, identify gaps in the runbook, and verify that contact information and access credentials are current. Low effort, high value. Do this quarterly.

Restore Drills

Execute the full recovery procedure on a test environment. Restore from a backup, verify data integrity, and measure the actual RTO. Do this at least twice a year.

Failover Tests

If you have a standby or replica, test the promotion process. Simulate a primary failure and execute the failover. Verify that the promoted server serves traffic correctly and that no data was lost. Do this at least annually.
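For a PostgreSQL streaming-replication standby, an annual failover drill might be sketched like this. Host names, paths, and the `orders` table are assumptions:

```shell
# 1. Simulate primary failure (in a drill, simply stop the service):
ssh primary 'sudo systemctl stop postgresql'

# 2. Promote the standby to read-write:
ssh standby 'sudo -u postgres pg_ctl promote -D /var/lib/postgresql/data'

# 3. Verify the promoted server accepts writes:
psql -h standby -U app -c "CREATE TEMP TABLE failover_check (ok int);"

# 4. Spot-check for data loss: confirm the most recent transactions
#    made it across before the failover (hypothetical table):
psql -h standby -U app -Atc "SELECT max(created_at) FROM orders;"

# 5. Point DNS (or your load balancer) at the standby, then measure the
#    total elapsed time from step 1 -- that number is your actual RTO.
```

Record the measured RTO and any data gap found in step 4; those observations are what you compare against your targets.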

Document Everything

After each test, document what worked, what failed, and what needs improvement. Update the runbook based on findings. Track actual RTO and RPO against targets and adjust your infrastructure or procedures if you are not meeting them.

RPO/RTO Target Guide

  • Personal blogs, low-traffic sites: RPO: 24 hours. RTO: 8-24 hours. Daily backups, manual recovery.
  • Business websites, moderate traffic: RPO: 1-4 hours. RTO: 1-4 hours. Frequent backups or PITR, documented recovery procedures.
  • E-commerce, SaaS applications: RPO: minutes. RTO: 15-60 minutes. PITR, warm or hot standby, automated failover.
  • Mission-critical, high-transaction: RPO: zero (synchronous replication). RTO: under 5 minutes. Hot standby with automated failover and health monitoring.

The Bottom Line

Disaster recovery is about ensuring your business survives the worst day. Define RPO and RTO based on real business impact. Implement PITR for precise data recovery. Build runbooks that work under pressure. Test everything — tabletop exercises, restore drills, failover tests. The time to discover that your recovery plan has a fatal flaw is during a test, not during a disaster. Invest in recovery before you need it, and the day you do need it will be manageable instead of catastrophic.

MySQL · WordPress · Backup · Linux · PostgreSQL