Postgres Backup + Restore Drills: Building Enterprise-Grade Safety

System Admin · September 21, 2022 · 6 min read

Backups You Never Test Are Backups You Cannot Trust

Every PostgreSQL administrator knows they should have backups. Most do. What far fewer do is regularly test their restores. The uncomfortable truth is that a backup only proves its value when you successfully restore from it. A corrupted dump file, an incomplete WAL archive, or a restore process nobody has practiced in eighteen months can turn a recoverable incident into a catastrophic data loss event.

This guide covers the practical side of PostgreSQL backup and restore operations: the difference between logical and physical backups, Point-in-Time Recovery (PITR), integrity verification, and how to run restore drills without disrupting your production environment.

Logical Backups: pg_dump and pg_dumpall

Logical backups export your database as SQL statements (or a custom binary format) that can recreate the schema and data. The primary tool is pg_dump for individual databases and pg_dumpall for the entire cluster including roles and tablespaces.

Advantages

  • Portability: Logical backups can be restored to the same or a newer PostgreSQL major version. This makes them ideal for migrations and upgrades.
  • Selectivity: You can dump specific tables, schemas, or databases — useful for partial restores or data extraction.
  • Human-readable: Plain-text SQL dumps can be inspected, modified, and filtered with standard text tools.

Limitations

  • Speed: For large databases (tens of gigabytes and above), pg_dump can take a long time to complete, and the restore is even slower because it replays individual SQL statements.
  • Point-in-time: A logical backup represents a snapshot of the database at the moment the dump started. You cannot restore to an arbitrary point in time — only to the exact time of the dump.
  • Consistency during heavy writes: pg_dump takes a consistent snapshot using a transaction, but long-running dumps on write-heavy databases can cause bloat and increase WAL generation.

Best Practices

Use the custom format (-Fc) for most backups — it supports compression, parallel restore, and selective restore of individual tables. Compress plain-text dumps with gzip or zstd to reduce storage size. Schedule logical backups during low-traffic periods to minimize the impact on production performance.
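A minimal sketch of the custom-format workflow; the database name `appdb` and the paths are illustrative placeholders:

```shell
# Custom-format dump: compressed, and restorable in parallel or per-table.
pg_dump -Fc --compress=9 -f /backups/appdb_$(date +%Y%m%d).dump appdb

# Restore with four parallel worker jobs into a fresh database:
createdb appdb_restore
pg_restore -j 4 -d appdb_restore /backups/appdb_20220921.dump
```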

Physical Backups: Base Backups and WAL Archiving

Physical backups copy the raw database files at the filesystem level. Combined with continuous WAL (Write-Ahead Log) archiving, physical backups enable Point-in-Time Recovery — the ability to restore the database to any specific moment, not just the time of the last backup.

How It Works

A base backup is a copy of the entire PostgreSQL data directory, taken while the database is running. PostgreSQL's pg_basebackup tool handles this cleanly. After the base backup, continuous WAL archiving copies every WAL segment (the transaction log) to a separate storage location. To restore, you start from the base backup and replay WAL segments up to the desired point in time.

Advantages

  • Point-in-Time Recovery: Restore to any second within the WAL archive. If corruption happens at 14:32:17, you can restore to 14:32:16 and lose only one second of data.
  • Speed: Physical restores are fast because they copy files directly rather than replaying SQL statements.
  • Completeness: Physical backups include everything — all databases, roles, configuration, and internal state.

Limitations

  • Version-specific: Physical backups can only be restored to the same PostgreSQL major version (and, in practice, the same platform and architecture). Cross-version restores require logical backups.
  • All-or-nothing: You cannot selectively restore a single table from a physical backup without restoring the entire cluster (though you can restore to a temporary instance and then dump the specific table).
  • Storage requirements: WAL archiving generates a continuous stream of files that must be stored and managed. High-write workloads generate WAL rapidly, and storage costs can accumulate.

Setting Up PITR

  1. Configure WAL archiving: Set archive_mode = on and archive_command to a script that copies WAL segments to your archive storage (local directory, object storage, or a dedicated backup server).
  2. Take a base backup: Run pg_basebackup to create the initial full backup. Store it alongside the WAL archive.
  3. Verify archiving: Confirm that WAL segments are being archived continuously by checking the archive directory and monitoring for gaps.
  4. Schedule periodic base backups: Take new base backups weekly or daily, depending on your WAL volume. More frequent base backups reduce the number of WAL segments needed for a restore, which reduces restore time.
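Under the assumption that WAL is archived to a local directory `/wal_archive` and base backups land in `/backups` (both placeholder paths), steps 1 and 2 might look like this:

```shell
# Step 1 -- postgresql.conf archiving settings (restart required to
# enable archive_mode). This archive_command refuses to overwrite an
# already-archived segment:
#   archive_mode = on
#   archive_command = 'test ! -f /wal_archive/%f && cp %p /wal_archive/%f'

# Step 2 -- take a base backup as gzip-compressed tar files, streaming
# the WAL needed for consistency (-Xs), with a progress report (-P).
pg_basebackup -D /backups/base_$(date +%Y%m%d) -Ft -z -Xs -P
```

In production you would typically replace the `cp` with a push to object storage or a dedicated backup server, as the text above notes.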

Backup Integrity Verification

Verification is not optional — it is the step that separates a professional backup strategy from a hopeful one.

  • Checksum verification: Enable data checksums in PostgreSQL (initdb --data-checksums for new clusters, or pg_checksums --enable on an existing cluster while it is shut down). Checksums detect silent data corruption in the storage layer.
  • pg_verifybackup: For base backups taken with pg_basebackup, use pg_verifybackup to validate that the backup is complete and uncorrupted.
  • WAL continuity check: Verify that WAL segments are continuous — no gaps. A missing WAL segment breaks PITR and limits your restore to the point before the gap.
  • Test restores: The ultimate verification. Restore the backup to a test environment and verify that the database starts, accepts connections, and contains the expected data.
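The continuity check can be automated. WAL segment file names are 24 hexadecimal digits (timeline, log number, and segment number, 8 digits each), so on a single timeline a gap shows up as a jump between consecutive sorted names. A minimal sketch, assuming the default 16 MB segment size (where the segment part cycles 00000000 to 000000FF before the log part increments) and no timeline switches:

```shell
# check_wal_gaps: read sorted WAL segment names, one per line, on stdin
# and print a line for every gap found. Assumes a single timeline and
# 16 MB segments; .history and .partial files must be filtered out first.
check_wal_gaps() {
  prev=""
  while read -r seg; do
    if [ -n "$prev" ]; then
      tli=${prev%????????????????}        # first 8 hex digits: timeline
      rest=${prev#????????}               # last 16 hex digits
      log=$((0x${rest%????????}))         # middle 8: log number
      sgm=$((0x${rest#????????}))         # final 8: segment number
      if [ "$sgm" -eq 255 ]; then
        expected=$(printf '%s%08X%08X' "$tli" $((log + 1)) 0)
      else
        expected=$(printf '%s%08X%08X' "$tli" "$log" $((sgm + 1)))
      fi
      if [ "$seg" != "$expected" ]; then
        echo "gap after $prev (expected $expected, found $seg)"
      fi
    fi
    prev="$seg"
  done
}
```

Feed it with something like `ls /wal_archive | grep -E '^[0-9A-F]{24}$' | sort | check_wal_gaps` and alert on any output.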

Running Restore Drills

Restore drills should be a scheduled, recurring activity — quarterly at minimum. Here is the process:

Prepare the Test Environment

Use a separate server or a Docker container — never restore a drill backup onto your production server. The test environment should have a compatible PostgreSQL version and enough resources to run the restore.

Execute the Restore

For logical backups: create a fresh database and run pg_restore with the backup file. For physical backups with PITR: restore the base backup, configure the recovery target (either the latest available point or a specific timestamp), and start PostgreSQL in recovery mode. Document each step and the time it takes.
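As a sketch, with placeholder names and paths; the recovery settings shown follow the PostgreSQL 12+ convention (a recovery.signal file plus settings in postgresql.conf):

```shell
# Logical restore: fresh database, four parallel worker jobs.
createdb appdb_drill
pg_restore -j 4 -d appdb_drill /backups/appdb_20220921.dump

# Physical restore with a PITR target:
# 1. Unpack the base backup into an empty data directory (/restore/data).
# 2. Add to postgresql.conf:
#      restore_command = 'cp /wal_archive/%f %p'
#      recovery_target_time = '2022-09-21 14:32:16'
# 3. Signal recovery mode and start the server:
touch /restore/data/recovery.signal
pg_ctl -D /restore/data start
```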

Validate the Data

After the restore completes, validate the data:

  • Check table row counts against known baselines.
  • Run application-specific queries that verify critical data integrity.
  • Start your application against the restored database and verify that it functions correctly.
  • If you specified a PITR target time, verify that the data reflects that specific point.
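Row-count checks are easy to script; the `orders` table and the baseline value here are illustrative placeholders:

```shell
# Compare a critical table's row count in the restored database against
# a baseline recorded from production at backup time.
baseline=1048576
restored=$(psql -At -d appdb_drill -c 'SELECT count(*) FROM orders')
if [ "$restored" != "$baseline" ]; then
  echo "orders: expected $baseline rows, found $restored" >&2
  exit 1
fi
```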

Document and Improve

After each drill, document: the total time from start to a fully operational database, any issues encountered, and improvements for next time. Compare the total restore time against your Recovery Time Objective (RTO). If the drill takes four hours but your RTO is one hour, you need a faster approach — smaller base backups, faster storage, parallel restore, or a warm standby replica that can be promoted.

Backup Storage and Retention

  • Offsite storage: Store at least one copy of your backups in a different geographic location from your production database. Object storage in a different region is the most common approach.
  • Encryption: Encrypt backups at rest. Database dumps contain everything — credentials, personal data, business logic. Use GPG, age, or the encryption features of your storage provider.
  • Retention policy: Keep daily backups for 14 days, weekly backups for 8 weeks, and monthly backups for 12 months. Adjust based on your regulatory requirements and data change rate.
  • Immutability: For ransomware protection, store at least one backup copy in immutable storage that cannot be modified or deleted by the production server or its credentials.
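A tiered retention policy like the one above can be enforced with a small pruning helper; this sketch handles the daily tier (the weekly and monthly tiers work the same way with larger `-mtime` values):

```shell
# prune_daily: delete daily dump files older than 14 days in a directory.
# Only immediate children are considered; matched files are printed as
# they are removed, so the prune run leaves an audit trail in the logs.
prune_daily() {
  find "$1" -maxdepth 1 -type f -name '*.dump' -mtime +14 -print -delete
}
```

Run pruning from cron only after the newest backup has been verified, never before.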

Monitoring Backups

Set up monitoring that alerts you when:

  • A scheduled backup fails to complete.
  • A backup is smaller than expected (possible corruption or incomplete dump).
  • WAL archiving falls behind or shows gaps.
  • Backup storage is approaching capacity limits.
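The size check in particular is cheap to implement; a minimal sketch:

```shell
# backup_size_ok: succeed only if the backup file exists and is at least
# min_bytes long -- a cheap guard against truncated or empty dumps.
backup_size_ok() {
  file=$1
  min_bytes=$2
  [ -f "$file" ] || return 1
  actual=$(wc -c < "$file")
  [ "$actual" -ge "$min_bytes" ]
}
```

A reasonable threshold is a fraction (say 80%) of the previous successful backup's size, rather than a fixed constant.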

A backup that failed silently yesterday and was not noticed until today's failure is a compound risk. Monitor actively, and treat backup failures as incidents that require immediate attention.

The Bottom Line

PostgreSQL backups are not complete until they are verified, tested, and documented. Logical backups for portability and selective restore. Physical backups with PITR for precise recovery. Regular restore drills to prove the process works and measure your actual RTO. Encrypted, offsite, immutable storage to protect against the worst-case scenarios. Build this practice, maintain it, and sleep easier knowing that when the time comes, your data is recoverable.
