Disaster Recovery

Disaster recovery in ZFS focuses on recovering from pool failures and using backups (whether local or remote) to restore data efficiently. ZFS offers several mechanisms to handle failures and restore data integrity. In cases of severe pool corruption or hardware failure, backups are critical for recovery.

Recovery Methods Overview

| Recovery Method | Use Case | Description | Command Example |
| --- | --- | --- | --- |
| Resilvering | Disk failure in a mirrored or RAID-Z pool | Rebuilds data on a new disk after a disk failure, using redundant data from the pool. | sudo zpool status (to monitor progress) |
| Snapshot Rollback | Data corruption or accidental deletion | Restores the dataset to a previous point in time by rolling back to a snapshot. | sudo zfs rollback mypool/mydataset@snapshotname |
| Pool Importing | System rebuild or hardware change | Imports a pool to a new system after a rebuild or hardware change, recovering access to the pool. | sudo zpool import mypool |
| Remote Replication | Pool failure requiring offsite recovery | Restores data from a remote backup or replicated system using snapshots and incremental changes. | sudo zfs send ... |
| Offsite Backup | Total system failure, offsite media backup | Restores data from offsite storage, such as an external drive, in case of total system failure. | sudo zfs receive mypool/mydataset < /media/usbdrive/backupfile |
| Scrubbing | Verifying data integrity after recovery | Reads and verifies all data, checking for corruption and repairing issues using redundancy. | sudo zpool scrub mypool |

Resilvering

Resilvering occurs when a failed disk in a mirrored or RAID-Z pool is replaced. ZFS reconstructs the missing data on the new disk from the remaining redundant data. During resilvering, the pool remains in a degraded state until the process completes.

To monitor pool status and check for resilvering progress:

$ sudo zpool status

The status output shows the current state of the pool, any degraded or faulted devices, and the progress of resilvering.
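Progress can also be polled from a script. The following is a minimal sketch, assuming that `zpool status` prints the phrase "resilver in progress" on its scan line while a resilver is running, as current OpenZFS releases do. The function names and the 60-second default interval are illustrative choices, and the device names in the example are placeholders.

```shell
# Succeed (exit 0) while pool $1 reports an active resilver.
# Assumption: `zpool status` prints "resilver in progress" on its scan line.
resilver_running() {
    zpool status "$1" | grep "resilver in progress" >/dev/null
}

# Block until the resilver on pool $1 has finished, polling every $2 seconds.
wait_for_resilver() {
    pool=$1
    interval=${2:-60}
    while resilver_running "$pool"; do
        sleep "$interval"
    done
    echo "no resilver in progress on $pool"
}

# Example (device names are placeholders):
#   sudo zpool replace mypool /dev/sdb /dev/sdc
#   wait_for_resilver mypool 30
```

The functions call `zpool` directly, so run them as a user permitted to query the pool, or as root.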


Snapshot Rollback

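If data is corrupted or accidentally deleted, ZFS can restore the dataset to a previous state by rolling back to a snapshot. This discards all changes made after the snapshot. By default, rollback targets the most recent snapshot; rolling back to an earlier one requires the -r option, which also destroys every snapshot taken after it.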

To roll back to a specific snapshot:

$ sudo zfs rollback mypool/mydataset@snapshotname
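Before rolling back, it helps to see which snapshots exist and which one is newest. The sketch below wraps `zfs list` for that purpose; the helper names `list_snapshots` and `latest_snapshot` are hypothetical and not part of ZFS.

```shell
# Print the snapshots of dataset $1, oldest first, one name per line.
# -H: no header line; -s creation: sort by creation time;
# -d 1: limit the listing to this dataset's own snapshots.
list_snapshots() {
    zfs list -H -t snapshot -o name -s creation -d 1 "$1"
}

# Print only the most recent snapshot of dataset $1.
latest_snapshot() {
    list_snapshots "$1" | tail -n 1
}

# Example:
#   sudo zfs rollback "$(latest_snapshot mypool/mydataset)"
# Rolling back past newer snapshots needs -r, which destroys them:
#   sudo zfs rollback -r mypool/mydataset@snapshotname
```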

Pool Importing

After a catastrophic failure that requires rebuilding the system, ZFS pools can be imported on new hardware. ZFS stores the pool configuration on the pool's own disks, so a pool can be imported on any system running a compatible version of ZFS. After a system rebuild or hardware change, first list the pools that are available for import:

$ sudo zpool import

This command lists available pools not currently imported. Once the target pool is identified, it can be imported with:

$ sudo zpool import mypool
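In a recovery script it is convenient to import the pool only when it is not already present. The sketch below is one way to do that; `pool_imported` and `import_if_needed` are hypothetical helper names.

```shell
# Succeed if pool $1 is already imported on this system.
pool_imported() {
    zpool list -H -o name "$1" >/dev/null 2>&1
}

# Import pool $1 unless it is already imported.
import_if_needed() {
    if pool_imported "$1"; then
        echo "$1 is already imported"
    else
        zpool import "$1" && echo "$1 imported"
    fi
}

# Example:
#   import_if_needed mypool
# If the pool was not exported cleanly by its previous host, the import
# may be refused; `zpool import -f mypool` overrides that check.
```

Use `-f` only when the previous host is certain not to be using the pool any more, because importing a pool on two systems at once can corrupt it.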

Remote Replication

For pools that are replicated regularly to a remote system, data can be restored from that system if the local pool fails and cannot be recovered.

To restore from a remote snapshot:

$ ssh remotesystem sudo zfs send remotepool/mydataset@snapshotname | sudo zfs receive mypool/mydataset

To keep the remote copy current in the first place, incremental replication sends only the changes between two snapshots from the local pool to the remote system:

$ sudo zfs send -i mypool/mydataset@previous_snapshot mypool/mydataset@latest_snapshot | ssh remotesystem sudo zfs receive remotepool/mydataset
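The full and incremental forms can be combined in one helper. The following is a sketch under stated assumptions: `replicate`, `REMOTE_HOST`, and `REMOTE_DATASET` are names invented for this example, and the remote account is assumed to be allowed to run `sudo zfs receive` without a password prompt. An incremental stream can only be received if the previous snapshot already exists on the remote side.

```shell
# Placeholders for this sketch; override them in the environment as needed.
REMOTE_HOST=${REMOTE_HOST:-remotesystem}
REMOTE_DATASET=${REMOTE_DATASET:-remotepool/mydataset}

# Send snapshot $2 of dataset $1 to the remote dataset. If a previous
# snapshot name is given as $3, send only the changes made since it.
replicate() {
    dataset=$1 latest=$2 previous=${3:-}
    if [ -n "$previous" ]; then
        zfs send -i "$dataset@$previous" "$dataset@$latest"
    else
        zfs send "$dataset@$latest"
    fi | ssh "$REMOTE_HOST" sudo zfs receive "$REMOTE_DATASET"
}

# Example:
#   replicate mypool/mydataset latest_snapshot previous_snapshot
```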

Offsite Backup

When offsite backups are kept as stream files on external storage devices, recovery means receiving the backup file into a ZFS pool. This method still works if the primary system is completely lost or irreparable, provided the backup file itself is intact.

To restore from an offsite backup file stored on external media:

$ sudo zfs receive mypool/mydataset < /media/usbdrive/backupfile
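A stream file on external media can be damaged without notice, so it is worth recording a checksum when the file is written and checking it before a restore. The sketch below assumes the `sha256sum` utility from GNU coreutils is available; the helper names are hypothetical and the paths are the placeholders used above.

```shell
# Write a SHA-256 checksum file next to backup file $1.
record_checksum() {
    sha256sum "$1" > "$1.sha256"
}

# Succeed only if backup file $1 still matches its recorded checksum.
verify_checksum() {
    sha256sum -c "$1.sha256" >/dev/null 2>&1
}

# Example:
#   sudo zfs send mypool/mydataset@snapshotname > /media/usbdrive/backupfile
#   record_checksum /media/usbdrive/backupfile
# Later, before restoring:
#   verify_checksum /media/usbdrive/backupfile &&
#       sudo zfs receive mypool/mydataset < /media/usbdrive/backupfile
```

Because `sha256sum` records the path exactly as given, using absolute paths lets the check run from any directory.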

Scrubbing

After recovering from a pool failure or device replacement, it is important to verify data integrity by performing a scrub. Scrubbing reads all data in the pool, verifies checksums, and repairs any damaged blocks for which a redundant copy exists. This process is particularly useful after resilvering.

To start a scrub:

$ sudo zpool scrub mypool

Monitor the progress of the scrub with:

$ sudo zpool status
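Once the scrub has finished, the result can be checked from a script. This is a minimal sketch, assuming that `zpool status` prints the line "errors: No known data errors" for a clean pool, as OpenZFS does; `pool_clean` is a hypothetical helper name.

```shell
# Succeed if `zpool status` for pool $1 reports no known data errors.
# Assumption: a clean pool prints "errors: No known data errors".
pool_clean() {
    zpool status "$1" | grep "errors: No known data errors" >/dev/null
}

# Example:
#   pool_clean mypool || sudo zpool status -v mypool
```

With `-v`, `zpool status` lists the files affected by any permanent errors, which shows what has to be restored from backup.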

Pre-Disaster Best Practices

Proactively implementing best practices can greatly improve the ability to recover from ZFS pool failures and ensure that data is available when needed. These steps help safeguard against catastrophic data loss and minimize downtime in the event of a failure.

Regular Scrubbing

Scrubbing is a key maintenance task in ZFS that detects and repairs silent data corruption before it becomes a critical issue. By regularly performing scrubs, pools can remain healthy and resilient to minor errors that would otherwise go unnoticed.

To ensure ongoing data integrity, schedule regular scrubs as part of a maintenance routine:

$ sudo zpool scrub mypool

Scrubbing should be done periodically based on the size and activity of the pool, typically every few weeks for active pools.
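Scheduling is usually handled by cron or a similar job scheduler. The sketch below defines a hypothetical helper, `scrub_all_pools`, that starts a scrub on every imported pool; the script path and the monthly schedule in the comment are assumptions for illustration. Some distributions already ship a scheduled scrub job with their ZFS packages, so check for one before adding another.

```shell
# Start a scrub on every imported pool and report each one.
scrub_all_pools() {
    zpool list -H -o name | while read -r pool; do
        zpool scrub "$pool" && echo "scrub started on $pool"
    done
}

# Example root crontab entry, assuming the function is saved in a script
# at the hypothetical path /usr/local/sbin/scrub-all.sh that calls it.
# It runs at 03:00 on the first day of every month:
#   0 3 1 * * /usr/local/sbin/scrub-all.sh
```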

Backup Verification

Even the best backup strategies are only effective if the backups themselves are intact and can be restored when needed. Regularly verify that backups are functioning correctly by testing the restoration process. Snapshots and replication processes should be tested to ensure that data can be recovered from them in a real disaster scenario.

Confirm that the expected backup snapshots exist using the following command:

$ sudo zfs list -t snapshot

Additionally, perform test restores from offsite or remote backups to confirm that the recovery process is smooth and the backups are valid.
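A snapshot listing shows that backups exist but not that they are current. The sketch below checks the age of a snapshot with `zfs get -p`, which prints the creation time as seconds since the epoch. The helper name `snapshot_is_recent` and the scratch dataset `mypool/restoretest` in the example are hypothetical.

```shell
# Succeed if snapshot $1 was created within the last $2 seconds.
snapshot_is_recent() {
    created=$(zfs get -H -p -o value creation "$1") || return 1
    now=$(date +%s)
    [ $((now - created)) -le "$2" ]
}

# Example: warn if the snapshot is older than one day (86400 seconds).
#   snapshot_is_recent mypool/mydataset@snapshotname 86400 ||
#       echo "backup snapshot is stale"
# A test restore into a scratch dataset exercises the full recovery path:
#   sudo zfs receive mypool/restoretest < /media/usbdrive/backupfile
```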

Use Both Local and Remote Backups

To mitigate the risk of losing data due to local failures, such as hardware damage, fire, or theft, it's important to use both local and remote backups. Local backups provide fast recovery, while remote or offsite backups offer protection against physical disasters.

Remote replication can be set up using:

$ sudo zfs send mypool/mydataset@snapshotname | ssh remotesystem sudo zfs receive remotepool/mydataset

Offsite backups can be stored on external media or cloud-based solutions, ensuring that data remains accessible in the event of a local disaster.
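Replication works on snapshots, so each run starts by taking one. The sketch below builds a dated snapshot name; the `backup-` prefix and the helper name `dated_snapshot_name` are arbitrary conventions chosen for this example.

```shell
# Print a snapshot name such as mypool/mydataset@backup-2024-01-31
# for dataset $1, using today's date.
dated_snapshot_name() {
    echo "$1@backup-$(date +%Y-%m-%d)"
}

# Example:
#   snap=$(dated_snapshot_name mypool/mydataset)
#   sudo zfs snapshot "$snap"
#   sudo zfs send "$snap" | ssh remotesystem sudo zfs receive remotepool/mydataset
```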

Implement Redundancy

Pools should be configured with redundancy to prevent total data loss in the event of disk failure. ZFS offers options such as RAID-Z and mirroring, which help protect against single or multiple disk failures, depending on the configuration.

For example, RAID-Z distributes parity across the disks in a vdev, so the pool can be rebuilt after one, two, or three disk failures with RAID-Z1, RAID-Z2, or RAID-Z3 respectively.
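The failure tolerance of each layout can be written down as a small lookup. This is an illustrative sketch with a hypothetical helper name; the figures apply to a single vdev, so a pool built from several vdevs is lost if any one of them exceeds its tolerance.

```shell
# Print how many disk failures a single vdev survives without data loss.
# $1: vdev type (mirror, raidz, raidz1, raidz2, raidz3)
# $2: number of disks in the vdev (only used for mirrors)
tolerated_failures() {
    case "$1" in
        mirror)        echo $(($2 - 1)) ;;
        raidz|raidz1)  echo 1 ;;
        raidz2)        echo 2 ;;
        raidz3)        echo 3 ;;
        *)             echo "unknown vdev type: $1" >&2; return 1 ;;
    esac
}

# Example (device names are placeholders):
#   tolerated_failures raidz2 6
#   sudo zpool create mypool raidz2 sda sdb sdc sdd sde sdf
```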

In addition to redundancy, replication to a remote system further enhances data protection by providing an additional copy of the data that can be quickly recovered if the local system encounters irrecoverable failures.