My data backup strategy

Backing up data is one of those tasks that is easy to postpone because it takes a bit of effort to set up properly. Unfortunately, you often only realize how important a backup is at the moment you suddenly need one: disks fail, computers get lost or stolen and mistakes happen. There is no shortage of software designed to make backups easier, such as Duplicity, Restic or BorgBackup, but I wanted to keep things as simple as possible. This article describes how I back up my personal data, which I hope will be useful to others who are still working out a backup strategy of their own.

Backing up the right things

Before choosing a backup method, I first needed to decide which data is actually worth backing up. For me, the most important data lives in my home directory. This includes software projects, scanned documents and personal photos and videos. I generally do not bother backing up system configuration files, since using GNU Guix I can easily restore this state from an operating-system declaration. Similarly, Guix also provides a way to define and easily instantiate home environments, meaning I can also exclude most per-user configuration files. On a separate disk, I have a modest collection of entertainment media such as books, music, films, television shows and video games, which I would also like to back up. In comparison to the files in the home directory, the data here only changes occasionally when I add new media to the collection.

Backing up the things right

A good backup plan should include at least one off-site copy, so that a localized disaster, such as theft, fire or flooding, does not take out both the original data and its backup at the same time. I do not rely solely on cloud storage for the off-site copy, for several reasons. One concern is availability. Cloud backups often depend on credentials that are stored on the very machine to be restored, creating a chicken-and-egg problem. Even if I had the credentials on hand, cloud services can still be inaccessible due to outages, account lockouts or billing problems, all of which are outside my control. There is also a practical limitation: with several terabytes of data, backing up everything to the cloud would be slow and expensive. Note also that RAID does not substitute for a proper backup, since it protects against drive failure but not against accidental deletion, corruption or theft.

Therefore, I bought two identical 8 TB hard drives for the backup. Each drive is set up as an encrypted LUKS volume with a btrfs filesystem on top. Encryption is essential because I do not want the data to be exposed if a drive is lost or stolen. I chose btrfs because it is a flexible and mature filesystem that is built directly into the Linux kernel and features checksumming to detect silent data corruption, transparent file compression and snapshots of previous versions of files. The plan is to rotate the two hard drives on a regular schedule, so that the off-site copy is periodically updated with recent changes.
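
A minimal sketch of how such a drive could be prepared is shown below. It assumes the usual layering, with btrfs created inside an unlocked LUKS container. The function name, the device path, the mapper name and the mount point under /mnt are placeholders chosen for illustration, not my exact setup, and zstd is just one of the compression algorithms btrfs supports.

```shell
#!/bin/sh
# Sketch only: formats a drive as LUKS and creates btrfs inside it.
# WARNING: this destroys all data on the given device. Run as root.
# Usage (hypothetical device and name): prepare_backup_drive /dev/sdX backup1
prepare_backup_drive() {
    device=$1
    name=$2

    # Create the encrypted LUKS container (prompts for a passphrase).
    cryptsetup luksFormat "$device" || return 1
    # Unlock it; the plaintext device appears as /dev/mapper/$name.
    cryptsetup open "$device" "$name" || return 1
    # Create the btrfs filesystem inside the unlocked container.
    mkfs.btrfs -L "$name" "/dev/mapper/$name" || return 1
    # Mount it with transparent zstd compression enabled.
    mkdir -p "/mnt/$name" &&
        mount -o compress=zstd "/dev/mapper/$name" "/mnt/$name"
}
```

The function is only defined here and never called, so nothing is formatted until it is invoked with a real device.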

Two important metrics used to determine the backup plan are the recovery time objective (RTO) and the recovery point objective (RPO). RTO is the maximum acceptable downtime before the system is restored and usable again, whereas RPO is the maximum acceptable amount of data loss, measured as the time between the last backup and the point of failure. I aim to be able to fully set up and restore a working system within a single weekend day. If the system fails, I should be able to install GNU Guix, pull the system configuration, run guix system reconfigure and retrieve the contents of my home directory from the backup drive within that time. I chose to perform weekly backups because I only use my personal computer a few times each week. If I am working more often on a project, I keep work-in-progress copies on cloud storage or a USB stick until the next scheduled backup. I rotate the two backup drives every four weeks. In the worst case, where the computer and the on-site drive are lost together, that means losing up to about a month of data, which is acceptable for most of what I store. If something is particularly important, I can always perform an additional rotation as an exception.

The tools I use

I want the backup method to be recoverable with standard Unix tools. That rules out most backup software that invents its own archive format. If a tool becomes unmaintained or disappears, I do not want my archives to become unreadable.

The most basic approach is to use cp with the recursive flag -R.

cp -R /path/to/source /path/to/dest

The version of cp from GNU Coreutils provides extra functionality. The -a flag preserves file metadata, preserves symbolic links and enables recursive copying. The -u flag performs an update, only overwriting files that are newer in the source tree, which helps avoid unnecessary rewrites of unchanged data.

cp -au /path/to/source /path/to/dest

For a more portable solution using only POSIX functionality, pax can be run in copy mode with -rw. The -p e option preserves as much file metadata as possible, and -u skips files that are older than an existing file of the same name in the destination.

pax -rw -p e -u /path/to/source /path/to/dest

These commands keep existing files in place and only copy newer versions. Files that have been deleted from the source remain in the destination, but they can be removed afterwards using find and rm as below. For convenience, you can use my mirror script, which performs these copying and deletion steps to synchronize the destination tree with the source tree.

(cd /path/to/dest && find . -depth \! \( -exec test -e /path/to/source/{} \; -o -exec test -L /path/to/source/{} \; \) -exec rm -rf {} +)

This command walks the destination tree from inside the destination directory, so that each relative path can be appended to the source directory, and selects entries for which no corresponding path exists in the source. Note that substituting {} inside a larger argument, as in /path/to/source/{}, is supported by GNU find but is not guaranteed by POSIX. The -depth option ensures that the contents of a directory are processed before the directory itself. The test expression checks whether a path exists in the source either as a file of any type or as a symbolic link; the second check is necessary because test -e follows symlinks and would otherwise treat dangling links as missing. Any entries that exist only in the destination are then removed with rm -rf.
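
To show how the copy and deletion steps above fit together, here is a small sketch of a mirror function. It is not my actual mirror script, only an illustration, and the function name and the demonstration paths are made up. It assumes GNU cp and GNU find, and it runs find from inside the destination so that the relative paths can be appended to the source directory.

```shell
#!/bin/sh
# mirror SRC DST: make DST a copy of SRC (assumes GNU cp and GNU find).
mirror() {
    # Resolve SRC to an absolute path, because find runs from inside DST.
    src=$(cd "$1" && pwd) || return 1
    dst=$2
    mkdir -p "$dst" || return 1

    # Step 1: copy new and updated files, preserving metadata.
    # The trailing "/." copies the contents of SRC rather than SRC itself.
    cp -au "$src/." "$dst/" || return 1

    # Step 2: remove entries that no longer exist in the source.
    (
        cd "$dst" || exit 1
        find . -depth \
            \! \( -exec test -e "$src"/{} \; -o -exec test -L "$src"/{} \; \) \
            -exec rm -rf {} +
    )
}

# Demonstration in a throwaway directory.
work=$(mktemp -d)
mkdir -p "$work/source/docs"
echo one > "$work/source/docs/a.txt"
echo two > "$work/source/docs/b.txt"
mirror "$work/source" "$work/dest"    # dest now holds a.txt and b.txt
rm "$work/source/docs/b.txt"
mirror "$work/source" "$work/dest"    # b.txt is pruned from dest
ls -R "$work/dest"
rm -rf "$work"
```

Because the deletion step uses rm -rf, it is worth trying a function like this on scratch directories before pointing it at a real backup.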

An alternative is to use rsync. With the -a option, rsync preserves ownership, permissions, timestamps, symlinks and other common file attributes. Combined with --delete, it also removes files from the destination that no longer exist in the source, keeping the two trees in sync. This requires an external tool, however, and there have been some concerns about the introduction of contributions from AI agents in the code base. If you need to transmit the files to a remote machine without rsync, tar can be used in conjunction with SSH instead.

rsync -a --delete /path/to/source /path/to/dest
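
For the remote case mentioned above, a tar-over-SSH transfer could be sketched as follows. The function name, host and paths are hypothetical. Unlike rsync, this sends every file each time and does not delete anything on the remote side.

```shell
#!/bin/sh
# push_tree SRC HOST DEST: copy the contents of SRC to DEST on HOST.
# Assumes DEST contains no single quotes, since it is quoted for the
# remote shell below.
push_tree() {
    src=$1
    host=$2
    dest=$3

    # Pack SRC to stdout, stream it through SSH and unpack on the far side.
    # -p on extraction preserves permissions; -C changes directory first.
    tar -cf - -C "$src" . |
        ssh "$host" "mkdir -p '$dest' && tar -xpf - -C '$dest'"
}

# Example with a hypothetical host and paths:
# push_tree /path/to/source user@backup-host /path/to/dest
```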

This approach mirrors the current state of the source and does not retain older versions of files. If you want multiple versioned backups, there are a few ways to do this. One is to build snapshots manually using hard links or run rsync with the --link-dest option. Another is to make use of filesystem-level snapshots using btrfs. However, I prefer to use GNU tar for incremental backups with the --listed-incremental option as below.

tar --create --listed-incremental=/var/backups/home.snar --file=dest/$(date +%F).tar source

This produces a standard tar archive containing only the changes since the previous backup. The snapshot file is only required when creating new incremental archives; the archives themselves can be extracted without it. When restoring, the archives are unpacked one by one in chronological order. A disadvantage is that restoring each archive in sequence can be time-consuming. For this reason, it is useful to periodically create a fresh full backup.
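
The following sketch walks through one full backup, one incremental backup and a restore in a throwaway directory. It assumes GNU tar, since --listed-incremental is a GNU extension, and the function and file names are invented for the example. On extraction, passing /dev/null as the snapshot file makes GNU tar treat the archives as incremental, so files deleted between backups are removed again during the restore.

```shell
#!/bin/sh
# Requires GNU tar; --listed-incremental is a GNU extension.

# backup_incremental PARENT NAME SNAR ARCHIVE
# Archive directory NAME (inside PARENT) into ARCHIVE. If SNAR does not
# exist yet, the result is a full (level 0) backup; otherwise only the
# changes since the previous run are stored.
backup_incremental() {
    tar --create --listed-incremental="$3" --file="$4" -C "$1" "$2"
}

# restore_chain DEST ARCHIVE...
# Unpack the archives in the order given (oldest first).
restore_chain() {
    dest=$1
    shift
    for archive in "$@"; do
        tar --extract --listed-incremental=/dev/null \
            --file="$archive" -C "$dest" || return 1
    done
}

# Demonstration in a throwaway directory.
work=$(mktemp -d)
mkdir -p "$work/source" "$work/archives" "$work/restore"
echo v1 > "$work/source/notes.txt"
echo obsolete > "$work/source/old.txt"
backup_incremental "$work" source "$work/home.snar" "$work/archives/0.tar"

sleep 1    # make sure the changes get a later timestamp than the backup
echo v2 > "$work/source/notes.txt"
rm "$work/source/old.txt"
backup_incremental "$work" source "$work/home.snar" "$work/archives/1.tar"

restore_chain "$work/restore" "$work/archives/0.tar" "$work/archives/1.tar"
ls "$work/restore/source"    # expected: only notes.txt, holding "v2"
rm -rf "$work"
```

With archives named after ISO dates, as produced by date +%F, a shell glob such as dest/*.tar already expands in chronological order, which is convenient for the restore loop.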

For the slowly changing entertainment media collection, I simply mirror the files from the source disk to the backup directory, since retaining older versions is not important. As for the home directory, I create incremental tar archives and generate a new full backup each time the backup drive is rotated. The resulting archives are then encrypted using GnuPG.
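
As an illustration of the encryption step, the sketch below wraps the two GnuPG invocations involved. It assumes public-key encryption to a recipient whose key is already in the keyring; symmetric encryption with gpg --symmetric would work as well. The function names, key ID and file name are hypothetical.

```shell
#!/bin/sh
# encrypt_archive ARCHIVE RECIPIENT: write ARCHIVE.gpg next to ARCHIVE,
# encrypted to RECIPIENT's public key. The plaintext archive is kept.
encrypt_archive() {
    gpg --encrypt --recipient "$2" --output "$1.gpg" "$1"
}

# decrypt_archive ENCRYPTED OUTPUT: recover the original tar archive.
decrypt_archive() {
    gpg --decrypt --output "$2" "$1"
}

# Example with a hypothetical key ID and file name:
# encrypt_archive /mnt/backup1/2024-01-07.tar backup@example.org
```

With public-key encryption, the private key needs a backup of its own outside these archives, because without it the archives cannot be decrypted.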

Conclusion

In practice, this setup works fairly well and gives me some peace of mind that I will be able to recover my data even if an accident happens. It relies only on standard tools that are widely available, which keeps the recovery process straightforward. At the moment, the workflow is still fairly manual, since the backup drive needs to be plugged in each time a backup is created. One way to improve this would be to connect the backup drive to a small single-board computer and turn it into a network-attached storage (NAS) device, which can then be mounted over the Network File System (NFS). From there, periodic backups could be automated using cron jobs. That is a project for another time, though. In the meantime, I hope this gives you some ideas for your own backup strategy.