Note: if you want to encrypt your backup device, see the follow-up article: rsync backup to encrypted volume.
It's important to back up your computer files so you don't lose them if your PC fails. But if you have gigabytes of data you want to keep safe, simply copying all of your files from one place to another will likely take hours.
Luckily Linux offers a command-line tool called rsync which only copies a file to a backup location if the file is new or has changed since the last backup. This selective copying usually means an rsync backup takes a fraction of the time taken by a simple "copy all" backup.
Seeing as you ought to backup your system frequently, it makes sense to create a backup script which runs customised rsync commands to backup all of the files you want to keep, but none of those you don't. This page is all about creating a Bash script which runs an rsync backup. I'm using Kubuntu 14.10, but the advice on this page should apply to users of other Linux distributions (though you may need to vary some steps, so check carefully for any incompatible commands, options, etc).
Important note: this page is intended to help people discover how useful rsync can be. You follow the advice on this page at your own risk, so make sure you understand the consequences of any commands or actions you intend to take. Read the man pages for any commands or flags with which you're not familiar.
The first step is deciding which directories need regular backup. Even though Linux keeps most user files in the /home/ directories, it's possible that you have valuable files stored elsewhere on the system. I have a file backup checklist which might help you to avoid forgetting anything.
Once you've decided which directories need to be part of the backup, it's time to decide which of their sub-directories do not need to be part of the backup. For example, in Ubuntu Linux when running the GNOME desktop, each home directory contains a hidden directory called ".gvfs" which caused me a lot of trouble when I first started using rsync (the directory seems to refer to itself, so that rsync gets stuck in an infinite loop of backing up the same files over and over again). You may also want to avoid backing up your deleted items, so the ".local/share/Trash" directory needs to be skipped. You might also have other directories which you don't need in your backup, so note these down.
A useful tool is the Disk Usage Analyser. This can scan a directory tree and graphically show you how much disk space each directory uses. This sort of analysis can turn up some surprises, such as a log file which has grown to a massive size.
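If you prefer the terminal, the du command can produce a similar per-directory summary. The /home/bob path below is only an example; point it at whichever directory you want to inspect:
du -h --max-depth=1 /home/bob | sort -h
The sort -h step simply orders the entries by human-readable size, so the largest directories end up at the bottom of the list.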
Finally, to reduce the amount of backup time and storage needed, it makes a lot of sense to go through your selected directories and see whether there are old or duplicated files which can be deleted. A useful tool for finding duplicate files is FSlint which can search a target directory and report on either duplicate files or empty directories, both of which clutter up your system. Just make sure you carefully read the instructions for FSlint before you use its delete or merge options because they can radically change your filesystem.
A backup should survive the failure of your main hard disk, so it makes little sense to save your backup to a path on that disk. This means you're likely to be using an external hard disk, a network-attached storage device on your own network, or a storage server on a remote network.
Before you can write to a hard disk you need to have a suitably large target partition on the external disk. The easiest way to do this is to connect the external drive (which Ubuntu will mount automatically for you) and then run the graphical tool GParted (which ought to be installed by default in Ubuntu and can be found in the main menu under System / Administration). With your USB hard drive connected you can see the details of any existing partitions (and make certain you are looking at your USB-connected hard disk and not one of your main system hard disks). Then if necessary you can tell GParted to delete partitions, resize partitions, or create new partitions in unallocated space.
For your backup volume, it makes sense to create a partition which uses the same file system as the volume which hosts your personal files. For most Ubuntu users their home directory will be held on a partition which is formatted using the ext3 or ext4 filesystem (and you can check this using GParted), so an ext3 or ext4 partition is probably the best choice for your backup volume. You also need to make the backup volume partition at least large enough to contain all of the files you intend to backup, but it will usually be wise to make the backup partition at least two or three times bigger than currently necessary, as you're very likely to need more space as time goes on. If you want to keep historical backup copies (for example, one for each year) then you'll need even more space.
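If you'd rather check from a terminal than in GParted, df can report the filesystem type of the partition holding a given path (again, /home/bob is only an example path):
df -T /home/bob
The second column of the output shows the filesystem type, such as ext4.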
Note that if you need to be able to read the partition from a Windows operating system, you'll need to use either the NTFS or FAT32 format because Windows has no time for non-Microsoft formats. The problem with using one of these formats is that they won't be able to correctly store your Linux file structure, and information such as symlinks and owner and group permissions will be discarded when writing to an NTFS or FAT32 partition. The data in your files ought to remain intact, but the loss of metadata can mean that restoring from backup results in a very different data landscape to what you had originally, and you'll have to spend a lot of time redefining permissions on your restored files.
Note: many newer versions of Linux (including Kubuntu 14.10) will mount an external hard drive partition to a consistent path based on the volume label given to the partition. For example, if the partition has the volume label "Buffalo_backup" then when user bob attaches the drive and mounts that partition, Kubuntu will mount it at path /media/bob/Buffalo_backup and this path should be the same each time. If you find this is the case in your Linux desktop then you can skip this section and leave your /etc/fstab file untouched.
Sometimes, however, Linux will allocate a different device name each time you connect a USB hard drive. For instance the drive might appear as /dev/sdd5 one time and then /dev/sde4 another time. Because this will be a nuisance from the point of view of a backup script, we need to add the UUID of our USB backup target volume to the /etc/fstab file so that we can mount the drive using the same command each time. The UUID of the backup partition can be found in GParted by right-clicking on the target partition and selecting "Information". Next you need to decide where you want this backup target mounted, such as /media/Buffalo_backup for instance, and also what mount options should be used. Add all this information to the /etc/fstab file in a line which will look something like this:
# The Buffalo DriveStation backup partition
UUID=5fc5ac7d-085e-4f3f-b5b2-6c7ee32b3d9c /media/Buffalo_backup ext4
noauto,group,relatime,journal_checksum,auto_da_alloc
(Note that the second and third lines should actually all be one single line, but they won't fit on this webpage as one line.)
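If you prefer the command line to GParted, the blkid tool can also report the UUID of the backup partition. The device name /dev/sdd5 below is just an example; check which device name your backup partition currently has (for instance with lsblk) before running it:
sudo blkid /dev/sdd5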
The group mount option is a little strange, as it tells Linux to permit mounting of this partition to any user who is a member of the same group as the special device which represents this partition, such as /dev/sdd5. But USB devices don't always get the same special device path, and the owner and group values for special devices seem to get reset to "root" and "disk" regularly anyway. So if you want to limit mounting of your backup partition to a particular set of users, you can use the group mount option and then make those users members of the "disk" group. (I'd much rather see an explicit "group=somegroup" mount option, but this method will have to do for now.)
Also note the noauto mount option. This tells Linux not to automatically mount the drive (on bootup for example). However, Ubuntu still seems to attempt to mount the partition if you connect the USB hard drive after bootup, which is annoying as it generates a permissions error if you're using the group mount option. I don't know how to force Ubuntu (or GNOME) to obey the noauto option.
The mount path (/media/Buffalo_backup in our example) should be an empty directory and it should already exist, so make sure to create the path using mkdir if it does not already exist. Also, the mount point path must not contain spaces as this might cause trouble later on.
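For example, assuming the /media/Buffalo_backup mount point used above and a user called bob (both only examples), you could create the mount point and add bob to the "disk" group like this:
sudo mkdir --parents /media/Buffalo_backup
sudo usermod --append --groups disk bob
Note that bob will need to log out and back in before the new group membership takes effect.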
Now users who belong to group "disk" ought to be able to mount the backup partition on the USB hard drive by simply typing mount /media/Buffalo_backup into a terminal (while the USB drive is connected and powered on, obviously). This makes it easy to mount the drive from within a script, as the partition can now be mounted with the same command every time.
If, for example, you have a network-attached storage drive on your network at 192.168.178.250 which offers a Samba share called "share" available to Samba user "sam" and you want the mountpoint to be /media/Maxtor_backup and marked as being owned by Linux user account "bob" and Linux group "bob", then you can do this in one easy command:
sudo mount -t cifs -o username=sam,uid=bob,gid=bob \
//192.168.178.250/share /media/Maxtor_backup
(Note that a backslash followed immediately by a single newline is ignored by Bash, so they are used on this webpage to avoid scrollbars where long commands appear. But you don't have to have these line breaks in your own script.)
If the mount point path (/media/Maxtor_backup in this example) does not already exist, you'll need to create it with mkdir first. Your mount point should be an empty directory (to avoid confusion, otherwise the files in that directory will be temporarily unreachable while the backup volume is mounted on the same path). Also, the mount point path must not contain spaces because this may lead to trouble later.
If the connection is made successfully, you will be prompted to enter the password for the Samba user account "sam". (Note that you'll first be asked for your Linux password if sudo has not already been used in the last few minutes.)
If your target Samba server supports CIFS Unix extensions then you may not need to specify the uid and gid arguments. But if you do specify these arguments, or if the target server does not support the CIFS Unix extensions, be aware that the owner and group information for your files will probably not be applied to the backup copies. This is unlikely to affect the actual content of the backup files, but it will mean that if you need to restore your files from this Samba share then every file will be owned by the owner and group specified in the uid and gid arguments. The mode information will also be discarded, and symlinks will be skipped or throw a warning. This is likely to mean a lot of permissions restructuring on your restored filesystem to get it to resemble your original filesystem.
If you're connecting to a remote server (that is, you're sending your backup to a machine which is not on your local network) then you don't need to mount your backup volume, because you simply tell the rsync command to connect to a remote path. However, this is an advanced topic and is not covered by this page. See the rsync and rsyncd.conf man pages for more information about sending a backup to a remote server.
(If you are sending your backup to a remote machine, be aware that the rsyncd.conf man page recommends using SSH to encrypt the transfer, because using the rsync daemon protocol currently offers no encryption of the data in transit.)
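Purely as an illustration of the shape of such a command, a push to a remote machine over SSH might look like the line below. The hostname backup.example.com, the user name bob, and the destination path are all made up; the real flags and paths would depend on your own setup and on the configuration of the remote server:
rsync --archive --verbose -e ssh /var/www/ bob@backup.example.com:/srv/backup/var/www/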
Suppose we've identified the /var/www/ directory tree (that is, everything in /var/www/ and its sub-directories and so on) as being in need of regular backup. And that we've identified that the /var/www/.Trash-1000/ directory does not need to be part of the backup. And that we don't want any hidden directories to be included in the backup. Then if our backup volume is an ext (Linux) filesystem mounted at /media/Buffalo_backup/ we might use the following call to rsync:
sudo rsync --archive --hard-links \
--verbose --human-readable --itemize-changes --progress \
--delete --delete-excluded --exclude='/.Trash-1000/' --exclude='/.*/' \
/var/www/ /media/Buffalo_backup/var/www/
(Again, note that a backslash followed immediately by a single newline is ignored by Bash, and you don't have to have these line breaks in your own script, it just makes this webpage look tidier.)
This command basically says copy new and changed files from the /var/www/ directory tree to the /media/Buffalo_backup/var/www/ directory tree. It's very important to end both of these paths with a forward slash, otherwise the behaviour of rsync changes and the wrong set of files will be backed up to the wrong destination directory. Also note that rsync will not create the backup directory for you, and will throw an error if the directory does not already exist. So before executing the rsync command you may want to call:
mkdir --parents /media/Buffalo_backup/var/www/
The behaviour of rsync can be modified using its many different flags (or options). The flags I've used are:
--archive : a shorthand which is equivalent to --recursive, --links, --perms, --times, --group, --owner, and --devices --specials (which are all described below). Do not use --archive if your backup volume does not support Linux permissions, owner/group information, symlinks, devices and special files, because permission and owner information will be discarded and you'll see warnings generated for every symlink, device or special file encountered. Instead just specify the flags that do apply to your backup volume, such as --recursive and --times.
--recursive : tells rsync to descend into sub-directories, so the whole directory tree is backed up.
--links : copies symlinks as symlinks rather than following them. If you want rsync to follow symlinks and copy the files they point to, look at the --copy-links flag. See the rsync man page for the details, and also read about the --copy-unsafe-links and --safe-links flags which may be of interest.
--perms : preserves the permissions of each file on the backup copy.
--times : preserves the last-modification timestamps, which rsync relies upon to decide whether a file has changed since the last backup. If your backup volume uses FAT32 (which stores timestamps less precisely), look at the --modify-window flag.
--group and --owner : preserve the group and owner of each file (--owner only takes effect when rsync is run as super-user).
--devices --specials : preserve device files and special files (again, this requires super-user privileges).
--hard-links : looks for files which are hard-linked together in the source and recreates the links in the backup rather than storing the content twice.
--verbose : makes rsync report more about what it is doing.
--human-readable : outputs sizes in a human-readable format (for example 1.2M rather than a long string of digits).
--itemize-changes : prints a short change summary for every file which is created, updated or deleted.
--progress : shows progress information while each file is being transferred.
--delete : deletes files from the backup directory which no longer exist in the source directory, so that the backup mirrors the source.
--delete-excluded : also deletes files from the backup directory which are now excluded by an --exclude flag.
--exclude : skips any file or directory which matches the given pattern. One of the --exclude flags above uses the pattern '/.*/' which means that every hidden directory (whose name begins with a dot) will be skipped by rsync. (Warning: I cannot recommend excluding all hidden directories, because sometimes very important files are kept in hidden directories. For example, a Git repository stores its commit data in a .git directory, and forgetting to backup your Git repository data will hurt. It's much better to exclude specific hidden directories which you are sure you don't want.)
This is just a small fraction of the total number of flags that rsync offers to modify the way it behaves. Make sure to see the man page for rsync to find out whether any of the other flags are more suitable to your requirements (and to understand better the flags listed on this page).
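One extra flag worth knowing about while you're experimenting is --dry-run: combined with --verbose or --itemize-changes it makes rsync report what it would copy or delete without actually changing anything, which is a relatively safe way to test an exclude pattern before running the real backup. For example, reusing the /var/www/ command from above:
sudo rsync --archive --verbose --itemize-changes --dry-run \
--delete --delete-excluded --exclude='/.Trash-1000/' --exclude='/.*/' \
/var/www/ /media/Buffalo_backup/var/www/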
Given that rsync has so many options, once you've crafted the rsync command that does what you need it's a good idea to store it in a script so that you can easily call it again in future. The example script below wraps up the following actions: it checks whether the backup volume is mounted and mounts it if necessary, builds a backup path which includes the current year, creates the target directories, runs the rsync commands while logging their output to a file via the tee command, and finally asks whether the backup volume should be unmounted.
Here is the script in full:
#!/bin/bash
# Script to backup personal files to the external USB drive.
# Specify the mount point here (DO NOT end mount_point with a forward-slash).
mount_point='/media/Buffalo_backup'
echo "#####"
echo ""
# Check whether target volume is mounted, and mount it if not.
if ! mountpoint -q ${mount_point}/; then
echo "Mounting the external USB drive."
echo "Mountpoint is ${mount_point}"
if ! mount ${mount_point}; then
echo "An error code was returned by mount command!"
exit 5
else echo "Mounted successfully.";
fi
else echo "${mount_point} is already mounted.";
fi
# Target volume **must** be mounted by this point. If not, die screaming.
if ! mountpoint -q ${mount_point}/; then
echo "Mounting failed! Cannot run backup without backup volume!"
exit 1
fi
echo "Preparing to transfer differences using rsync."
# Use the year to create a new backup directory each year.
current_year=`date +%Y`
# Now construct the backup path, specifying the mount point followed by the path
# to our backup directory, finishing with the current year.
# (DO NOT end backup_path with a forward-slash.)
backup_path=${mount_point}'/rsync-backup/'${current_year}
echo "Backup storage directory path is ${backup_path}"
echo "Starting backup of /home/bob . . . "
# Create the target directory path if it does not already exist.
mkdir --parents ${backup_path}/home/bob/
# Use rsync to do the backup, and pipe output to tee command (so it gets saved
# to file AND output to screen).
# Note that the 2>&1 part simply instructs errors to be sent to standard output
# so that we see them in our output file.
sudo rsync --archive --verbose --human-readable --itemize-changes --progress \
--delete --delete-excluded \
--exclude='/.gvfs/' --exclude='/Examples/' --exclude='/.local/share/Trash/' \
--exclude='/.thumbnails/' --exclude='/transient-items/' \
/home/bob/ ${backup_path}/home/bob/ 2>&1 | tee /home/bob/rsync-output.txt
echo "Starting backup of /var/www . . . "
mkdir --parents ${backup_path}/var/www/
# This time use the -a flag with the tee command, so that it appends to the end
# of the rsync-output.txt file rather than start a new file from scratch.
sudo rsync --archive --verbose --human-readable --itemize-changes --progress \
--delete --delete-excluded \
--exclude='/.Trash-1000/' \
/var/www/ ${backup_path}/var/www/ 2>&1 | tee -a /home/bob/rsync-output.txt
# Ask user whether target volume should be unmounted.
echo -n "Do you want to unmount ${mount_point} (no)"
read -p ": " unmount_answer
unmount_answer=${unmount_answer,,} # make lowercase
if [ "$unmount_answer" == "y" ] || [ "$unmount_answer" == "yes" ]; then
if ! umount ${mount_point}; then
echo "An error code was returned by umount command!"
exit 5
else echo "Dismounted successfully.";
fi
else echo "Volume remains mounted.";
fi
echo ""
echo "####"
To modify this Bash script to suit you, first you need to change the value of the mount_point variable so that it matches the path at which you'll mount your backup volume. For an external USB drive this should match whatever mount path you've specified for the backup volume in your /etc/fstab file, as described earlier on this page, or whatever mount path Linux consistently uses when mounting the backup volume. For a Samba network share this needs to be whatever mount path you intend to use for the mount command.
If your backup volume is on an external USB drive then the simple mount ${mount_point} line can stay as it is. However, if your backup is to be written to a local Samba network share then you need to replace this mount command with something similar to the command suggested in the section about Samba earlier on this page, but make sure to use ${mount_point} in the command at the point where the mount path goes.
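As a rough sketch, and assuming the same Samba details as the earlier example (the server address, share name and account names are only illustrative), the mount call inside the script might become:
sudo mount -t cifs -o username=sam,uid=bob,gid=bob \
//192.168.178.250/share ${mount_point}
Be aware that this will prompt for the Samba password every time the script runs; mount.cifs also supports a credentials= option pointing at a file containing the username and password, which you may prefer for unattended runs (see the mount.cifs man page).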
In the above script I've created a variable called current_year and then used it and mount_point to create a final variable called backup_path which is the actual path to which the backup files will be written. By using the current year, calculated using the date command, you can automatically create a new backup each year. This is a good idea, just in case your files ever become damaged or corrupted and you don't notice until after running your backup script. Then you can at least refer back to a previous year's backup of the file. In fact, you could do this monthly rather than yearly if you need to be paranoid. But bear in mind that every time you start a fresh backup, rsync will have to copy everything (because it's starting from scratch in a new, empty backup directory and won't be able to simply copy new and changed files since the last backup took place). This not only takes more time, it also takes up much more disk space.
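If you do decide on monthly rather than yearly backup sets, one way (just a sketch, with current_month as a suggested variable name) is to include the month in the date format and use that variable when building backup_path:
current_month=`date +%Y-%m`
backup_path=${mount_point}'/rsync-backup/'${current_month}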
Modify the value of the backup_path variable as you see fit, but make sure that it doesn't end with a slash, as this is added as necessary later in the script. Also make sure that it cannot contain spaces because spaces in the path, even if escaped with backslashes, may lead to the script calling rsync commands on partial paths (reading up to where the space is encountered) which is really not what you want. So check that the current_year and mount_point cannot contain spaces, and that any string literals you place around these variables to form the value of backup_path are free of spaces.
The mkdir command is called before each rsync command, to make sure that the directory structure needed to store each backup set definitely exists, as rsync will fail with an error if this isn't the case. Change the path after ${backup_path} from /var/www/ or /home/bob/ to match whichever path the corresponding rsync command will backup.
Next comes the rsync command, so replace this with your own carefully crafted rsync command. If you want to log the entire output of the rsync run, make sure to add 2>&1 to the end of the rsync command, and then pipe the output to the tee command. Give the tee command the path to a log file, and use the -a flag if you want tee to append to an existing log file rather than start a new one. Now all rsync output, including errors, will be output to the terminal and also written to the specified log file.
The above script contains two rsync commands, but your script can contain any number you like. Just bear in mind that you need to call the rsync command with sudo because flags such as --perms, --owner and --devices require the command to be run as super-user. This means that if one rsync command takes a long time to complete, you'll probably have to enter your password again by the time the script runs the next rsync command. This will become a nuisance if you have many rsync commands in one script, as the script may pause and wait for the password every time it reaches a new "sudo rsync" line. One way around this might be to run the script itself using the sudo command, which ought to mean you only need to enter your password once, though giving the entire script super-user privilege may introduce security risks so be wary if you take this route.
At the end of the script, the user is asked whether they want to unmount the backup volume. If the user enters y or yes then the backup volume will be unmounted. Any other input will leave the backup volume mounted.
rsync relies on file timestamps to quickly work out whether or not a file has changed since the last backup. This usually works fine, but if you have some directory or file whose timestamp does not get updated even when its content changes then rsync will very likely fail to copy recent changes, leaving your backup copy further and further out of date.
As an example, I use VeraCrypt to manage a tiny encrypted file in my home directory, to hold sensitive data. But by default VeraCrypt does not update the last-modified timestamp of the encrypted file (presumably to increase plausible deniability in countries with authoritarian regimes). Because the timestamp was not getting updated, I almost lost several months' worth of critical changes to text documents because rsync presumed that the encrypted file was exactly the same as it had been the last time I ran a backup. To disable this feature of VeraCrypt: go to "Settings", then the "Security" tab, and then untick (make empty) the box "Preserve modification timestamp of file containers". But also check that this has the desired effect, because there have been problems reported even with this box unticked.
Also bear this in mind with any other specialised file or directory whose last-modification timestamp may not be updated even when its content has changed.
If the timestamps are simply not going to work for your scenario, note that rsync also offers a --checksum option which tells rsync to use a 128-bit checksum, instead of the timestamp, to work out whether the file content has changed. The checksum mode will be considerably slower, and will involve considerably more IO (disk read) activity, so try to limit checksum mode to specific directories. See the follow-up article for an example of this targeted checksum approach.
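As a rough illustration only (the directory name is made up, and the follow-up article covers this properly), a targeted checksum run over a single directory of encrypted containers might look like this:
sudo rsync --archive --checksum --verbose --itemize-changes \
/home/bob/encrypted-containers/ ${backup_path}/home/bob/encrypted-containers/ \
2>&1 | tee -a /home/bob/rsync-output.txt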
Once you've run your rsync script and produced a backup of your files, they will be stored in a readily usable form on the backup volume. This means that restoring from backup is as simple as just copying everything from the most recent backup directory to your new working system disk. You can use your preferred method of copying everything from one place to another, but using the standard cp command is probably the easiest method. Whichever method you use, don't forget about hidden files, which are not always copied automatically by command line tools, and are not always visible by default when using graphic user interface tools.
Using a Bash console, the following cp command should copy all (including hidden) directories and files from the specified directory on the backup volume to the appropriate directory on the system disk.
cp --archive /media/Buffalo_backup/home/bob/. /home/bob/
Customise this command to suit your own file paths and read the man page for cp to check which flags suit you. (The man page for cp is not exactly rich with information, but read through it anyway and do some hunting around online for more information if you're not sure whether the --archive flag is what you need.)
The dot at the end of the source path is important: without the dot you will find that everything has been copied into a new directory on your system disk with the path /home/bob/bob which is very probably not what you want. Without the dot you'll probably also find that hidden files and directories don't get copied.
The --archive option tells the cp command to recurse into all sub-directories, and to preserve links and all file attributes. I have learnt the hard way that failing to preserve the timestamps will mean that the copies placed onto your system disk will all be set with the current date and time. Which might not bother you until you next run your rsync backup script and realise that every file on your system disk now appears to be newer than the backup copy, forcing rsync to copy every single file to the backup volume (and setting the timestamps of the files on the backup volume to the newer date). So even if you don't use the --archive flag, consider at least using the --preserve flag with a value of timestamps.
Once you've restored from backup, it's probably a good idea to rename the backup directory on the backup volume from "2014" to something like "2014 pre-rebuild". This is just so that if your new filesystem contains problems, your next backup won't overwrite the previous, trusted backup.
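For example, assuming the backup paths used in the script above (adjust to suit your own layout):
sudo mv '/media/Buffalo_backup/rsync-backup/2014' '/media/Buffalo_backup/rsync-backup/2014 pre-rebuild'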
Also remember that if your backup volume does not support symlinks, file permissions, owner and group information, etc, you will need to manually reconstruct this structural information on your new filesystem once you copy the files from your backup.
If you want to encrypt the backup device/volume, then see the follow-up article rsync backup to encrypted volume.