Special High Intensity Training: How it works

Key to understanding IBS is understanding the concept of a Unix link. There are two types of links in Unix, the hard link and the symbolic link. The magic of IBS is based on the hard link, but a real-world IBS system will probably make use of both, so I will review both here.

The Hard Link

If while reading this, you think, "That is like a Windows shortcut!", no, you are missing it. Read it again, or search for "unix hard vs symbolic links".

Unix files have two parts we are concerned about -- the data itself, and the "link" -- what you might think of as the "directory entry" that points to the data of the file itself. Just as a reference in a book's table of contents or index is not the data itself, a link/directory entry isn't the file. A key feature of Unix file systems is that one file can have multiple links pointing to it.

That is important to understand. A physical file can be accessed through multiple links, and thus, multiple file names or directory locations. While Unix keeps track of how many links a file has, it does not care which link was made first; there is no pecking order of links, all are equal. As long as at least one link to a file still exists, it remains accessible. Once the last link has been deleted, the file's disk space is released. So yes, you can create a file, create three more links (four total), and remove any of the links in any order. It may look like there are four files, but the space is only taken once, and all the directory entries are "equal". A change made through any of these links is reflected in all (though this is sometimes complicated by the fact that an app may make a backup of a file and make changes to that copy (a different file!), then change the original link to point to the new file, so ... it may not look like this is true).
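The behavior described above is easy to try for yourself. This is a minimal sketch in a scratch directory; the file names a, b, c and d are made up for the demonstration, and the link count is read from the second column of "ls -l" because that works the same on the BSDs and Linux.

```shell
#!/bin/sh
# Work in a scratch directory so nothing real is touched.
demo=$(mktemp -d)
cd "$demo" || exit 1

echo "hello" > a          # create the file: one link
ln a b                    # three more hard links: four total
ln a c
ln a d

# The second column of "ls -l" is the link count.
links=$(ls -l a | awk '{print $2}')
echo "link count: $links"

# All four names share one inode, i.e. one physical file.
inodes=$(ls -i a b c d | awk '{print $1}' | sort -u | wc -l | tr -d ' ')
echo "distinct inodes: $inodes"

# Appending through one name is visible through every other name.
echo "world" >> c
via_d=$(cat d)

# Remove the "original" name; the data survives via the remaining links.
rm a
links_after=$(ls -l b | awk '{print $2}')
echo "link count after rm a: $links_after"
```

Note that the append uses ">>", which modifies the file in place. An editor that writes a new file and renames it over the old name would break the link, which is the complication mentioned above.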

A hard link can't cross file systems. Which...if you think about it, makes a lot of sense. The directory entry and the file it is pointing to must reside on the same file system.

Symbolic (Soft) Links

This is where you can say, "That's like a Windows shortcut!" without being wrong.

The symbolic link (symlink) is a very special file that acts as a redirection indicator. It basically says, "You are looking at me, but you should actually look for your data at this other path". So, a symlink points to another link, which points to the actual file. Or that other link could be another symlink. Yes, you can have an infinite loop of symlinks; A points to B, B points to A. All modern Unixes will detect this and, after following some number of indirections, give up and report an error about too many levels of symbolic links.
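A small sketch of the same ideas, again in a scratch directory with made-up names (target, pointer, pointer2, A, B). It also shows one difference from hard links that follows from the description above: since a symlink only stores a path, it is left dangling when the thing it points to is removed.

```shell
#!/bin/sh
demo=$(mktemp -d)
cd "$demo" || exit 1

echo "real data" > target
ln -s target pointer       # pointer just stores the path "target"
ln -s pointer pointer2     # a symlink to a symlink works too

via_chain=$(cat pointer2)  # the OS follows the chain for us
echo "read through two symlinks: $via_chain"

# A loop: A points to B, B points to A. Opening either one fails.
ln -s B A
ln -s A B
if cat A 2>/dev/null; then loop_ok=yes; else loop_ok=no; fi
echo "could read through the loop: $loop_ok"

# Remove the target and the symlink dangles.
rm target
if cat pointer 2>/dev/null; then dangling_ok=yes; else dangling_ok=no; fi
echo "could read dangling symlink: $dangling_ok"
```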

Yes, very much like the Windows shortcut. Except that GENERALLY, everything happens under the cover of the operating system. Your application can find out if something is a symlink rather than a "real" file, but generally, that is "Doing It Wrong". Your application should just be looking at files and directories, and symlinks should be something the administrators work with. Symlinks can cross file systems -- often to great advantage.

For example, an application may want to put data in a particular directory. If the file system that directory exists in is out of space, the administrator might copy the existing directory to another file system (or a new file system), then replace the old directory with a symlink to the new one. The application should not notice this happened (unfortunately, some apps like to "help" with this, and usually screw it up when they do).
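The relocation described above might look something like this. It is only a sketch: the directories "small" and "big" stand in for the full and the roomy file system, "appdata" is a hypothetical application directory, and everything lives under a scratch directory so it is safe to run. On a real system you would stop the application first.

```shell
#!/bin/sh
demo=$(mktemp -d)
mkdir -p "$demo/small/appdata" "$demo/big"   # stand-ins for two file systems
echo "payload" > "$demo/small/appdata/file"

# 1. Copy the directory, preserving permissions and times.
cp -Rp "$demo/small/appdata" "$demo/big/appdata"

# 2. Move the original aside rather than deleting it right away.
mv "$demo/small/appdata" "$demo/small/appdata.old"

# 3. Put a symlink where the directory used to be.
ln -s "$demo/big/appdata" "$demo/small/appdata"

# The application keeps using the old path and never notices.
seen=$(cat "$demo/small/appdata/file")
echo "read via old path: $seen"

# 4. Once everything checks out, reclaim the space.
rm -rf "$demo/small/appdata.old"
```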

Back to IBS

Each backup is done to a separate directory. A typical file path will be in the format /buroot/machinename/date, for example, /bu/firewall1/2022-02-24. The entire file tree (or at least, the desired parts) is copied from the source machine and rooted in that target directory. So, following the example above, the file /etc/hosts on the backup system will be in /bu/firewall1/2022-02-24/etc/hosts.

Now, if the file /etc/hosts hasn't changed since the previous backup, rather than copying the file over, a new hard link is made to the same physical file as the previous backup. So -- the files /bu/firewall1/2022-02-24/etc/hosts and /bu/firewall1/2022-02-25/etc/hosts take up only one "file" worth of storage. Creating an additional link to this file is very quick, and requires basically no network traffic once the file has been confirmed to have not changed. For (typically) small files like /etc/hosts, this isn't a big win, but if you have large files that rarely change, this can be a huge benefit.
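To make the idea concrete, here is a hand-rolled sketch of the link-or-copy decision for two files. This is not how IBS or rsync actually does it; in particular, "cmp" is used here as a simple stand-in for the "has it changed?" check, and the directory layout is built under a scratch directory rather than a real /bu. The resolv.conf file and its contents are invented for the example.

```shell
#!/bin/sh
demo=$(mktemp -d)
src="$demo/src"                  # stand-in for the machine being backed up
bu="$demo/bu/firewall1"          # stand-in for /bu/firewall1
mkdir -p "$src/etc" "$bu/2022-02-24/etc" "$bu/2022-02-25/etc"
echo "127.0.0.1 localhost" > "$src/etc/hosts"
echo "nameserver 192.0.2.1" > "$src/etc/resolv.conf"

# First backup: nothing to link against, so everything is copied.
cp -p "$src/etc/hosts" "$src/etc/resolv.conf" "$bu/2022-02-24/etc/"

# One file changes before the next backup; the other does not.
echo "nameserver 192.0.2.2" > "$src/etc/resolv.conf"

# Second backup: link what is unchanged, copy what is new.
for f in etc/hosts etc/resolv.conf; do
    if cmp -s "$src/$f" "$bu/2022-02-24/$f"; then
        ln "$bu/2022-02-24/$f" "$bu/2022-02-25/$f"    # no new data stored
    else
        cp -p "$src/$f" "$bu/2022-02-25/$f"           # new physical file
    fi
done

hosts_links=$(ls -l "$bu/2022-02-25/etc/hosts" | awk '{print $2}')
resolv_links=$(ls -l "$bu/2022-02-25/etc/resolv.conf" | awk '{print $2}')
echo "hosts link count: $hosts_links, resolv.conf link count: $resolv_links"
```

The unchanged hosts file ends up with two links (one per backup directory) and one copy of the data, while the changed resolv.conf is a separate file in each backup.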

Some types of files are better suited to this than others. For example, mbox mail spools (all messages for one user in one file) will have to be entirely copied on every backup where there is a change, but maildir format mail spools (each message is its own file) will have only the NEW files copied over. Putting some numbers on it -- if you have a 100MB mbox file that gets 100KB of new messages each day, you will end up with a 100MB growth in backup disk space usage each day because of this file. But if you use maildir, most of the 100MB will be just hard linked; only the 100KB of new files will be copied over the network and use additional space in the new backup.
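Extending that arithmetic over a month makes the difference obvious. The 30-day period is my assumption for illustration, and the mbox figure ignores the slow growth of the file itself, so treat the results as rough.

```shell
#!/bin/sh
days=30               # assumed number of daily backups after the first
mbox_mb=100           # size of the mbox file, from the example above
new_kb_per_day=100    # new mail per day, from the example above

# mbox: the whole file is copied again every day it changes.
mbox_growth_mb=$((mbox_mb * days))

# maildir: only the new message files take new space.
maildir_growth_kb=$((new_kb_per_day * days))

echo "mbox:    about ${mbox_growth_mb} MB of new backup space"     # 3000 MB
echo "maildir: about ${maildir_growth_kb} KB of new backup space"  # 3000 KB
```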

Note that rsync has an algorithm they are very proud of where only the changed parts of a file are transferred over the wire. My experience with this has, unfortunately, been bad. I have found in REAL WORLD cases, rsync will often spend far more time and effort finding changes and non-changes and sending over the deltas than it would take to just send the whole file. Perhaps even worse, the time required for a backup could vary widely -- from minutes to hours, which created problems when we thought we knew how long a backup would take, but lots of little changes were made in big files, and suddenly minutes turned to hours. For this reason, I have disabled the rsync delta protocol and just use the "whole file" option. The results I have seen were not only better predictability, but also usually faster backups. Your results may vary. You are encouraged to experiment.

The first time I saw this backup system was in a project called Dirvish. This project started by making hardlinks of all the data from your previous backup to your new backup, then rsync'ing over from the system being backed up to the new backup directory. Great, but it was rendered a bit over-engineered when the rsync people added the --link-dest option, which basically moved most of the cool Dirvish code into rsync itself.
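A minimal sketch of what two consecutive backups with --link-dest could look like. This is not the IBS script itself; the source tree and backup root are built in a scratch directory, the dates and machine name just follow the earlier example, and the demo is skipped if rsync is not installed. The --whole-file option is included because of the experience described above; note that rsync already defaults to whole-file transfers when both ends are local paths, so it only makes a difference over the network.

```shell
#!/bin/sh
demo=$(mktemp -d)
src="$demo/src"                  # stand-in for the machine being backed up
bu="$demo/bu/firewall1"          # stand-in for /bu/firewall1
mkdir -p "$src/etc" "$bu"
echo "127.0.0.1 localhost" > "$src/etc/hosts"

if command -v rsync >/dev/null 2>&1; then
    # Day 1: nothing to link against, so everything is copied.
    rsync -a --whole-file "$src/" "$bu/2022-02-24/"

    # Day 2: files unchanged since 2022-02-24 become hard links into
    # that backup; only changed or new files are actually transferred.
    rsync -a --whole-file --link-dest="$bu/2022-02-24" \
        "$src/" "$bu/2022-02-25/"

    hosts_links=$(ls -l "$bu/2022-02-25/etc/hosts" | awk '{print $2}')
    echo "hosts link count: $hosts_links"
else
    echo "rsync not installed; skipping demo"
fi
```

The --link-dest path is given as an absolute path here. A relative path is interpreted relative to the destination directory, which is an easy thing to trip over.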

There are a number of these kinds of projects out there, but I really think that, for the most part, this idea is simple enough that a good system administrator should be able to set it up with just a little guidance. My goal here is to provide that guidance, based on running this kind of backup for close to 20 years. You are welcome to use my code, but it is important you understand it so you can adjust it to your needs. And my code may suck. I admit this.

Implementation

IBS, as I distribute it, assumes a level of indirection between where the backups appear to reside (and as far as IBS is concerned, where they run) and where they physically live.

My default is to have the backups appear to reside in /bu, but that is just a tiny partition or subdirectory with a bunch of symlinks that point into /v/1, /v/2, etc.

$ ls -l /bu
total 0
lrwxr-xr-x  1 root  wheel  12 Aug  9 12:27 console -> /v/2/console
lrwxr-xr-x  1 root  wheel   8 Oct 20  2020 dbu -> /v/3/dbu
lrwxr-xr-x  1 root  wheel   9 Jul  4 14:20 dbu1 -> /v/1/dbu1
lrwxr-xr-x  1 root  wheel  11 Aug 29  2019 fluffy3 -> /v/1/fluffy
lrwxr-xr-x  1 root  wheel  17 Aug 29  2019 g2 -> /v/1/g2
lrwxr-xr-x  1 root  wheel   7 Aug 29  2019 gw -> /v/2/gw
lrwxr-xr-x  1 root  wheel  31 Apr  1  2020 gwold -> /v/1/gwold
lrwxr-xr-x  1 root  wheel   8 Aug 29  2019 hc1 -> /v/1/hc1
lrwxr-xr-x  1 root  wheel   9 Oct 30  2021 hc1archive -> /v/3/hc1archive
lrwxr-xr-x  1 root  wheel  22 Aug 17  2020 monitor -> /v/1/monitor
lrwxr-xr-x  1 root  wheel  10 Dec 19  2021 node1 -> /v/2/node1
lrwxr-xr-x  1 root  wheel  10 Dec 21  2021 node2 -> /v/1/node2
lrwxr-xr-x  1 root  wheel  10 Sep  4  2020 suzy2 -> /v/1/suzy2
lrwxr-xr-x  1 root  wheel  31 Aug  9 17:31 web.holland-consulting.net -> /v/2/web.holland-consulting.net
lrwxr-xr-x  1 root  wheel  11 Sep 14  2019 z-logs -> /var/z-logs

So, while the actual data is scattered over several volumes, the data can all be accessed from within the /bu/ directory.
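Setting up that kind of layout takes only a couple of commands per machine. The sketch below builds it under a scratch directory instead of the real /bu and /v, and the "place" helper is a name I made up for the example, not part of IBS. Since hard links can't cross file systems, all the dated backups for one machine need to stay together on one volume, which is why the symlink is per machine.

```shell
#!/bin/sh
root=$(mktemp -d)                # stand-in for / so nothing real is touched
mkdir -p "$root/bu" "$root/v/1" "$root/v/2"

# place MACHINE VOLUME: create the machine's directory on the chosen
# volume and make it reachable under bu/ (hypothetical helper).
place() {
    mkdir -p "$root/v/$2/$1"
    ln -s "$root/v/$2/$1" "$root/bu/$1"
}

place gw 2
place hc1 1

# A backup run only ever needs to know about the bu/ path...
mkdir "$root/bu/gw/2022-02-24"

# ...but the data physically lands on volume 2.
ls -l "$root/bu"
ls "$root/v/2/gw"
```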


 
 


Page Copyright 2022, Nick Holland, Holland Consulting.