Unix files have two parts we are concerned about -- the data itself, and the "link" -- what you might think of as the directory entry that points to that data. Just as a reference in a book's table of contents or index is not the content itself, a link/directory entry isn't the file. A key feature of Unix file systems is that one file can have multiple links pointing to it.
That is important to understand. A physical file can be accessed through multiple links, and thus through multiple file names or directory locations. Unix keeps track of how many links a file has, but it does not care which link was made first; there is no pecking order of links, all are equal. As long as at least one link to a file still exists, the file remains accessible. Once the last link has been deleted, the file's disk space is released. So yes, you can create a file, create three more links (four total), and remove any of the links in any order. It may look like there are four files, but the space is taken only once, and all the directory entries are "equal". A change made through any of these names is reflected in all of them. (This is sometimes obscured by applications that make a backup copy of a file -- a different file! -- make changes to that copy, then rename the copy over the original name; the original name now points to a new file, so it may not look like this is true.)
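A quick demonstration (the file names, inode number, owner, and dates here are invented for illustration):

$ echo "some data" > file1
$ ln file1 file2              # second hard link to the same data
$ ls -li file1 file2          # -i shows the inode number
523414 -rw-r--r--  2 nick  wheel  10 Aug  9 12:00 file1
523414 -rw-r--r--  2 nick  wheel  10 Aug  9 12:00 file2

Same inode, link count of 2 -- one file, two names. Remove either name and the data is still there under the other.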
A hard link can't cross file systems. Which...if you think about it, makes a lot of sense. The directory entry and the file it is pointing to must reside on the same file system.
The symbolic link (symlink) is a very special file that acts as a redirection indicator. It basically says, "You are looking at me, but you should actually look for your data at this other path". So a symlink points to another link, which points to the actual file. Or that other link could be another symlink. Yes, you can have an infinite loop of symlinks: A points to B, B points to A. All modern Unixes will detect this and report "too many levels of symbolic links" after a fixed number of indirections.
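To see that loop detection in action, in an empty directory:

$ ln -s a b      # b points to a
$ ln -s b a      # a points to b -- a loop
$ cat a
cat: a: Too many levels of symbolic links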
Yes, very much like the Windows shortcut, except that GENERALLY, everything happens under the covers of the operating system. Your application can find out whether something is a symlink rather than a "real" file, but generally, that is "Doing It Wrong". Your application should just be looking at files and directories; symlinks should be something the administrators work with. Symlinks can cross file systems -- often to great advantage.
For example, an application may want to put data in a particular directory. If the file system that directory exists in is out of space, the administrator might copy the existing directory to another file system (or a new file system), then replace the old directory with a symlink to the new one. The application should not notice this happened (unfortunately, some apps like to "help" with this, and usually screw it up when they do).
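A minimal sketch of that move (paths invented for the example; stop the application first):

$ cp -Rp /var/app/data /v/2/app-data    # copy to a file system with room
$ mv /var/app/data /var/app/data.orig   # keep the original until verified
$ ln -s /v/2/app-data /var/app/data     # the redirection
# ...verify everything works, then: rm -r /var/app/data.orig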
Now, if a file -- say, /etc/hosts -- hasn't changed since the previous backup, rather than copying the file over again, a new hard link is made to the same physical file as in the previous backup. So the files /bu/firewall1/2022-02-24/etc/hosts and /bu/firewall1/2022-02-25/etc/hosts take up only one "file" worth of storage. Creating an additional link to this file is very quick, and requires basically no network traffic once the file has been confirmed unchanged. For (typically) small files like /etc/hosts, this isn't a big win, but if you have large files that rarely change, this can be a huge benefit.
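Under the hood, that new hard link is nothing more exotic than what ln does. Conceptually (the backup tool does this for you, via rsync as described below):

$ ln /bu/firewall1/2022-02-24/etc/hosts \
     /bu/firewall1/2022-02-25/etc/hosts

Both days' trees now name the same physical file, and a single du run over both directories counts that file only once.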
Some types of files are better suited to this scheme than others. For example, mbox mail spools (all messages for one user in one file) will have to be copied in their entirety on every backup where there is a change, but maildir format mail spools (each message is its own file) will have only the NEW files copied over. Putting some numbers on it: if you have a 100MB mbox file that gets 100KB of new messages each day, you will end up with 100MB of new backup disk space usage each day because of this one file. But if you use maildir, most of that 100MB will just be hard linked; only the 100KB of new files will be copied over the network and use additional space in the new backup.
Note that rsync has an algorithm its developers are very proud of, where only the changed parts of a file are transferred over the wire. My experience with this has, unfortunately, been bad. I have found that in REAL WORLD cases, rsync will often spend more time and effort finding the changed and unchanged parts and sending over the deltas than it would have spent just sending over the whole file. Perhaps even worse, the time required for a backup could vary widely -- from minutes to hours -- which created problems when we thought we knew how long a backup would take, but lots of little changes were made in big files, and suddenly minutes turned to hours. For this reason, I have disabled the rsync delta algorithm and just use the "whole file" option. The results I have seen were not only better predictability, but also usually faster backups. Your results may vary. You are encouraged to experiment.
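The option in question is real: -W (--whole-file) tells rsync to skip the delta algorithm entirely. The paths and host name here are just an illustration:

$ rsync -a --whole-file /var/www/ bu-server:/bu/web/staging/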
The first time I saw this backup system was in a project called Dirvish. That project started by making hard links to all the data from your previous backup in your new backup directory, then rsync'ing from the system being backed up into the new backup directory. Great, but it was rendered a bit over-engineered when the rsync people added the --link-dest option, which basically moved most of the cool Dirvish code into rsync itself.
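With --link-dest, one day's run boils down to something like this (the host and directory names follow the /etc/hosts example above; the exact command line is a sketch, not my production script):

$ rsync -a --whole-file \
    --link-dest=/bu/firewall1/2022-02-24 \
    root@firewall1:/ /bu/firewall1/2022-02-25/

Files unchanged relative to the 2022-02-24 tree get hard-linked into the new tree; only new and changed files are actually transferred.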
There are a number of these kinds of projects out there, but I really think that, for the most part, this idea is simple enough that a good system administrator should be able to set it up with just a little guidance -- my goal here is to provide that guidance, based on running this kind of backup for going on 20 years now. You are welcome to use my code, but it is important that you understand it so you can adjust it to your needs. And my code may suck. I admit this.
My default is to have the backups appear to reside in /bu, but that is just a tiny partition or subdirectory with a bunch of symlinks that point into /v/1, /v/2, etc.
So, while the actual data is scattered over several volumes, the data can all be accessed from within the /bu/ directory.

$ ls -l /bu
total 0
lrwxr-xr-x  1 root  wheel  12 Aug  9 12:27 console -> /v/2/console
lrwxr-xr-x  1 root  wheel   8 Oct 20  2020 dbu -> /v/3/dbu
lrwxr-xr-x  1 root  wheel   9 Jul  4 14:20 dbu1 -> /v/1/dbu1
lrwxr-xr-x  1 root  wheel  11 Aug 29  2019 fluffy3 -> /v/1/fluffy
lrwxr-xr-x  1 root  wheel  17 Aug 29  2019 g2 -> /v/1/g2
lrwxr-xr-x  1 root  wheel   7 Aug 29  2019 gw -> /v/2/gw
lrwxr-xr-x  1 root  wheel  31 Apr  1  2020 gwold -> /v/1/gwold
lrwxr-xr-x  1 root  wheel   8 Aug 29  2019 hc1 -> /v/1/hc1
lrwxr-xr-x  1 root  wheel   9 Oct 30  2021 hc1archive -> /v/3/hc1archive
lrwxr-xr-x  1 root  wheel  22 Aug 17  2020 monitor -> /v/1/monitor
lrwxr-xr-x  1 root  wheel  10 Dec 19  2021 node1 -> /v/2/node1
lrwxr-xr-x  1 root  wheel  10 Dec 21  2021 node2 -> /v/1/node2
lrwxr-xr-x  1 root  wheel  10 Sep  4  2020 suzy2 -> /v/1/suzy2
lrwxr-xr-x  1 root  wheel  31 Aug  9 17:31 web.holland-consulting.net -> /v/2/web.holland-consulting.net
lrwxr-xr-x  1 root  wheel  11 Sep 14  2019 z-logs -> /var/z-logs
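Adding a new system to the scheme is just a directory on whichever volume has room, plus a symlink (names hypothetical):

$ mkdir /v/3/newhost
$ ln -s /v/3/newhost /bu/newhost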
Some people will argue, "It's just a backup". Except, if you lose your backups, you lose your historical archive. And if you lose your backup, your data is no longer protected. The only excuse for not having redundant storage on your IBS is if the alternative is "no backup at all". I've worked in places like that; I get it. (In fact, one of those places sent me a screenshot several years after I left of a system I had set up and they had forgotten about -- almost 3000 days of uptime on a machine I cobbled together out of spare parts, including a non-redundant hard disk and a single power supply.)
Page Copyright 2022, Nick Holland, Holland Consulting.