If you use any form of computer and care one bit about your sanity, then you probably have a backup strategy. Otherwise, if all hell breaks loose and your whole computer burns to ash, or the hard drive melts into a heap of metal and becomes an ugly door stop, you’ll likely be kind of angry, maybe slightly pissed, your pulse most definitely at 180, because you’ve lost all your data. I certainly would be, especially about all my pictures of the festivals and places I’ve been to.
(And maybe some family 😅)
But, to be honest, I’ve been a bit lazy about backups for some time now. I do have copies of all my important files, but that’s not a backup. It’s a copy. A backup lets you go back in time and get an older version of a file or folder, not just the most recent one that has been synced.
So why is it that I’m not as diligent as I should be? There are a few factors in that equation: laziness, for one; the knowledge that I do have at least one copy; the fact that I haven’t had any data loss so far; and stinginess. Why the latter? Until now, being a Windows user (not any more on my main machine, though), I was relying on Acronis True Image, a commercial backup software. However, the version that I own – 2014, I think – stopped being reliable in one of the past Windows 10 releases, and I simply don’t want to spend the money any more.
I’m not here to tell you that I have changed my mind on that. No. I’m, of course, coding my own solution. Why wouldn’t I? Everything is done multiple times in the Open Source community.
Now, I don’t need anything fancy, if I’m honest. I could even go for a simple copy of my user folder every week or so and call it a day. That’s a bit too simplistic though, and a huge waste of space. I want incremental backups. What I don’t want is a custom, proprietary file format or services running in the background. When I think about what I need in a backup, all I can come up with is a copy of all my data, stored in a compressed format to save disk space, with incremental updates of only the files that have changed since the last backup. In addition – or else it’s of no use – I’d like to be able to easily browse the backups and extract individual files. The ideal solution, of course, also lets me restore all data with a single command – the disaster recovery.
Doesn’t sound too difficult now, does it? I’m a programmer, I can do that. So here we go. In this piece I’ll walk you through my thought process on the basic concepts of the application and the technology I will use.
The most important decision is how to store the backed-up data. This is crucial to get right, or else it won’t scale with the amount of data, or it will defeat the purpose of being simple.
Note: The solution will be designed to fit my needs instead of being general purpose. It may not work for you.
The Backup Archive
As already mentioned, the major trait is the ease of use of the backup archive. I want to browse and access the data with the least amount of effort and required software. Therefore, I elect to go with ZIP containers. Windows Explorer has had ZIP support built in for several years now, macOS too, I think, and every Linux distro I would ever consider using comes with unarchiving tools pre-installed as well. This way I can just open the archive and extract data as I need to. It can’t be simpler than that.
Storing the Data
Looking back at Acronis True Image, it stores all data in one big file (which you can split based on size, but in concept it’s one huge contiguous blob), and every incremental update is its own file that only contains the changed elements. It is, however, still linked to the previous backup files, and they, too, are needed to open the increments. I’m not sure if I want to do it similarly. It would certainly make it easier and a lot faster to copy the archives around: huge files transfer much quicker across networks or other system buses than small files do. On the other hand, huge files are cumbersome to use, and I basically never transfer all my backups from one drive to the next unless I know it’s dying. And even then, I’d probably just create a new backup on the new drive.
Another option I’m pondering is the use of several ZIP archives, but only for the top-level folders. Everything in those folders would then go into one ZIP file, and in the end there’d be about four or five smaller ones instead of a single massive one. In my case all my data is located on a separate drive, grouped by type. I have a “Documents” folder, a “Pictures” folder, a “Music” folder and so forth. This would result in a “Documents.zip”, “Pictures.zip” etc., and all of them combined would comprise the complete backup. This would still have the benefit of fewer, bigger files to copy around, while each file stays smaller for browsing. I can imagine an unarchiving tool being quicker to read one of those archives to get to a spreadsheet if it doesn’t also contain all my thousands of pictures, music files and huge videos.
Lastly, there is the option of zipping every single file, basically recreating the original folder layout, only with compressed data in it. This would be the most convenient way to browse because it resembles the source data the closest. Navigate the folders, find the file, uncompress it, done. It does, however, create a lot of files, which means more metadata for the filesystem and hard drive to manage. It’s also very likely that this will result in a bigger total backup archive size because of the overhead of each individual ZIP container. There is another upside to this solution as well, though: parallelization. Since the output isn’t just one huge file, the processing can be done in parallel, potentially speeding up the process (depending on the source and target drive speed). Only access to the database will have to be synchronized, but that’s certainly still a lot faster than waiting for a single zip operation to finish.
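To illustrate the per-file idea, here is a minimal sketch of parallel compression using a plain thread pool from the Java standard library. The class and method names are mine, not a final design, and real code would need proper error handling plus the database bookkeeping discussed here:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ParallelZipper {

    // Mirror the source folder layout under targetRoot and write one
    // .zip per file, compressing on all available cores.
    public static void zipAll(Path sourceRoot, Path targetRoot)
            throws IOException, InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try (var files = Files.walk(sourceRoot)) {
            files.filter(Files::isRegularFile).forEach(file -> pool.submit(() -> {
                Path relative = sourceRoot.relativize(file);
                Path zipFile = targetRoot.resolve(relative + ".zip");
                try {
                    Files.createDirectories(zipFile.getParent());
                    try (ZipOutputStream out =
                            new ZipOutputStream(Files.newOutputStream(zipFile))) {
                        // ZIP entries use forward slashes regardless of OS.
                        out.putNextEntry(new ZipEntry(
                                relative.toString().replace('\\', '/')));
                        Files.copy(file, out);
                        out.closeEntry();
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```

The pool takes care of the fan-out; only the (future) database writes would still need synchronization.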
One thing to consider, too, is restore time. Having just one big archive to manage is of course much simpler than thousands of individual ones, especially if the restore is done manually (e.g. if you are in a hurry and have no way of running the backup app yet). It is very likely also the most efficient way, because only one file must be opened and the data can be read from it sequentially, resulting in much faster restore times. I do find the individual files more intriguing though, if only for the case of finding a single one. If a full restore is required, then I have lost time anyway, and since it’s just for personal use and not for a business, I don’t care that much about it. In past years I haven’t had any use for either scenario – single-file restore or disaster recovery. I do recall some instances, many eons ago, where I searched my archives for random files. So, individual files it will be. Just for the sake of convenience.
The tricky part is maintaining the individual archives and their contents. A proper ZIP file provides all the necessary information, given that the original file and folder structure is retained in the archive as well: it has the file name, its size, the location and a checksum. It would probably not be the most performant basis for implementing an incremental backup, though. There, every file has to be checked for changes, which could be done by comparing it with the previously archived version – but I don’t want to load old archives. If possible, to save filesystem accesses, the application shouldn’t touch previously archived files at all.
That’s why I need some sort of lookup table or, to call it by its proper name, a database. I don’t want to jump too far ahead, but I have to go a bit into technology to explain this. The simplest thing would be a proper database, something like SQLite. However, I plan to start implementing in Java, and the SQLite wrappers I tried in the past did not work for me. I know embedded databases in pure Java exist, but those would make it difficult to reimplement the whole thing in C++, for example, in case I experience performance issues. I want the backup archives to be cross-platform and easily usable with any technology.
This is why I will go with a JSON file as my database. It also allows me to easily inspect the backups and debug my application because it is human readable. The only thing required will be a text editor and every platform has one.
I will go with one file that describes the complete backup archive and one JSON file for each individual item of the archive. An item is either a full or an incremental backup. The main database file will be located at the root of the storage location and will be named “archives.json”. It’ll contain the names of all the individual archives, their type and, in the case of incremental backups, the previous backup they are based on. This way the backup application will know the order of things and can determine the latest version of a backed-up file. Alongside this database will be the folders of the individual backups.
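Sketched out, an “archives.json” could look something like this (the field names are placeholders, not a final schema):

```json
{
  "archives": [
    {
      "name": "full_2019-05-01_10-00-00",
      "type": "full",
      "basedOn": null
    },
    {
      "name": "inc_2019-05-08_10-00-00",
      "type": "incremental",
      "basedOn": "full_2019-05-01_10-00-00"
    }
  ]
}
```

Following the `basedOn` links from the newest increment back to the full backup gives the application the order it needs.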
The JSON file describing a single archive will be named “contents.json”. It’ll be placed alongside a folder named “contents” that will contain the backed-up data. I chose to use another folder here in order to avoid any potential naming collision between the backup application’s “contents.json” and a similarly named file that is being backed up. The file will contain a list of all files and the most important attributes that help decide whether a file has changed or not.
- File name
- File size
- Last changed timestamp
- Hash of the data
- Hash of the individual path components
- Array of path components
The first items on that list are pretty simple to understand. The application will use the size and the last-changed timestamp to quickly determine whether the contents of a file could have changed. Only when those differ from the current file’s attributes will a hash be calculated to be certain. In order to keep the JSON structure simple, a file’s path components (including the file name) are hashed and used as a lookup key. I won’t use the raw path because that would tie it to an operating system. Omitting the path separator from the calculation works on all systems and allows a backup to be used across multiple OSes – provided that the filesystems use the same, or at least compatible, encodings (ignoring that NTFS excludes a lot of characters that are supported on APFS or ext4 and the like). I will also store an array of a file’s path components so that I can determine the location when I look at a database entry.
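Putting those attributes to work, the change check and the separator-independent path key could look roughly like this. The `FileRecord` fields and all helper names are assumptions on my part, and SHA-256 is just one reasonable hash choice:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ChangeDetector {

    // What contents.json stores per file (simplified).
    public static class FileRecord {
        long size;
        long lastModifiedMillis;
        String dataHash;
    }

    // Fast path first: identical size and timestamp means "unchanged".
    // Only on a mismatch is the (expensive) content hash computed.
    public static boolean hasChanged(FileRecord stored, Path current) throws IOException {
        BasicFileAttributes attrs = Files.readAttributes(current, BasicFileAttributes.class);
        if (attrs.size() == stored.size
                && attrs.lastModifiedTime().toMillis() == stored.lastModifiedMillis) {
            return false;
        }
        return !toHex(sha256().digest(Files.readAllBytes(current))).equals(stored.dataHash);
    }

    // Lookup key: hash the path components individually, without any
    // OS-specific separator, so the same key is produced on Windows,
    // macOS and Linux.
    public static String pathKey(String... components) {
        MessageDigest digest = sha256();
        for (String component : components) {
            digest.update(component.getBytes(StandardCharsets.UTF_8));
            digest.update((byte) 0); // delimiter: ["ab","c"] must differ from ["a","bc"]
        }
        return toHex(digest.digest());
    }

    private static MessageDigest sha256() {
        try {
            return MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    private static String toHex(byte[] bytes) {
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

Note the zero-byte delimiter between components: without it, different paths could collapse to the same key.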
In summary, here’s a depiction of how it will probably look.
|- archives.json
|- full_YYYY-MM-DD_HH-MM-SS   (backup-app folder)
|  |- contents.json
|  |- contents   (backup-app folder)
|     |- file1.zip
|     |- file2.zip
|     |- folder1
|     |- ...
|- inc_YYYY-MM-DD_HH-MM-SS   (backup-app folder)
|  |- contents.json
|  |- contents   (backup-app folder)
|     ...
Changes in the design may still happen, based on technical obstacles I may encounter once I’m implementing it or because it turns out that this approach is not viable at all.
As already hinted in a previous paragraph, I plan to implement this application in Java. For one, I need the practice, and secondly, it allows me to run the same program on any operating system without recompilation. Java 11 will be the development platform and the Spring framework the basis for the application structure – for lack of a better word. But I’m sure you know what I’m getting at.
If it turns out in the future that Java is a bottleneck, I may consider reimplementing this in C++, depending on how bad it is. Right now, I can’t imagine there being a big issue since most of the work will be IO-bound anyway.
That’s a story for another time. This piece is long enough already. Besides, I don’t have anything yet 😉 So stay tuned. It’ll be Open Source and hosted on GitHub, for anybody to follow, check, complain, laugh… (I dare you 💪🏻)