On Tue, 2022-11-08 at 15:07 +0100, hede wrote:
> On 08.11.2022 05:31, hw wrote:
> > That still requires you to have enough disk space for at least two
> > full backups.
>
> Correct, if you always do full backups then the second run will consume
> the space of a full backup in the first place. (not fully correct with
> bees running -> *)

Does that work?  Does bees run as long as there's something to
deduplicate and stop only when there isn't?  I thought you start it once
the data is in place, not before.

> That would be the first thing I'd address. Even the simplest backup
> solutions (i.e. based on rsync) make use of destination rotation and
> only submit changes to the backup (-> incremental or differential
> backups). I never considered successive full backups as a backup
> "solution".

You can easily make changes to two full copies --- "make changes"
meaning that you only change what has changed since the last time you
made the backup.
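If by rotation you mean the usual rsync --link-dest scheme, I guess a
run would look roughly like this (untested sketch, paths and dates made
up):

    # each run gets its own dated directory; files unchanged since the
    # previous run become hardlinks instead of new copies
    rsync -a --delete \
        --link-dest=/backup/2022-11-07 \
        /home/ /backup/2022-11-08/

So the second "full" tree would only cost the space of the files that
actually changed.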
> For me only the first backup is a full backup, every other backup is
> incremental.

When you make a second full backup, that second copy is not
incremental.  It's a full backup.

> Regarding deduplication, I do see benefits when the user moves files
> from one directory to another, with partly changed files (my backup
> solution dedupes on a file basis via hardlinks only), and with system
> backups of several different machines.

But not with copies?

> I prefer file based backups, so my backup solution's deduplication
> abilities are really limited. But a good block based backup solution
> can handle all these cases by itself. Then no filesystem based
> deduplication is needed.

What difference does it make whether the deduplication is block based
or somehow file based (whatever that means)?

> If your problem is only backup related and you are flexible regarding
> your backup solution, then choosing a backup solution with a good
> deduplication feature is probably your best choice. The solution
> doesn't have to be complex. Even simple backup solutions like borg
> backup are fine here (borg: chunk based deduplication, even of parts
> of files, across several backups of several different machines). Even
> your criterion of not writing duplicate data in the first place is
> fulfilled here.

I'm flexible, but I distrust "backup solutions".

> (see borgbackup in the Debian repository; disclaimer: I do not have
> personal experience with borg as I'm using other solutions)
>
> > I wouldn't mind running it from time to time, though I don't know
> > that I would have a lot of duplicate data other than backups.  How
> > much space might I expect to gain from using bees, and how much
> > memory does it require to run?
>
> Bees should run as a service 24/7 and catch all written data right
> after it gets written. That's comparable to in-band deduplication even
> if it's out-of-band by definition. (*) This way, writing many
> duplicate files will potentially result in removing duplicates even
> before all the data has been written to disk.
>
> Therefore memory consumption is also like with in-band deduplication
> (ZFS...), which means you should reserve more than 1 GB RAM per 1 TB
> of data. But it's flexible: even less memory is usable, but then it
> cannot find all duplicates because the hash table for all the data
> doesn't fit into memory. (Nevertheless, even then deduplication is
> more efficient than expected: if it finds a duplicate block, it also
> checks the blocks around it, so for big files a single match in the
> hash table is enough to deduplicate the whole file.)

Sounds good.  Before I try it, I need to make a backup in case
something goes wrong.
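For reference (mostly as a note to myself), the borg workflow you
describe seems to boil down to something like this; untested, and the
repository path is made up:

    # one-time repository setup
    borg init --encryption=repokey /backup/borg-repo

    # each run only stores chunks not already in the repository,
    # which is where the deduplication across backups comes from
    borg create --stats /backup/borg-repo::'{hostname}-{now}' /home /etc

    # thin out old archives
    borg prune --keep-daily 7 --keep-weekly 4 /backup/borg-repo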
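And if I do end up trying bees, the setup appears to be roughly the
following, going by its documentation; untested, the UUID is made up,
and the hash table size (DB_SIZE) is what bounds the memory use you
mention:

    # /etc/bees/<filesystem-UUID>.conf, based on the shipped sample config
    UUID=11111111-2222-3333-4444-555555555555
    DB_SIZE=$((1024*1024*1024))   # 1 GiB hash table, roughly for 1 TB of data

    # start it for that filesystem via the systemd template unit
    systemctl enable --now beesd@11111111-2222-3333-4444-555555555555.service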