HelloI was reading this and had a thought about deduplication, The zfs filesytem has inbuilt deduplication (and compression) supportso you could when creating a new backup volume create a virtual zfs pool/filesystem Write all backuped files to the zfs poolWhich automatically does deduplication You then write the virtual zfs file system to your bacula volumeThough Not sure how well this would work in practice, but seems like a "simple" way to implement basic deduplication Christopher tyermanSent from my Galaxy -------- Original message --------From: egoitz--- via Bacula-devel <bacula-devel@lists.sourceforge.net> Date: 03/03/2022 12:36 (GMT+00:00) To: Radosław Korzeniewski <rados...@korzeniewski.net> Cc: bacula-devel@lists.sourceforge.net Subject: Re: [Bacula-devel] Open source Bacula plugin for deduplication Hello Radoslaw,
I will answer below in green color for instance... just for discerning better what both have spoke... :) El 2022-03-03 12:46, Radosław Korzeniewski escribió: ATENCION: Este correo se ha enviado desde fuera de la organización. No pinche en los enlaces ni abra los adjuntos a no ser que reconozca el remitente y sepa que el contenido es seguro. Hello, czw., 3 mar 2022 o 12:09 egoitz--- via Bacula-devel <bacula-devel@lists.sourceforge.net> napisał(a): Good morning, I know Bacula enterprise provides deduplicacion plugins, but sadly we can't afford it. No problem, we will try to create an open source deduplication plugin for bacula file daemon. I would use rdiff (part of librsync) for delta patching and signature generation. What signatures rdiff is using? Basically here is documented exactly... https://librsync.github.io/page_formats.html It's for being able to generate delta patches, without the need of having old and new version of a file... and so... for avoid doubling the space used or required for backing up... I would love to create a Bacula plugin for deduplicating content at fd level. This way, even if the backup is sent crypted by fd to sd, the deduplication could be done obtaining the best results as the deduplication takes place when the files are not crypted yet. Yes, for proper encryption you would always get different bits for the same data block making deduplication totally useless. :) I think that too.. yes... The deduplication, would only be applied to files, let's say larger than 10GB. ??? I designed Bacula deduplication to handle blocks (files) larger than 1k because indexing overhead for such small blocks was too high. The larger the block you use the lower chance to get a good deduplication ratio. So it is a trade-off - small blocks == good deduplication ratio but higher indexing overhead; larger blocks == weak deduplication ratio but lower indexing overhead. So it was handling block levels from 1K up to 64k (the default bacula block size, but could be extended to any size). I understand what you say but the problem we are facing is the following one. Imagine, a machine with a SQL Server and 150GB of databases. Our problem is to have to incrementally copy that each day. We don't really mind copying 5GB of "wasted" space per day... even when non necessary (just for understanding).... but obviously 100GB per day or 200GB... are different terms.... I was thinking in applying this deduplication only for important files really.... hope you can understand me now.. :) If you don't mind, I would like to share with you my ideas, in order to at least know, "this all" is a possible way. My idea is basically : - When doing a backup : ++ Check the backup level we are running. I suppose that asking bVarLevel to getBaculaValue() Deduplication should be totally transparent to the backup level. You want to deduplicate data, especially for largest full level backups, right? Well... really... the problem for us is what I told just before so... We don't really mind copying a big file once a month, but we want to avoid copying it in incremental backups (at least the whole of the file...). Apart, when restoring (and not in virtual backups), you restore a full plus incrementals. So this way, we would restore the full ORIGINAL_FILE plus the patches and we would apply them to ORIGINAL_FILE at the end of the restoring job. ++ In startBackupFile() I suppose it gives me file size info (or if at least gives me the name and I'll do an stat() in some manner), get the file size. No. The standard "Bacula command Plugin API" expects that a plugin will return a file stat info to backup. Ok, no problem... if I get in some manner filename and path I could always do a stat() +++If it's a full level and bigger than 10GB, obtain the file signature and finally store that new (previously non existing) signature (written in a file with a known nomenclature based on ORIGINAL_FILE's name), plus the whole ORIGINAL_FILE (the one we have generated the signature from) in Bacula tapes. Should I need to say to Bacula, to re-read the directory for being able to backup generated file signatures?. They weren't until know we have generated a file that contains ORIGINAL_FILE signature. Why do you call it a "deduplication plugin"? Above is a functionality described by the Delta plugin which supports so-called "block level incremental". Which is _NOT_ deduplication. This "block level incremental" tries to backup blocks inside a single file which changed between backups. It does not deduplicate the backup stream in any sense. For two identical files which change in the same way Delta plugin will do data backup twice leaving data duplication in place. Yes mate, you are right. What I needed is to avoid uploading to backup each day big files with very little changes. Not to avoid writting two equal files in the backup. In the case of the Delta plugin which uses the exact procedure and library which you describe above you should use an "Option Plugin API". I see. I'll read about it... +++If it's an inc level and a previous signature of ORIGINAL_FILE file exists (I would know because they will have a known nomenclature based on ORIGINAL_FILE's name), with the previous signature plus the new state of the file (the new file state I mean), create a patch. Later obtain again, the file signature in the new status. Finally store that new signature plus the patch in Bacula tapes. Finally return a bRC_Skip of the ORIGINAL_FILE (because we are going to copy a delta patch and a signature). If I return a bRC_Skip to here... would the fd, skip this file, but see the signatures and delta patches generated before retuning the bRC_Skip?. Or should I ask to fd, in some manner, to re-read the directory?. It sounds like an exact step by step description of the Delta plugin. So, now I understand why you want to handle files > 10G only. :) Thats it :) :) As you would assume in the incremental backups, I'm not storing the filename as its in the filesystem. It should more or less the following way : In a full level backup : ++ BEFORE THE BACKUP : BACKED SERVER'S FS <----> BACULA "VIRTUAL TAPE" CONTENT ORIGINAL_FILE <---> ++ AFTER THE BACKUP : BACKED SERVER'S FS <----> BACULA "VIRTUAL TAPE" CONTENT ORIGINAL_FILE + SIGNATURE FILE <---> ORIGINAL FILE + SIGNATURE FILE In the next incremental level backup : ++ BEFORE THE BACKUP : BACKED SERVER'S FS <----> BACULA "VIRTUAL TAPE" CONTENT NEW_STATE_ORIGINAL_FILE + SIGNATURE FILE GENERATED THE LAST FULL DAY <---> FROM THE FULL BACKUP(ORIGINAL FILE + SIGNATURE FILE) ++ AFTER THE BACKUP : BACKED SERVER'S FS <----> BACULA "VIRTUAL TAPE" CONTENT NEW_STATE_ORIGINAL_FILE + SIGNATURE FILE OF NEW_STATE_ORIGINAL_FILE <---> FROM THE FULL BACKUP(ORIGINAL FILE + SIGNATURE FILE) + PATCH FILE + SIGNATURE FILE OF NEW_STATE_ORIGINAL_FILE - When restoring a backup : If the restored files nomenclature is (for example...) ORIGINAL_FILE-SIGNATURE- OR ORIGINAL_FILE-PATCH that would mean (I assume I could see in the filename to be restored in startRestoreFile() because it has accesible the filename), we have backed up deltas of ORIGINAL_FILE in the incremental backups. So, let's write to a plain text file with this path inside it, in order for later, in a post restore job (or even bEventEndBackupJob event of the api?), to apply the patches in that path, to the ORIGINAL_FILE obtainted from the own name of the patch files. Finally after patching job done, remove signature files and patch files. Obviously leaving the last status of ORIGINAL_FILE at the restored date. So, at this point, I would be very very thankful :) :) :) if some experienced developer, could give me some idea or if can see something is wrong or should achieved in some other manner or with other plugin api functions..... IMVHO, the Delta plugin should be best handled with "Options Plugin API" (as it is with current Delta Plugin) and not the "Command Plugin API" as most of the backup functionality will be provided by Bacula itself. I will read about this too.... best regards BTW. I think a Delta plugin available in BEE is fairly cheap compared to full deduplication options. I have asked price to Rob Morrison :) :) Cheers!!! -- Radosław korzeniewskirados...@korzeniewski.net
_______________________________________________ Bacula-devel mailing list Bacula-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/bacula-devel