Re: [Bacula-devel] Open source Bacula plugin for deduplication

webmaster via Bacula-devel Thu, 03 Mar 2022 06:25:00 -0800

HelloI was reading this and had a thought about deduplication, The zfs 
filesytem has inbuilt deduplication (and compression) supportso you could when 
creating a new backup volume create a virtual zfs pool/filesystem Write all 
backuped files to the zfs poolWhich automatically does deduplication You then 
write the virtual zfs file system to your bacula volumeThough Not sure how well 
this would work in practice, but seems like a "simple" way to implement basic 
deduplication Christopher tyermanSent from my Galaxy
-------- Original message --------From: egoitz--- via Bacula-devel 
<bacula-devel@lists.sourceforge.net> Date: 03/03/2022  12:36  (GMT+00:00) To: 
Radosław Korzeniewski <rados...@korzeniewski.net> Cc: 
bacula-devel@lists.sourceforge.net Subject: Re: [Bacula-devel] Open source 
Bacula plugin for deduplication 
Hello Radoslaw,


I will answer below in green color for instance... just for discerning better 
what both have spoke... :)
 

El 2022-03-03 12:46, Radosław Korzeniewski escribió:




ATENCION: Este correo se ha enviado desde fuera de la organización. No pinche 
en los enlaces ni abra los adjuntos a no ser que reconozca el remitente y sepa 
que el contenido es seguro.





Hello,


czw., 3 mar 2022 o 12:09 egoitz--- via Bacula-devel 
<bacula-devel@lists.sourceforge.net> napisał(a):


Good morning,

I know Bacula enterprise provides deduplicacion plugins, but sadly we can't 
afford it. No problem, we will try to create an open source deduplication 
plugin for bacula file daemon. I would use rdiff (part of librsync) for delta 
patching and signature generation.


What signatures rdiff is using? 
 
 
 
Basically here is documented exactly... 
https://librsync.github.io/page_formats.html
 
It's for being able to generate delta patches, without the need of having old 
and new version of a file... and so... for avoid doubling the space used or 
required for backing up...
 
 


I would love to create a Bacula plugin for deduplicating content at fd level. 
This way, even if the backup is sent crypted by fd to sd, the deduplication 
could be done obtaining the best results as the deduplication takes place when 
the files are not crypted yet.


Yes, for proper encryption you would always get different bits for the same 
data block making deduplication totally useless. :) 
 
I think that too.. yes... 


The deduplication, would only be applied to files, let's say larger than 10GB.


???
 
 
I designed Bacula deduplication to handle blocks (files) larger than 1k because 
indexing overhead for such small blocks was too high. The larger the block you 
use the lower chance to get a good deduplication ratio. So it is a trade-off - 
small blocks == good deduplication ratio but higher indexing overhead; larger 
blocks == weak deduplication ratio but lower indexing overhead. So it was 
handling block levels from 1K up to 64k (the default bacula block size, but 
could be extended to any size).
 
 
I understand what you say but the problem we are facing is the following one. 
Imagine, a machine with a SQL Server and 150GB of databases. Our problem is to 
have to incrementally copy that each day. We don't really mind copying 5GB of 
"wasted" space per day... even when non necessary (just for understanding).... 
but obviously 100GB per day or 200GB... are different terms....
 
I was thinking in applying this deduplication only for important files 
really.... hope you can understand me now.. :)
 
 


If you don't mind, I would like to share with you my ideas, in order to at 
least know, "this all" is a possible way.

My idea is basically :

- When doing a backup : 

++ Check the backup level we are running. I suppose that asking bVarLevel to 
getBaculaValue()


Deduplication should be totally transparent to the backup level. You want to 
deduplicate data, especially for largest full level backups, right?
 
 
Well... really... the problem for us is what I told just before so... We don't 
really mind copying a big file once a month, but we want to avoid copying it in 
incremental backups (at least the whole of the file...). Apart, when restoring 
(and not in virtual backups), you restore a full plus incrementals. So this 
way, we would restore the full ORIGINAL_FILE plus the patches and we would 
apply them to ORIGINAL_FILE at the end of the restoring job.
 


++ In startBackupFile() I suppose it gives me file size info (or if at least 
gives me the name and I'll do an stat() in some manner), get the file size.


No. The standard "Bacula command Plugin API" expects that a plugin will return 
a file stat info to backup. 
 
Ok, no problem... if I get in some manner filename and path I could always do a 
stat()


+++If it's a full level and bigger than 10GB, obtain the file signature and 
finally store that new (previously non existing) signature (written in a file 
with a known nomenclature based on ORIGINAL_FILE's name), plus the whole 
ORIGINAL_FILE (the one we have generated the signature from) in Bacula tapes. 
Should I need to say to Bacula, to re-read the directory for being able to 
backup generated file signatures?. They weren't until know we have generated a 
file that contains ORIGINAL_FILE signature.


Why do you call it a "deduplication plugin"? Above is a functionality described 
by the Delta plugin which supports so-called "block level incremental". Which 
is _NOT_ deduplication. This "block level incremental" tries to backup blocks 
inside a single file which changed between backups. It does not deduplicate the 
backup stream in any sense. For two identical files which change in the same 
way Delta plugin will do data backup twice leaving data duplication in place.
 
Yes mate, you are right. What I needed is to avoid uploading to backup each day 
big files with very little changes. Not to avoid writting two equal files in 
the backup.
 
In the case of the Delta plugin which uses the exact procedure and library 
which you describe above you should use an "Option Plugin API".
 
I see. I'll read about it...


+++If it's an inc level and a previous signature of ORIGINAL_FILE file exists 
(I would know because they will have a known nomenclature based on 
ORIGINAL_FILE's name), with the previous signature plus the new state of the 
file (the new file state I mean), create a patch. Later obtain again, the file 
signature in the new status. Finally store that new signature plus the patch in 
Bacula tapes. Finally return a bRC_Skip of the ORIGINAL_FILE (because we are 
going to copy a delta patch and a signature). If I return a bRC_Skip to here... 
would the fd, skip this file, but see the signatures and delta patches 
generated before retuning the bRC_Skip?. Or should I ask to fd, in some manner, 
to re-read the directory?.


It sounds like an exact step by step description of the Delta plugin.
 
So, now I understand why you want to handle files > 10G only. :) 
 
Thats it :) :)



As you would assume in the incremental backups, I'm not storing the filename as 
its in the filesystem. It should more or less the following way :

In a full level backup :
++ BEFORE THE BACKUP  :
BACKED SERVER'S FS <----> BACULA "VIRTUAL TAPE" CONTENT
ORIGINAL_FILE            <---> 
++ AFTER THE BACKUP : 
BACKED SERVER'S FS <----> BACULA "VIRTUAL TAPE" CONTENT
ORIGINAL_FILE + SIGNATURE FILE           <--->  ORIGINAL FILE + SIGNATURE FILE

In the next incremental level backup :
++ BEFORE THE BACKUP  :
BACKED SERVER'S FS <----> BACULA "VIRTUAL TAPE" CONTENT
NEW_STATE_ORIGINAL_FILE + SIGNATURE FILE GENERATED THE LAST FULL DAY  <--->  
FROM THE FULL BACKUP(ORIGINAL FILE + SIGNATURE FILE)
++ AFTER THE BACKUP :  
BACKED SERVER'S FS <----> BACULA "VIRTUAL TAPE" CONTENT
NEW_STATE_ORIGINAL_FILE + SIGNATURE FILE OF NEW_STATE_ORIGINAL_FILE <--->  FROM 
THE FULL BACKUP(ORIGINAL FILE + SIGNATURE FILE) + PATCH FILE + SIGNATURE FILE 
OF NEW_STATE_ORIGINAL_FILE

- When restoring a backup :
If the restored files nomenclature is  (for example...) 
ORIGINAL_FILE-SIGNATURE- OR ORIGINAL_FILE-PATCH that would mean (I assume I 
could see in the filename to be restored in startRestoreFile() because it has 
accesible the filename), we have backed up deltas of ORIGINAL_FILE in the 
incremental backups.
So, let's write to a plain text file with this path inside it, in order for 
later, in a post restore job (or even bEventEndBackupJob event of the api?), to 
apply the patches in that path, to the ORIGINAL_FILE obtainted from the own 
name of the patch files. Finally after patching job done, remove signature 
files and patch files. Obviously leaving the last status of ORIGINAL_FILE at 
the restored date.

So, at this point, I would be very very thankful :) :) :) if some experienced 
developer, could give me some idea or if can see something is wrong or should 
achieved in some other manner or with other plugin api functions.....


IMVHO, the Delta plugin should be best handled with "Options Plugin API" (as it 
is with current Delta Plugin) and not the "Command Plugin API" as most of the 
backup functionality will be provided by Bacula itself.
 
I will read about this too....
 
best regards
 
BTW. I think a Delta plugin available in BEE is fairly cheap compared to full 
deduplication options. 
 
I have asked price to Rob Morrison :) :)
 
Cheers!!!

-- 
Radosław korzeniewskirados...@korzeniewski.net

_______________________________________________
Bacula-devel mailing list
Bacula-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-devel

Re: [Bacula-devel] Open source Bacula plugin for deduplication

Reply via email to