On Fri, May 28, 2010 at 10:48 AM, Eric Bollengier <eric.bolleng...@baculasystems.com> wrote:

First, thank you for the kind replies; this is helping me make sure I see
the big picture.

On Friday, May 28, 2010 at 16:42:01, Robert LeBlanc wrote:
> > On Fri, May 28, 2010 at 12:32 AM, Eric Bollengier <eric.bolleng...@baculasystems.com> wrote:
> > > Hello Robert,
> > > What would be the result if you did incremental backups instead of
> > > full backups?
> > > Imagine that 1% of the data changes per day; that gives something like
> > > total_size = 30GB + 30GB * 0.01 * nb_days
> > > (instead of 30GB * nb_days)
> > >
> > > I'm quite sure it would give a "compression" ratio of around 19:1 for 20 backups...
> > >
> > > This kind of comparison is the big argument of dedup companies: "do 20
> > > full backups and you will have a 20:1 dedup ratio." But do 19 incrementals
> > > + 1 full and that ratio falls to about 1:1... (It's not exactly true
> > > either, because you can save space when multiple systems have the same data.)
> >
> > The idea was, in some ways, to simulate a few things all at once. This kind
> > of test could show how multiple similar OSes could dedupe (with 20 Windows
> > machines, for example, you only have to store those bits once for any number
> > of machines), whereas with Bacula's incrementals you have to store the bits
> > once per machine
>
> In this particular case, you can use BaseJob file-level deduplication, which
> allows you to store only one version of each OS. (But I admit that if the
> system can do it automatically, that's better.)
>

I agree. I haven't looked into BaseJobs yet because they are not the easiest
thing to understand, and since I'm very pressed for time, I don't have much
time to commit to reading. I plan on understanding them, but when a system can
do it automatically and transparently, I like that a lot.


> > and then again when you do your next full each week or month.
>
> Why do you want to schedule a Full backup every week? With the Accurate
> option, you can adopt the "incremental forever" approach (a Differential can
> limit the number of incrementals needed for a restore).
>
> If it's to have multiple copies of a particular file (which I like to advise
> when using tapes), then since deduplication will collapse those copies into a
> single instance, I think the result is very similar.
>

We are using Accurate jobs on a few machines; however, I have not scheduled
the roll-ups yet, as I haven't had time to read enough of the manual. I need
to do it soon, since I have months of incrementals without any fulls in
between. I do like having multiple copies of my files on tape; on disk, not so
much. The reason is that I've had tapes go bad, whereas with disk I have a lot
of redundancy built in.
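
To put rough numbers on Eric's formula from earlier in the thread (the 30GB
size and the 1%-per-day change rate are his example figures, not measurements
from our environment), a quick Python sketch of the two strategies:

#!/usr/bin/env python
# Compare 20 retained full backups against 1 full + 20 incrementals,
# using total_size = 30GB + 30GB * 0.01 * nb_days from Eric's mail.
full_size = 30.0      # GB for one full backup (Eric's example figure)
change_rate = 0.01    # fraction of data changing per day (Eric's example)
days = 20             # number of retained backups

all_fulls = full_size * days
incr_forever = full_size + full_size * change_rate * days

print("20 fulls:          %.0f GB" % all_fulls)
print("1 full + 20 incr:  %.0f GB" % incr_forever)
print("effective ratio:   %.1f:1" % (all_fulls / incr_forever))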

> > It was also meant to show how much you could save when doing your fulls
> > each week or month; a similar effect would happen for the differentials
> > too. It wasn't meant to be all-inclusive, just to show some trends that I
> > was interested in.
>
> Yes, but comparing 20 full backups with 20 full copies under deduplication
> is like comparing apples and oranges... At least it should appear somewhere
> that you chose the worst case for Bacula and the best case for deduplication
> :-)
>

Please remember that the Bacula tape-format volume files were on a lessfs file
system, so the same amount of data was written by rsync and by Bacula, just in
different formats on lessfs. In the best-case scenario they should have had
the same dedupe rate. The idea was to see how both formats fared on lessfs.
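
For what it's worth, the way I would measure that is roughly the sketch below:
total up the logical size of everything written to the lessfs mount and compare
it against what the backing store actually consumes. The mount point and
backing-store path here are made up; use whatever lessfs.cfg points at.

#!/usr/bin/env python
# Rough sketch: compare the logical size of everything written to a lessfs
# mount against the space its backing store actually consumes.
import os

MOUNT = "/mnt/lessfs"          # hypothetical lessfs mount point
BACKING = "/var/lessfs/data"   # hypothetical lessfs backing store

def tree_size(path):
    # Sum of logical file sizes under path.
    total = 0
    for root, dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass
    return total

logical = tree_size(MOUNT)
physical = tree_size(BACKING)
if physical:
    print("logical %.1f GB, physical %.1f GB, dedupe %.1f:1"
          % (logical / 1e9, physical / 1e9, float(logical) / physical))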


> > In our environment, since everything is virtual, we don't save the OS
> > data, and we only try to save the minimum that we need; that doesn't work
> > for everyone, though.
>
> Yes, this is another very common way to do it, and I agree that sometimes
> you can't do that.
>
> It's also very practical to just rsync the whole disk and let LessFS do its
> job. If you want to browse the backup, it's just a directory. With Bacula,
> since incremental/full/differential backups are presented in a virtual tree,
> that's not needed.
>

Understandable. In a disaster recovery situation with Bacula, if the on-disk
format were a tree, you could browse to the latest backup of your catalog,
import it, and off you go. Right now, I have no clue which of the 100 tapes I
have holds the latest catalog backup; I would have to scan them all, and if
the backup spans tapes, I would have to figure out what order to scan them in
to recover the backup, which could take forever. Now that I've thought about
it, I think it's time for a separate pool for catalog backups, sigh.

> > I think in some ways each dedupe file system could work very well with
> > each file stored on its own instead of being in a stream. That way the
> > start of the file is always on a boundary that the deduplication file
> > system uses. I think you might be able to use sparse files for a stream
> > and always sparse up to the block alignment,
>
> I'm not very familiar with sparse files, but I'm pretty sure that the
> "sparse unit" is a block. So if a block is empty, fine, but if some bytes
> inside that block are used, it will take 4KB.
>

I'm not an expert on sparse files, so I'm not sure what the limitations are.
My experience is with VMs, where a sparse file is created that appears to have
all its space allocated but does not actually take that space on the fs. How
much "fast-forwarding" you can do in a sparse file, I'm not sure, but quite a
bit, as evidenced by its use with VMs. I'm thinking of the Bacula sparse file
format as being like a VM sparse disk. I guess you could put an FS inside the
sparse file, and that should handle alignment to a point, but it seems like a
lot of overhead just to encapsulate the data.
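
A quick, generic Linux demo of the apparent-versus-allocated behavior I mean
(nothing Bacula-specific, just Python and a temp file):

#!/usr/bin/env python
# Create a 1GB sparse file and compare its apparent size (st_size) with
# the space actually allocated on disk (st_blocks * 512).
import os

path = "/tmp/sparse_demo"
with open(path, "wb") as f:
    f.seek(1024 * 1024 * 1024 - 1)  # seek to just before the 1GB mark
    f.write(b"\0")                  # one real byte; the rest is a hole

st = os.stat(path)
print("apparent size: %d bytes" % st.st_size)
print("allocated:     %d bytes" % (st.st_blocks * 512))
os.remove(path)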

> > that would make the stream file look really large compared to what it
> > actually uses on a non-deduped file system. I still think that if Bacula
> > laid the data down in the same file structure as on the client, organized
> > by jobID, with some small Bacula files to hold permissions, etc., it would
> > be the most flexible approach for all dedupe file systems, because it
> > would present individual files like they are expecting.
>
> Yes, that is one way to do it, but we still have the problem of alignment
> and free space in blocks. If I remember well, LessFS uses LZO to compress
> data, so we can imagine that a 4KB block containing only 200 bytes should
> end up very small. This could be a very interesting test: just write X
> blocks of 200 (random) bytes each, and see if it takes X*4KB or
> ~ X*compress(200 bytes).
>
> It would also allow metadata to be stored in special blocks. So the basic
> modification would be to start every new file's data stream in a new block :)

I don't know the details, but maybe a lessfs guy could clarify this some.
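
If anyone wants to try the test Eric describes, something along these lines
should do it. The mount point is hypothetical, and you would want to flush or
unmount lessfs before measuring the backing store with du:

#!/usr/bin/env python
# Sketch of Eric's proposed test: write X small "blocks" of 200 random
# bytes each as separate files on a lessfs mount, then see whether the
# backing store grows by roughly X*4KB or only ~X*200 bytes.
import os

MOUNT = "/mnt/lessfs/blocktest"   # hypothetical directory on the lessfs mount
X = 10000                         # number of 200-byte blocks

if not os.path.isdir(MOUNT):
    os.makedirs(MOUNT)

for i in range(X):
    with open(os.path.join(MOUNT, "blk%06d" % i), "wb") as f:
        f.write(os.urandom(200))  # random data so nothing dedupes

print("wrote %d blocks, %d logical bytes (%d bytes if padded to 4KB)"
      % (X, X * 200, X * 4096))
# Then compare the growth of the lessfs data store (du on the backing
# directory) against those two numbers.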


> > > The compression on disk is better; on the network layer and the remote
> > > IO/disk system, it's another story. BackupPC is smarter on this part
> > > (but has problems with big sets of files).
> >
> > I'm not sure I understand exactly what you mean. I understand that
> > BackupPC can cause a file system to not mount because it exhausts the
> > number of hard links the fs can support.
>
> Yes, this is true (at least on ext3). What I'm saying is that with rsync to
> a new directory, you have to read the entire disk (30GB in your case) and
> transmit it over the network. With an incremental, you just read and
> transfer the modified data (1 to 10% of the 30GB).
>

> I'm not sure about BackupPC, but it can certainly avoid transferring files
> if they have not changed.
>

Yes, in the real world I would not rsync into a new directory; it was just to
parallel the full backups that Bacula was doing and see how well each method
would dedupe compared to the other.

> > Luckily, with a deduplication file system you don't have this problem,
> > because you just copy the bits and the fs does the work of finding the
> > duplicates. A dedupe fs can even store only a small part of a file (if
> > most of the file is duplicate and only a small part is unique), where
> > BackupPC would have to write the whole file.
>
> Yes, for sure. Do you have an idea of which kinds of files have only a few
> bytes that change over time? (Database files, C/C++ files, ...) For example,
> big OpenOffice files are compressed, so the data can change almost anywhere.
>

Mostly database-type files (logs especially), system log files, uncompressed
TIFF files (we have a lot of those), large DNA sequences, etc. Most other
files change a significant portion of their content when modified.

> > That's one of the reasons I asked whether it could be implemented. If
> > there is anything I know about OSS, it's that there are some amazing
> > people with an ability to think so far outside the box that these things
> > have not been able to stop the progress of OSS.
>
> One thing that can stop the progress of OSS is software patents...
> Fortunately, most of the Bacula code is written in Europe and the copyright
> is owned by FSF Europe, where software patents are not valid, but who knows
> what the software lobby can do...
>

I have recently been torn over a company that has some good innovations and
hardware, but whose political agenda is to send a take-down notice for any
little patent that is infringed (and not to everyone; they are targeting
certain companies). I like the hardware, but I don't like their stance on
patents when they have stolen their fair share in the past. It's a double
standard that really bugs me.

Robert LeBlanc
Life Sciences & Undergraduate Education Computer Support
Brigham Young University