On Wednesday 05 January 2011 08:13:33 Silver Salonen wrote:
> On Tuesday 04 January 2011 21:40:54 Kern Sibbald wrote:
> > Hello,
> >
> > On Tuesday 04 January 2011 19:45:53 Radosław Korzeniewski wrote:
> > > 2011/1/1 Kern Sibbald <k...@sibbald.com>
> > >
> > > Hello Kern and others,
> > >
> > > > The first thing that one must do is specify what problem of
> > > > deduplication one is trying to resolve:
> > > >
> > > > 1. Deduplication by the Bacula Storage daemon
> > > >
> > > > 2. Deduplication in the Bacula Client (File daemon)
> > > >
> > > > 3. Deduplication by the underlying filesystem where the SD writes
> > > > data (e.g. ZFS).
> > >
> > > 4. Global deduplication performed on the File Daemon, but with the
> > > dictionary maintained on the Bacula Director/Storage Daemon:
> > > - the backup of a particular data block is not performed when the SD
> > >   already has such a data block, no matter which client is the
> > >   original owner of the block
> > > - reduces the data stored on the SD like the approaches in points 1
> > >   and 3, AND reduces network traffic like the approach in point 2
> > >
> > > Use case: a company has one production database (or VM image file)
> > > and multiple test/development environments, all with backup. In most
> > > cases the difference between all of those databases (VM images) is
> > > less than 1% of the data blocks, so during backup only 1% of the data
> > > blocks is backed up and sent over the network.
> >
> > Yes. I had considered this to be one of two options of item 2. The
> > dedup hashes are either kept on the FD or on the Director.
>
> I hope that by keeping hashes on the Director you actually mean keeping
> them on both?
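The global-dictionary scheme of option 4 can be illustrated with a minimal sketch (not Bacula code; the block size, `dedup_store` function, and in-memory dictionary are hypothetical stand-ins for the SD's real storage): each block is hashed, and only blocks whose hash is not yet known anywhere in the pool are stored, regardless of which client sent them.

```python
import hashlib

BLOCK_SIZE = 64 * 1024  # hypothetical block size; Bacula would choose its own


def dedup_store(data: bytes, dictionary: dict) -> int:
    """Store only blocks whose hash is not already in the shared dictionary.

    Returns the number of new blocks actually stored; duplicates cost
    nothing beyond the hash lookup.
    """
    stored = 0
    for off in range(0, len(data), BLOCK_SIZE):
        block = data[off:off + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in dictionary:
            dictionary[digest] = block  # in practice, written to an SD volume
            stored += 1
    return stored
```

In the use case above, backing up a test database that differs from production by 1% of its blocks would store (and, with the dictionary consulted before transfer, send) only that 1%.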
We are considering every possibility -- each solution has its own advantages
and disadvantages, so it is very hard to say that one way of doing this is
the correct or right way. For example, it is faster to deduplicate if the
hashes are stored on the client machine than if they are stored on a server
such as the Director, but not every client machine has enough disk space to
store them. Most estimates indicate that about 30% more disk space is
required to keep the hash codes. In addition, your deduplication ratio will
drop significantly (be very poor) if you only dedupe a single client machine
and do not use a deduplication "pool" of hashes from multiple machines.

Unless you run tests, which may vary from machine to machine, it is very
difficult to know which algorithm is best. One major factor is whether the
machine is connected to the server by a very slow 100Mb Internet connection
or a fast 10Gb LAN. We will probably start with something very simple and
add to it over time.

> The first deduplication should be done by the client (fast checks without
> network activity), and when the client does not find a match, the possibly
> new hashes are re-checked with the Director.

Kern

_______________________________________________
Bacula-devel mailing list
Bacula-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-devel
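The two-tier check Silver proposes (local first, then the Director) could look like the following sketch. It is an illustration only, assuming a hypothetical `backup_block` function and plain sets standing in for the FD's local hash cache and the Director's global index; the real protocol would involve a network round-trip for the second lookup.

```python
def backup_block(digest: str, local_cache: set, director_index: set) -> bool:
    """Decide whether a block with this hash must be sent over the network.

    First consult the FD's local cache (no network traffic); only on a
    miss ask the Director's global index. Either hit means the block is
    already stored somewhere in the pool.
    """
    if digest in local_cache:
        return False                # fast path: known locally, skip entirely
    if digest in director_index:    # stands in for a network round-trip
        local_cache.add(digest)     # remember the answer for next time
        return False
    # Genuinely new block: register it and send it.
    director_index.add(digest)
    local_cache.add(digest)
    return True
```

This layering gives the pool-wide dedup ratio of a global dictionary while keeping most lookups off the slow link, which matters on the 100Mb end of the connection-speed range mentioned above.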