On Wednesday 05 January 2011 08:13:33 Silver Salonen wrote:
> On Tuesday 04 January 2011 21:40:54 Kern Sibbald wrote:
> > Hello,
> >
> > On Tuesday 04 January 2011 19:45:53 Radosław Korzeniewski wrote:
> > > 2011/1/1 Kern Sibbald <k...@sibbald.com>
> > >
> > > Hello Kern and others,
> > >
> > > > The first thing that one must do is specify what problem of
> > > > deduplication one is trying to resolve:
> > > >
> > > > 1. Deduplication by the Bacula Storage daemon
> > > >
> > > > 2. Deduplication in the Bacula Client (File daemon)
> > > >
> > > > 3. Deduplication by the underlying filesystem where the SD writes
> > > > data (e.g. ZFS).
> > >
> > > 4. Global deduplication performed on the File Daemon, but with the
> > > dictionary maintained on the Bacula Director/Storage Daemon:
> > > - the backup of a particular data block is not performed when the SD
> > >   already has such a data block, no matter which client is the
> > >   original owner of the block
> > > - reduces the data stored on the SD like the approaches in points 1
> > >   and 3, AND reduces network traffic like the approach in point 2
> > >
> > > Use case: a company has one production database (or VM image file)
> > > and multiple test/development environments, all with backup. In most
> > > cases the difference between all of those databases (VM images) is
> > > less than 1% of the data blocks, so during backup only 1% of the data
> > > blocks is backed up and sent over the network.
> >
> > Yes. I had considered this to be one of two options of item 2. The
> > dedup hashes are either kept on the FD or on the Director.
>
> I hope that by keeping hashes on the Director you actually mean keeping
> them on both?
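The global-dictionary scheme of option 4 can be illustrated with a minimal sketch (not Bacula code; the block size, `dedup_store` function, and in-memory dictionary are hypothetical stand-ins for the SD's real storage): each block is hashed, and only blocks whose hash is not yet known anywhere in the pool are stored, regardless of which client sent them.

```python
import hashlib

BLOCK_SIZE = 64 * 1024  # hypothetical block size; Bacula would choose its own


def dedup_store(data: bytes, dictionary: dict) -> int:
    """Store only blocks whose hash is not already in the shared dictionary.

    Returns the number of new blocks actually stored; duplicates cost
    nothing beyond the hash lookup.
    """
    stored = 0
    for off in range(0, len(data), BLOCK_SIZE):
        block = data[off:off + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in dictionary:
            dictionary[digest] = block  # in practice, written to an SD volume
            stored += 1
    return stored
```

In the use case above, backing up a test database that differs from production by 1% of its blocks would store (and, with the dictionary consulted before transfer, send) only that 1%.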
We are considering every possibility -- each solution has its own advantages
and disadvantages, so it is very hard to say that one way of doing this is
the correct or right way. For example, it is faster to deduplicate if the
hashes are stored on the client machine than if they are stored on a server
such as the Director, but not every client machine has enough disk space to
store them. Most estimates indicate that about 30% more disk space is
required to keep the hash codes. In addition, your deduplication ratio will
drop significantly (be very poor) if you only dedupe a single client machine
and do not use a deduplication "pool" of hashes from multiple machines.

Unless you run tests, which may vary from machine to machine, it is very
difficult to know which algorithm is best. One major factor is whether the
machine is connected to the server by a very slow 100Mb Internet connection
or a fast 10Gb LAN. We will probably start with something very simple and
add to it over time.

> The first deduplication should be done by the client (fast checks without
> network activity), and when the client does not find a match, the possibly
> new hashes are re-checked with the Director.

Kern

_______________________________________________
Bacula-devel mailing list
Bacula-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-devel
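The two-tier check Silver proposes (local first, then the Director) could look like the following sketch. It is an illustration only, assuming a hypothetical `backup_block` function and plain sets standing in for the FD's local hash cache and the Director's global index; the real protocol would involve a network round-trip for the second lookup.

```python
def backup_block(digest: str, local_cache: set, director_index: set) -> bool:
    """Decide whether a block with this hash must be sent over the network.

    First consult the FD's local cache (no network traffic); only on a
    miss ask the Director's global index. Either hit means the block is
    already stored somewhere in the pool.
    """
    if digest in local_cache:
        return False                # fast path: known locally, skip entirely
    if digest in director_index:    # stands in for a network round-trip
        local_cache.add(digest)     # remember the answer for next time
        return False
    # Genuinely new block: register it and send it.
    director_index.add(digest)
    local_cache.add(digest)
    return True
```

This layering gives the pool-wide dedup ratio of a global dictionary while keeping most lookups off the slow link, which matters on the 100Mb end of the connection-speed range mentioned above.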