On 11:57, Kern Sibbald wrote: > Robert quit the project, so currently there is no one assigned to it. > > I would be *extremely* happy to see someone interested in this project.
I sure am interested. But I'm not sure how much time I will be able to spend on it. > If your offer is for algorithm help, please see Algorithms below. If > your offer above includes programming (i.e. C or C++ programmer), and > you are interested in working on it, please let me know (either off > list or if you wish copying the bacula-devel list) and we can discuss > the project. The theoretical background is what I'm interested (and probably skilled) most, but programming in C is not a problem either. I'm not that familiar with C++ though. > This project and the project to store only one copy of a file (Base project) > are closely related because they both require *much* more communication > between the Dir and the FD -- essentially the Dir must send the current state > as known in the catalog to the client, which can then determine which files > to backup. > > Algorithms: > This requires potentially sending a *lot* of data (i.e. millions of filenames > and attribute data), which will require hash coding the names for performance > reasons. Agreed. So the communication between sd and fd for an incemental backup would look like fd -> sd: here's the list of (hashes of) filenames that have changed. This list might also contain hashes of the contents of these files. sd -> fd: thanks, please send them, but omit the following ones as I already have them. sd -> fd: list of hashes to omit. fd -> sd: send contents of shortened list of files > Some years ago, I wrote hasing routines specifically for this, but they have > never been used yet, and so I am now looking at bringing them up to date -- > in particular adding a Bloom filter to improve performance (I am currently > researching Bloom filters). I just read the wikipedia page on bloom filters. Looks to me like a very promising concept. However, one obvious problem is that you'll have to regenerate the filter whenever the ratio n/m of number of hash bits per filename becomes so small that the false probability of (ln 2)^(n/m) becomes too large. > Where I could use a bit of advice is: > > Now: > - Reviewing my hash table code (particularly the hash function) > src/lib/htable.h src/lib/htable.c Will do. > - Proposing how to size a Bloom filter (n bits) and number of hash functions. Well, these values will have to depend on the number of files that are going to be stored in the database. If an upper bound M on this number is known in advance as well as a small number p, the probability with which we are willing to accept false positives, one can easily determine n and the number k of hash functions as n = M * ln (1 / p) / (ln 2)^2 k = ln 2 * n / M unless I errored with the math ;) > - Proposing what hash functions to use for the Bloom filter. I personally like SuperFastHash for its performance, see http://www.azillionmonkeys.com/qed/hash.html It's not cryptographically secure, but very fast. md5 would be another option, but sha1 is likely too expensive for no real gain. It should be possible to exchange the hash algorithm easily. In any case it's enough to choose only two hash functions h1 and h2, no matter what k is. One can then use the standard double hashing method to produce k hash functions via h(i, key) = (h1(key) + i * h2(key)) mod m, i = 0...k-1, where m is chosen properly (e.g. a power of two if h2 produces only odd numbers). > Since these two projects (de-duplication of files, tracking new and deleted > files) are quite hot topics lately, over the next week, I will write up a > sort of proposal for implementation outlining my general ideas for how to > implement them within the existing Bacula framework (i.e. without too many > modifications to the database, ...). I'm looking forward to this. Andre -- The only person who always got his work done by Friday was Robinson Crusoe
signature.asc
Description: Digital signature
------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/
_______________________________________________ Bacula-users mailing list Bacula-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/bacula-users