On 14:45, Kern Sibbald wrote:
> > - keys are restricted to strings. it would be easy to extend this
> > to arbitrary data by adding a "size" field.
>
> Yes, good idea, though I don't immediately see a need to handle anything
> other than strings within Bacula.

Well, one could hash the _contents_ of the files as well to detect renames.
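Supporting arbitrary binary keys would mostly come down to storing an
explicit length next to the key bytes and comparing with memcmp() instead of
strcmp(). A minimal sketch of the idea (the names and layout below are made
up for illustration, they are not taken from htable.c):

   #include <stdint.h>
   #include <string.h>

   struct bin_key {
      const void *data;     /* key bytes, not necessarily NUL terminated */
      uint32_t    len;      /* the "size" field: number of key bytes */
   };

   /* equality compares the lengths first, then the bytes */
   static bool key_equal(const struct bin_key *a, const struct bin_key *b)
   {
      return a->len == b->len && memcmp(a->data, b->data, a->len) == 0;
   }

   /* hash over len bytes instead of up to a terminating NUL */
   static uint32_t key_hash(const struct bin_key *k)
   {
      const uint8_t *p = (const uint8_t *)k->data;
      uint32_t h = 0;
      for (uint32_t i = 0; i < k->len; i++) {
         h = (h << 5) + h + p[i];      /* h * 33 + byte, shift-and-add */
      }
      return h;
   }

A file-content digest for the rename detection mentioned above would then
just be a bin_key whose len is the digest size.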
> >    hlink *walkptr;                   /* table walk pointer */
> >
> > This is only used in first() and next(). IMHO the code would be
> > more readable if this were local to these two methods and
> > next() would take a pointer to the predecessor.
>
> I think what you are suggesting is possible, and I will take a look at it.
> However, it leads to several problems:
> 1. walkptr and walk_index are then no longer treated the same, and IMO that
> will make it harder to read the code. I.e. there are 2 pieces of information
> that would have to be returned and then passed into next().

Yes, that's right, so a struct containing these two pieces of information
would do the trick. An additional benefit is that this would make the code
thread safe.

> 2. There will be additional code necessary to adjust the pointer on entry,
> which I currently avoid by keeping walkptr in the class.

I don't quite follow. Which adjustments are you seeing here?

BTW: The next() method only works as expected if there are no inserts in
between. I think this isn't an obvious fact and should be documented.

While thinking a bit more about it, it would probably be a good thing to
replace first() and next() by a new function for_each_key() that takes a
pointer to a function to be called for each entry of the hash table (a rough
sketch follows below). That way, none of the internal details needs to be
exposed and walkptr/walk_index could still be local.

> 3. It complicates the "user" code in that it must pass the old pointer to the
> next() routine and if it is symmetric also walk_index.

That's of course true. And it raises the question of what to do if the "user"
passes outdated information to next().
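To make the for_each_key() idea a bit more concrete, here is a rough sketch
of the callback-style iteration on a toy chained table (deliberately not
htable; the names and layout are invented for illustration):

   #include <stdio.h>

   struct node {
      const char  *key;
      struct node *next;                 /* chain within one bucket */
   };

   struct toy_table {
      struct node **buckets;             /* array of bucket heads */
      int nbuckets;

      /* call cb(item, arg) once per entry; the walk state stays local */
      void for_each_key(void (*cb)(struct node *item, void *arg), void *arg)
      {
         for (int i = 0; i < nbuckets; i++) {
            for (struct node *p = buckets[i]; p; p = p->next) {
               cb(p, arg);
            }
         }
      }
   };

   /* example callback: print every key */
   static void print_cb(struct node *item, void *arg)
   {
      (void)arg;
      printf("%s\n", item->key);
   }

   /* usage:  tbl.for_each_key(print_cb, NULL);  */

Since the callback is handed each entry directly, nothing about buckets or
chain pointers leaks to the caller, and the "no inserts during a walk" rule
only has to be documented in one place.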
> > This is just shifting and multiplication with the bytes in key. I
> > would expect this hash to have quite some unnecessary collisions
> > when used on strings, though I'm by no means an expert in this area.
>
> If I am not mistaken, this is a *very* common technique.

Yes, it is; the question is whether this simple hashing method performs
better than a more sophisticated function (which burns more CPU cycles for
computing the hash but might save time afterwards because the keys are
better distributed). We'll definitely need some tests with real-world data
here.

> > This is a constant, known at compile time, so make it a preprocessor
> > define using offsetof.
>
> Uh, what is "This is a constant"? I don't see any constant. loffset depends
> on how one calls the routines. Rather than using a C++ template, which
> duplicates the code for each different structure that contains a hash link,
> Bacula computes the offset of the link in the structure, and thus with the
> same source code can handle any number of different user structures that have
> hash links. I have done the same for alist (allocated list) and dlist
> (doubly linked lists). The downside of this technique is that the higher
> level code must cast because it passes void pointers to the structures.
>
> It is a bit of a tricky point, so I'm not sure the above is clear.

Thanks for the explanation, I think I understand now; please correct me if
I'm wrong. The user is expected to define a structure containing at least two
members: a pointer to an hlink structure and a pointer to the data to be
hashed. And yes, you're right, loffset is not a constant if the htable code
aims to work with arbitrary such structures.

But for _each_ such structure it's a constant, so I think one could get rid
of the casts by using some preprocessor magic and offsetof(). For example, if
the user struct is

   struct my_struct {
      struct my_item *item;
      struct hlink   *link;
      ...
   };

   struct my_struct *my_array = malloc(42 * sizeof(*my_array));

the user would call

   HASH_INIT(my_struct, link, item, 42);

where HASH_INIT would be

   #define HASH_INIT(type, link_member, item_member, size) \
      hash_init(offsetof(type, link_member) - offsetof(type, item_member), size)

and hash_init could be simplified to

   void htable::init(int offset, int tsize)
   {
      ...
      loffset = offset;
      ...
   }

No casts, and type safe.

> > More importantly, I think it is not optimal
> > to store that many keys in the table because that means we'll have
> > on average four hits per bucket. I'd rather choose max_items _smaller_
> > than the number of buckets. For instance the git code, which uses hash
> > tables extensively, grows its table whenever it is filled to 66%.
>
> I'd like to see some tests on this. I'm not convinced that the % full is a
> good criterion because you could then have one bucket that has 100 items in
> it which could be catastrophic.

Well, that percentage in git is just the number of items divided by the size
of the array (git uses open addressing).

> > Open addressing might be better than chaining to deal with
> > hash collisions. In particular if items are never (or rarely)
> > removed from the table.
>
> Open addressing has the big advantage that it keeps entries together which
> improves CPU speed since everything is in the cache, but unless I am missing
> something, open addressing is extremely expensive in terms of memory
> requirements since the table has to always be the maximum size. This could
> be reasonable if we know how many items are going to be added, but if we do
> not, I'm not sure.

I don't think it's necessarily that expensive. For instance, if one has a
hash table with space for 2 * N objects, the size will be doubled at the time
the Nth object is going to be stored. If hash collisions occur, the first
free entry in the table is used; there are better methods for selecting a
free slot. So yes, there is some waste in memory, but depending on the
situation, it might be affordable.
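For completeness, here is a toy sketch of the doubling scheme described
above, using linear probing (again, the names are invented and this is not a
drop-in proposal for htable; keys are plain strings and are never deleted):

   #include <stdlib.h>

   struct oa_table {
      const char **slots;       /* NULL means the slot is free */
      size_t size;              /* number of slots, a power of two */
      size_t used;              /* number of occupied slots */
   };

   /* the size passed in must be a nonzero power of two */
   static void oa_init(struct oa_table *t, size_t size)
   {
      t->size = size;
      t->used = 0;
      t->slots = (const char **)calloc(size, sizeof(*t->slots));
   }

   static size_t oa_hash(const char *key)
   {
      size_t h = 0;
      while (*key) {
         h = (h << 5) + h + (unsigned char)*key++;    /* h * 33 + byte */
      }
      return h;
   }

   static void oa_insert(struct oa_table *t, const char *key);

   /* double the slot array and re-insert every entry */
   static void oa_grow(struct oa_table *t)
   {
      const char **old = t->slots;
      size_t old_size = t->size;
      t->size *= 2;
      t->used = 0;
      t->slots = (const char **)calloc(t->size, sizeof(*t->slots));
      for (size_t i = 0; i < old_size; i++) {
         if (old[i]) {
            oa_insert(t, old[i]);
         }
      }
      free(old);
   }

   /* duplicate keys are not handled; this is only meant to show the idea */
   static void oa_insert(struct oa_table *t, const char *key)
   {
      if (2 * (t->used + 1) > t->size) {   /* keep the table at most half full */
         oa_grow(t);
      }
      size_t i = oa_hash(key) & (t->size - 1);
      while (t->slots[i] != NULL) {        /* linear probing: next free slot */
         i = (i + 1) & (t->size - 1);
      }
      t->slots[i] = key;
      t->used++;
   }

So the memory overhead is bounded by roughly a factor of two, at the price of
re-inserting all entries whenever the table grows.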
> I would like to see some real performance studies before changing.

Me too. I'll see if I can code up something. Probably not before next week,
though.

> What always amuses me when I look back at code like this is that I wrote it 4
> years ago to be able to implement the base file project (de-duplication) and
> the project to track files added/removed from the system, but I've been so
> busy with other priority projects that I have not had time to get back and
> actually use these routines.

LOL

> The same happened for my red/black tree (src/lib/rblist.c/h), except in that
> case, I have now actually put the code to use in the restore code giving a
> 513 times speed up in version 2.1.x compared to 2.0.x for 800,000 files.

Yeah, rbtrees are great. I use them in one of my pet projects too. But I took
the easy route and copied the rbtree implementation of the Linux kernel ;)

Regards
Andre
--
The only person who always got his work done by Friday was Robinson Crusoe