Hello,

Thanks for your comments.  Please see my notes below ...

On Monday 11 June 2007 10:24, Andre Noll wrote:
> On 15:35, Andre Noll wrote:
> 
> > > - Reviewing my hash table code (particularly the hash function)  
> > > src/lib/htable.h src/lib/htable.c
> > 
> > Will do.
> 
> First some general remarks:
> 
>       - check the return value of malloc()

Not necessary, malloc() is #defined to be bmalloc(), which aborts if it is out 
of memory.
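
For reference, the pattern is roughly this (a sketch with illustrative names -- the real bmalloc() lives in our smartall code):

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Sketch of an abort-on-failure allocator in the style of bmalloc().
// Illustrative only; the real Bacula routine is in the smartall code.
void *bmalloc_sketch(size_t size)
{
   void *buf = malloc(size);
   if (!buf) {
      fprintf(stderr, "Out of memory requesting %zu bytes.\n", size);
      abort();                       /* never return NULL to the caller */
   }
   return buf;
}
```

Because the wrapper aborts instead of returning NULL, callers need no return-value checks.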

>       - keys are restricted to strings. it would be easy to extend this
>         to arbitrary data by adding a "size" field.

Yes, good idea, though I don't immediately see a need to handle anything other 
than strings within Bacula.

>       - IMHO "%p" is preferred for printing pointer variables.

Yes, but it was not portable.  We now have our own Bacula printf code which is 
portable and supports %p, so I've been converting the old stuff when I am in 
the file.  Clearly this is a good file to attack.

> 
> --------------------------------------- htable.h
> 
> 
> > struct hlink {
> >    void *next;                        /* next hash item */
> 
> why not struct hlink *next?

Good question -- I think you are right.  I'll try it and see if anything 
complains ...

> 
> >    char *key;                         /* key this item */
> >    uint32_t hash;                     /* hash for this key */
> > };
> > 
> > class htable : public SMARTALLOC {
> >    hlink **table;                     /* hash table */
> >    int loffset;                       /* link offset in item */
> 
> see below.
> 
> >    uint32_t num_items;                /* current number of items */
> >    uint32_t max_items;                /* maximum items before growing */
> 
> see below.
> 
> >    uint32_t buckets;                  /* size of hash table */
> >    uint32_t hash;                     /* temp storage */
> >    uint32_t index;                    /* temp storage */
> >    uint32_t mask;                     /* "remainder" mask */
> >    uint32_t rshift;                   /* amount to shift down */
> >    hlink *walkptr;                    /* table walk pointer */
> 
> This is only used in first() and next(). IMHO the code would be
> more readable if this were local to these two methods and
> next() would take a pointer to the predecessor.

I think what you are suggesting is possible, and I will take a look at it. 
However, it leads to several problems:
1. walkptr and walk_index are then no longer treated the same, and IMO that 
will make it harder to read the code.  I.e. there are 2 pieces of information 
that would have to be returned and then passed into next().
2. There will be additional code necessary to adjust the pointer on entry, 
which I currently avoid by keeping walkptr in the class.
3. It complicates the "user" code in that it must pass the old pointer to the 
next() routine and if it is symmetric also walk_index.

> 
> >    uint32_t walk_index;               /* table walk index */
> 
> same here.

As with the above, that will complicate the problem by returning it to the 
higher level program.  The "user" program really shouldn't need to know 
anything about walkptr or walk_index as they are simply the means to traverse 
a hash list in linear order without using recursion (I have never seen such 
code anywhere), so IMO, returning them to the higher level routine breaks the 
encapsulation of the code.
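
To illustrate, here is a minimal standalone sketch (illustrative, not the actual htable code) of why walkptr and walk_index have to travel together during a linear walk:

```cpp
#include <cstdint>
#include <vector>

// Walking a chained hash table linearly needs two pieces of state:
// the current chain pointer and the current bucket index. Keeping
// both as members (as htable does) hides them from the caller.
struct Node { int value; Node *next; };

struct Walker {
   std::vector<Node*> &table;
   Node *walkptr = nullptr;          /* current position in a chain */
   uint32_t walk_index = 0;          /* current bucket */

   Walker(std::vector<Node*> &t) : table(t) {}

   Node *first() {
      walk_index = 0;
      walkptr = nullptr;
      return next();
   }
   Node *next() {
      if (walkptr) {
         walkptr = walkptr->next;    /* advance within the chain */
      }
      while (!walkptr && walk_index < table.size()) {
         walkptr = table[walk_index++];   /* skip to next non-empty bucket */
      }
      return walkptr;
   }
};
```

Returning the pair to the caller would mean every user of first()/next() has to carry both values around, which is exactly the encapsulation leak described above.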

> 
> >    void hash_index(char *key);        /* produce hash key,index */
> >    void grow_table();                 /* grow the table */
> > public:
> >    htable(void *item, void *link, int tsize = 31);
> >    ~htable() { destroy(); }
> >    void init(void *item, void *link, int tsize = 31);
> >    bool  insert(char *key, void *item);
> >    void *lookup(char *key);
> >    void *first();                     /* get first item in table */
> >    void *next();                      /* get next item in table */
> >    void destroy();
> >    void stats();                      /* print stats about the table */
> >    uint32_t size();                   /* return size of table */
> > };
> 
> ----------------------------------------htable.c
> 
> 
> > /*
> >  * Create hash of key, stored in hash then
> >  *  create and return the pseudo random bucket index
> >  */
> > void htable::hash_index(char *key)
> > {
> >    hash = 0;
> >    for (char *p=key; *p; p++) {
> >       hash += (hash << 3) + (uint32_t)*p;
> >    }
> 
> This is just shifting and multiplication with the bytes in key. I
> would expect this hash to have quite some unnecessary collisions
> when used on strings, though I'm by no means an expert in this area.

If I am not mistaken, this is a *very* common technique.  The most common 
algorithm shifts by 5 rather than 3, but if I remember right my tests showed 
that 3 was better -- at least with the filenames on my system.
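
For the record, hash += (hash << 3) + c is hash = hash*9 + c, while the shift-by-5 form gives hash*33 + c (as in Bernstein's djb2). A sketch of the family:

```cpp
#include <cstdint>

// Sketch of the shift-and-add string-hash family discussed above.
//   shift 3: hash = hash*9  + c   (what hash_index() uses)
//   shift 5: hash = hash*33 + c   (the more common variant, cf. djb2)
uint32_t hash_shift(const char *key, unsigned shift)
{
   uint32_t hash = 0;
   for (const char *p = key; *p; p++) {
      hash += (hash << shift) + (uint32_t)(unsigned char)*p;
   }
   return hash;
}
```

Which multiplier wins depends on the key distribution, which is why the filename tests mattered.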

> 
> >    /* Multiply by large prime number, take top bits, mask for remainder */
> >    index = ((hash * 1103515249) >> rshift) & mask;
> >    Dmsg2(100, "Leave hash_index hash=0x%x index=%d\n", hash, index);
> >    return;
> 
> The return statement is unnecessary.

Yes, I have removed it as it could be confusing.



> 
> > void htable::init(void *item, void *link, int tsize)
> > {
> >    int pwr;
> >    tsize >>= 2;
> >    for (pwr=0; tsize; pwr++) {
> >       tsize >>= 1;
> >    }
> >    loffset = (char *)link - (char *)item;
> 
> This is a constant, known at compile time, so make it a preprocessor
> define using offsetof.

Uh, what is "This is a constant"?   I don't see any constant. loffset depends 
on how one calls the routines.  Rather than using a C++ template, which 
duplicates the code for each different structure that contains a hash link, 
Bacula computes the offset of the link in the structure, and thus with the 
same source code can handle any number of different user structures that have 
hash links.  I have done the same for alist (allocated list) and dlist 
(doubly linked lists).  The downside of this technique is that the higher 
level code must cast because it passes void pointers to the structures.

It is a bit of a tricky point, so I'm not sure the above is clear.
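
Sketched with a hypothetical user structure (names are illustrative, not the actual Bacula code):

```cpp
#include <cstddef>

// Sketch of the intrusive-link technique: the user embeds an hlink in
// his own structure, init() records the link's byte offset once, and
// the same pointer arithmetic recovers the item from the link later.
// One non-template htable thus serves any number of user structures.
struct hlink_s { void *next; const char *key; };

struct my_item {                     /* hypothetical user structure */
   int payload;
   hlink_s link;                     /* embedded hash link */
};

int link_offset(void *item, void *link)
{
   return (int)((char *)link - (char *)item);   /* what init() computes */
}

void *item_from_link(void *link, int loffset)
{
   return (char *)link - loffset;    /* recover the user item */
}
```

So loffset is a run-time value chosen by the caller, not a compile-time constant the htable source could know via offsetof.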

> 
> >    mask = ~((~0)<<pwr);               /* 3 bits => table size = 8 */
> >    rshift = 30 - pwr;                 /* start using bits 28, 29, 30 */
> >    num_items = 0;                     /* number of entries in table */
> >    buckets = 1<<pwr;                  /* hash table size -- power of two */
> >    max_items = buckets * 4;           /* allow average 4 entries per chain */
> 
> As max_items is always four times the number of buckets, we could
> get rid of max_items. 

Yes, good. I agree.

> More importantly, I think it is not optimal 
> to store that many keys in the table because that means we'll have
> in average four hits per bucket. I'd rather choose max_items _smaller_
> than the number of buckets. For instance the git code which uses hash
> tables extensively, grows its table whenever it is filled 66%.


I'd like to see some tests on this.  I'm not convinced that the % full is a 
good criterion because you could then have one bucket that has 100 items in 
it which could be catastrophic.

Once I implement a bloom filter, only 15% (with the parameters that I have 
chosen) of the collisions will require a search of the buckets.
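
Roughly, the idea is this (a quick sketch only -- the bit-array size and hash functions here are illustrative, not the parameters I have actually chosen):

```cpp
#include <bitset>
#include <cstdint>
#include <string>

// Minimal bloom-filter sketch: a cheap membership pre-check that can say
// "definitely absent", so most lookups of missing keys never have to
// search a bucket chain. Sizes and hashes are illustrative only.
struct BloomSketch {
   std::bitset<8192> bits;

   static uint32_t h1(const std::string &k) {
      uint32_t h = 0;
      for (unsigned char c : k) h += (h << 3) + c;
      return h;
   }
   static uint32_t h2(const std::string &k) {
      uint32_t h = 5381;
      for (unsigned char c : k) h = h * 33 + c;
      return h;
   }
   void add(const std::string &k) {
      bits.set(h1(k) % bits.size());
      bits.set(h2(k) % bits.size());
   }
   bool maybe_contains(const std::string &k) const {
      return bits.test(h1(k) % bits.size()) &&
             bits.test(h2(k) % bits.size());
   }
};
```

The filter can return false positives (which then fall through to a normal chain search) but never false negatives.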

> 
> >    table = (hlink **)malloc(buckets * sizeof(hlink *));
> 
> unnecessary cast.

In C++ code as opposed to C the cast is *required* or you get a compiler 
warning (or perhaps an error -- I forget).

> 
> >    memset(table, 0, buckets * sizeof(hlink *));
> 
> how about calloc() instead of malloc + memset?

I'm not sure that smartall (our underlying malloc routine) implements 
calloc().  I'll look at it.
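
For what it's worth, here is the difference in miniature (illustrative names): calloc() gives zeroed storage in one call, and the cast is exactly what C++ demands that C does not:

```cpp
#include <cstdlib>

// In C a void* converts implicitly to any object pointer; in C++ it
// does not, so the cast below is required for the code to compile.
struct hlink2 { hlink2 *next; };

hlink2 **alloc_table(unsigned buckets)
{
   /* calloc() replaces the malloc + memset pair and zero-fills */
   hlink2 **t = (hlink2 **)calloc(buckets, sizeof(hlink2 *));
   return t;
}
```

Whether this helps in practice depends on smartall actually providing a calloc() equivalent.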

> 
> > /*
> >  * Take each hash link and walk down the chain of items
> >  *  that hash there counting them (i.e. the hits), 
> >  *  then report that number.
> 
> Open addressing might be better than chaining to deal with
> hash collisions. In particular if items are never (or rarely)
> removed from the table.

Open addressing has the big advantage that it keeps entries together which 
improves CPU speed since everything is in the cache, but unless I am missing 
something, open addressing is extremely expensive in terms of memory 
requirements since the table has to always be the maximum size.  This could 
be reasonable if we know how many items are going to be added, but if we do 
not, I'm not sure.

I would like to see some real performance studies before changing.
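
For comparison, open addressing with linear probing looks roughly like this (an illustrative sketch, not a proposed implementation):

```cpp
#include <cstdint>
#include <cstring>

// Open-addressing sketch: colliding keys go into the next free slot of
// one flat array, so lookups stay cache-friendly -- but the table must
// always be larger than the number of stored items.
#define TABLE_SIZE 8                 /* must exceed the item count */

struct slot { const char *key; void *value; };
static slot table_[TABLE_SIZE];      /* zero-initialized */

static uint32_t hash_key(const char *key)
{
   uint32_t h = 0;
   for (const char *p = key; *p; p++) {
      h += (h << 3) + (uint32_t)(unsigned char)*p;
   }
   return h;
}

bool oa_insert(const char *key, void *value)
{
   for (uint32_t i = 0; i < TABLE_SIZE; i++) {
      slot *s = &table_[(hash_key(key) + i) % TABLE_SIZE];  /* probe */
      if (!s->key || strcmp(s->key, key) == 0) {
         s->key = key;
         s->value = value;
         return true;
      }
   }
   return false;                     /* table full */
}

void *oa_lookup(const char *key)
{
   for (uint32_t i = 0; i < TABLE_SIZE; i++) {
      slot *s = &table_[(hash_key(key) + i) % TABLE_SIZE];
      if (!s->key) return nullptr;   /* empty slot ends the probe run */
      if (strcmp(s->key, key) == 0) return s->value;
   }
   return nullptr;
}
```

Note the lookup only terminates early at an empty slot, which is why deletions are awkward and why the table can never be allowed to fill.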

> 
> >  *  Obviously, the more hits in a chain, the more time
> >  *  it takes to reference them. Empty chains are not so
> >  *  hot either -- as it means unused or wasted space.
> >  */
> > #define MAX_COUNT 20
> 
> Unless I'm mistaken, the code will fail badly if there are more than
> MAX_COUNT keys that hash to the same value.

Quite possibly, but I think there are some mitigating factors:
1. If I remember right, the max should be 4.
2. This code is really used for debug.
3. If it fails, the stats will be broken, but there will be no serious
consequences.
For just obtaining stats, I didn't want to spend too much time worrying about 
covering all possibilities.
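
If it ever matters, the cheap fix is to clamp overlong chains into the last histogram bin (a sketch, not committed code):

```cpp
#define MAX_COUNT 20

// Sketch: record a chain length into the hits[] histogram, clamping any
// chain of MAX_COUNT or more into the last bin instead of indexing past
// the end of the array.
void count_hit(int hits[MAX_COUNT], int chain_len)
{
   if (chain_len >= MAX_COUNT) {
      chain_len = MAX_COUNT - 1;     /* overlong chains share the last bin */
   }
   hits[chain_len]++;
}
```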

> 
> > void htable::stats()
> > {
> >    int hits[MAX_COUNT];
> >    int max = 0;
> >    int i, j;
> >    hlink *p;
> >    printf("\n\nNumItems=%d\nTotal buckets=%d\n", num_items, buckets);
> >    printf("Hits/bucket: buckets\n");
> 
> Dmesg?

Yes, that would clearly be better.

> 
> >    for (i=0; i < MAX_COUNT; i++) {
> >       hits[i] = 0;
> >    }
> 
> memset(hits, 0, MAX_COUNT * sizeof(int));
> 
> >    for (i=0; i<(int)buckets; i++) {
> >       p = table[i];
> >       j = 0;
> >       while (p) {
> >          p = (hlink *)(p->next);
> 
> cast would become unnecessary if next were struct hlink*.

Yup :-)

> 
> > void htable::grow_table()
> > {
> >    Dmsg1(100, "Grow called old size = %d\n", buckets);
> >    /* Setup a bigger table */
> >    htable *big = (htable *)malloc(sizeof(htable));
> 
> unnecessary cast

See above.  Actually, if I were doing this right, I would probably use one of 
those new fangled C++ casts, but I find them so ugly and the syntax so hard 
to remember that I avoid them.

> 
> >    big->loffset = loffset;
> >    big->mask = mask<<1 | 1;
> >    big->rshift = rshift - 1;
> >    big->num_items = 0;
> >    big->buckets = buckets * 2;
> >    big->max_items = big->buckets * 4;
> 
> DRY ;)
> 
> >    big->table = (hlink **)malloc(big->buckets * sizeof(hlink *));
> >    memset(big->table, 0, big->buckets * sizeof(hlink *));
> 
> unnecessary cast, calloc().
> 
> >    /*
> >     * We walk through the old smaller tree getting items,
> >     * but since we are overwriting the collision links, we must
> >     * explicitly save the item->next pointer and walk each
> >     * collision chain ourselves.  We do use next() for getting
> >     * to the next bucket.
> >     */
> >    for (void *item=first(); item; ) {
> >       void *ni = ((hlink *)((char *)item+loffset))->next;  /* save link overwritten by insert */
> 
> It might be worth to improve readability by defining
> 
> #define next_item(item) ((hlink *)((char *)((item)+(loffset))

Yes, good idea. I was planning to do that as I did in dlist.c.
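
As quoted, the suggested macro's parentheses don't balance; a corrected, compilable sketch (passing loffset explicitly, with illustrative names) would be:

```cpp
#include <cstddef>

struct hlink_demo { void *next; const char *key; };

// Corrected sketch of the suggested helper: item pointer -> embedded
// link. loffset is the byte offset of the link in the user structure;
// in htable it would come from the class member rather than an argument.
#define hash_link(item, loffset) \
   ((hlink_demo *)((char *)(item) + (loffset)))

struct demo_item { int payload; hlink_demo link; };  /* hypothetical */
```

With that in place, the three casts in grow_table() collapse to hash_link(item, loffset)->next and friends.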

> 
> as this calculation is needed several times.
> 
> >       Dmsg1(100, "Grow insert: %s\n", ((hlink *)((char *)item+loffset))->key);
> >       big->insert(((hlink *)((char *)item+loffset))->key, item);
> >       if (ni) {
> >          item = (void *)((char *)ni-loffset);
> >       } else {
> >          walkptr = NULL;
> >          item = next();
> >       }
> >    }
> 
> This rather ugly code would become unnecessary when using open
> addressing instead of chaining.

Quite possibly, but see my remarks about open addressing above.

> 
> > void *htable::next()
> > {
> >    Dmsg1(100, "Enter next: walkptr=0x%x\n", (unsigned)walkptr);
> 
> 
> >    if (walkptr) {
> >       walkptr = (hlink *)(walkptr->next);
> 
> unnecessary cast

C++  :-)

Thanks for taking the time to look at this -- you are the first person who has 
been interested in this kind of code. I'll go over each of your points in 
detail -- especially since I will be going through this code in the near 
future to bring it up to date (#defines to simplify the pointer offsets), ...

Don't hesitate to respond if some of my remarks above are incorrect.

What always amuses me when I look back at code like this is that I wrote it 4 
years ago to be able to implement the base file project (de-duplication) and 
the project to track files added/removed from the system, but I've been so 
busy with other priority projects that I have not had time to get back and 
actually use these routines.

The same happened for my red/black tree (src/lib/rblist.c/h), except in that 
case, I have now actually put the code to use in the restore code, giving a 
513-fold speed-up in version 2.1.x compared to 2.0.x for 800,000 files.

Best regards,

Kern

> 
> 
> Regards
> Andre
> -- 
> The only person who always got his work done by Friday was Robinson Crusoe
> 

_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
