2011/2/15 Branko Čibej <br...@e-reka.si>: > On 15.02.2011 01:42, Johan Corveleyn wrote: >> 2) Faster hash function for hashing lines. [ Quick win ] >> >> - Currently, we use adler32 (after the line has been normalized). >> >> - GNU diff uses some simple bit shifting scheme, which seems to be a >> lot faster (maybe it has more hash-collisions, but that doesn't seem >> to matter that much). > > Is it (x << 5) + x? If so, that's the (in)famous "times 33" hash > function, also see the essay in apr_hashfunc_default in > http://svn.apache.org/viewvc/apr/apr/trunk/tables/apr_hash.c?view=markup
No, it's actually the following (see http://git.savannah.gnu.org/cgit/diffutils.git/tree/src/io.c): /* Rotate an unsigned value to the left. */ #define ROL(v, n) ((v) << (n) | (v) >> (sizeof (v) * CHAR_BIT - (n))) /* Given a hash value and a new character, return a new hash value. */ #define HASH(h, c) ((c) + ROL (h, 7)) (which is then called for every character in a for-loop (while "normalizing" the characters for ignore-xxx options)). I think this code pre-dates the technique used in apr_hash.c (the "essay" you refer to in apr_hashfunc_default mentions a 1992 article, referencing a 1990 Usenet message; that code in GNU diff is, I think, already there since 1986 or something). So it's possible that it's a little bit less optimal than the one from apr_hash.c (don't know, really). > I prefer APR's implementation, because a reasonably good compiler these > days /should/ be able to optimize (x * 33) to ((x << 5) + x), depending > on the characteristics of the target architecture. Yes, I think I'll give it a try for hashing the lines for diff. Good idea (I don't know my way very well around the libraries yet, so thanks a lot for the pointer). > In other news, why do we need our own hash function anyway? Or our own > table or tree or whatever? NIH syndrome? No idea, really. I'm probably not the guy you're expecting an answer from :-), but: - apr_hashfunc_default could be very useful here, I think. - Concerning the tree (see token.c#tree_insert_token), maybe the reason is that some special things are done while inserting nodes into the tree? (like first comparing with the hash, and if that matches, compare the lines themselves; and letting all matching nodes point to the same "token" (line), when they're being inserted). Cheers, -- Johan