Re: Further diff optimization ideas

Johan Corveleyn Tue, 15 Feb 2011 04:05:49 -0800

2011/2/15 Branko Čibej <br...@e-reka.si>:
> On 15.02.2011 01:42, Johan Corveleyn wrote:
>> 2) Faster hash function for hashing lines. [ Quick win ]
>>
>> - Currently, we use adler32 (after the line has been normalized).
>>
>> - GNU diff uses some simple bit shifting scheme, which seems to be a
>> lot faster (maybe it has more hash-collisions, but that doesn't seem
>> to matter that much).
>
> Is it (x << 5) + x? If so, that's the (in)famous "times 33" hash
> function, also see the essay in apr_hashfunc_default in
> http://svn.apache.org/viewvc/apr/apr/trunk/tables/apr_hash.c?view=markup


No, it's actually the following (see
http://git.savannah.gnu.org/cgit/diffutils.git/tree/src/io.c):

    /* Rotate an unsigned value to the left.  */
    #define ROL(v, n) ((v) << (n) | (v) >> (sizeof (v) * CHAR_BIT - (n)))

    /* Given a hash value and a new character, return a new hash value.  */
    #define HASH(h, c) ((c) + ROL (h, 7))

(which is then called for every character in a for-loop (while
"normalizing" the characters for ignore-xxx options)).

I think this code pre-dates the technique used in apr_hash.c (the
"essay" you refer to in apr_hashfunc_default mentions a 1992 article,
referencing a 1990 Usenet message; that code in GNU diff is, I think,
already there since 1986 or something). So it's possible that it's a
little bit less optimal than the one from apr_hash.c (don't know,
really).

> I prefer APR's implementation, because a reasonably good compiler these
> days /should/ be able to optimize (x * 33) to ((x << 5) + x), depending
> on the characteristics of the target architecture.

Yes, I think I'll give it a try for hashing the lines for diff. Good
idea (I don't know my way very well around the libraries yet, so
thanks a lot for the pointer).

> In other news, why do we need our own hash function anyway? Or our own
> table or tree or whatever? NIH syndrome?

No idea, really. I'm probably not the guy you're expecting an answer
from :-), but:
- apr_hashfunc_default could be very useful here, I think.
- Concerning the tree (see token.c#tree_insert_token), maybe the
reason is that some special things are done while inserting nodes into
the tree? (like first comparing with the hash, and if that matches,
compare the lines themselves; and letting all matching nodes point to
the same "token" (line), when they're being inserted).

Cheers,
-- 
Johan

Re: Further diff optimization ideas

Reply via email to