On 17/07/2006 10:13 PM, Arash Partow wrote: > John Machin wrote: >> Who is likely to bother? In timbot we trust. Have you read the comments >> at the start of Objects/dictobject.c? >> > No I haven't probably wont be anytime soon,
Perhaps you should, if you profess an interest in hashed lookup -- it gives some interesting commentary on the second aspect: collision handling. What matters is the *total* time to return an answer. > as far as time, well > people interested, as is how started my original port, would be more > than willing to try/assess the routines for sets of strings that they > wish to hash etc. > this site may help explain plus has some code > snippets that may help you understand what I mean. > > http://www.partow.net/programming/hashfunctions/index.html That JSHAsh allegedly written by "Justin Sobel": by coincidence, there's a Melbourne academic named Justin Zobel who has done something amazingly similar: http://goanna.cs.rmit.edu.au/~hugh/zhw-ipl.html Searching for "Justin Sobel" did lead me to a Russian website which apart from repeating your typo/reado/whatevero did propose (and test) some more hash functions. http://vak.ru/doku.php/proj/hash/efficiency-en In particular look at the "rot13" function which was right up near the front as far as number of collisions goes, and which would appear (my guess based on reading the source) to be very fast (with the right compiler (e.g. gcc 4)). > > >> A few more questions: >> >> Have you tested that these functions produce the same output (apart from >> the 31-bit thing) as the originals that you worked from? The reason for >> asking is that Python unlike C doesn't lose any bits in the << >> operation; if this is followed by a >> you may be shifting some unwanted >> 1-bits back in again. >> > > Some of them do others don't (not really important unless you are > trying to be compatible with other implementations which this is > not), I would have thought it important especially in the case of well-known functions whose properties have been discussed in the literature that you should not publish a version that gives a different answer, without noting that fact prominently. > I am aware of python not truncating/wrapping of values under > various operations, I believe its a nice little side effect from > python which gives more bits to play with as long as you don't > truncate them as I have. > > >> Talking about not losing bits: For your 36-byte example input, the >> SDBMHash (with its << 16) is up to about 566 bits before you truncate it >> to 31. A little over the top, perhaps. Maybe not the fastest way of >> doing it. >> > Possibly, do you have a better solution I'm very keen to learn... You can't avoid using Python longs if you want to simulate unsigned 32-bit arithmetic. However judicious truncation can be used to stop the longs becoming longer and slower. General rules: 1. Avoid exceeding 32 bits where possible. E.g. instead of hash <<= 8 do hash = (hash & 0xFFFFFF) << 8 2. Where unavoidable (e.g. hash *= constant), use hash &= 0xFFFFFFFF to chop back to 32 bits, once per iteration. > > >> What is the purpose of the calls to long() in PJWHash? >> > trying to cast to long, looking at it now its rather superfluous. > > >> And the $64K question: What is the quintessential difference between >> PJWHash and ELFHash? >> > Nothing, elf is essentially pjw, its just optimised for 32-bit systems > in that the calculation for th's etc are static where has pjw > required sizeof to calc the th's You've found a C compiler where sizeof(unsigned int) is not static i.e. calculated by the compiler at compile time??? > which i couldn't find a way of doing, > so i fudged it in the hope that maybe sometime in the future a work > around of sorts could be developed. Google is a wonderful thing: http://users.physik.tu-muenchen.de/gammel/matpack/html/LibDoc/Strings/strings.html """ Thanks to Josh Bloch ([EMAIL PROTECTED]) who also informed me about another fault that is found in Aho, Sethi and Ullman's book: The line with h ^= (g >> 28) now replaces the original h ^= (g >> 24). According to a correspondence of Josh Bloch with Ravi Sethi this correction will be made in the next edition of the book. Including these two changes this hash function is now comparable to Vo's, Torek's and WAIS's hash functions. """ (1) Whoops! (2) Vo? Torek? WAIS? Could these be possible additions to your website? http://www.math.columbia.edu/~bayer/annote/root/root.html """ Peter J. Weinberger hash function; see e.g. 21st Century Compilers, by Alfred V. Aho, Ravi Sethi, Monica Lam, Jeffrey D. Ullman, ISBN 0321131436. Hash unsigned X into H, using the temporary variable G. G and H are unsigned variables; X may be an expression. G is nonzero e.g. 92% of time, so a conditional expression would be slower. As noted by Josh Bloch, 28 is the correct replacement for the frequent misprint 24. #define HASH(G,H,X) ( H <<= 4, H += (X), G = H & 0xf0000000, H ^= G >> 28, H ^= G ) """ Cheers, John -- http://mail.python.org/mailman/listinfo/python-list