General Hash Functions In Python
Hi all, I've ported various hash functions to python if anyone is interested: def RSHash(key): a= 378551 b= 63689 hash = 0 for i in range(len(key)): hash = hash * a + ord(key[i]) a = a * b return (hash & 0x7FFF) def JSHash(key): hash = 1315423911 for i in range(len(key)): hash ^= ((hash << 5) + ord(key[i]) + (hash >> 2)) return (hash & 0x7FFF) def PJWHash(key): BitsInUnsignedInt = 4 * 8 ThreeQuarters = long((BitsInUnsignedInt * 3) / 4) OneEighth = long(BitsInUnsignedInt / 8) HighBits = (0x) << (BitsInUnsignedInt - OneEighth) hash = 0 test = 0 for i in range(len(key)): hash = (hash << OneEighth) + ord(key[i]) test = hash & HighBits if test != 0: hash = (( hash ^ (test >> ThreeQuarters)) & (~HighBits)); return (hash & 0x7FFF) def ELFHash(key): hash = 0 x= 0 for i in range(len(key)): hash = (hash << 4) + ord(key[i]) x = hash & 0xF000 if x != 0: hash ^= (x >> 24) hash &= ~x return (hash & 0x7FFF) def BKDRHash(key): seed = 131 # 31 131 1313 13131 131313 etc.. hash = 0 for i in range(len(key)): hash = (hash * seed) + ord(key[i]) return (hash & 0x7FFF) def SDBMHash(key): hash = 0 for i in range(len(key)): hash = ord(key[i]) + (hash << 6) + (hash << 16) - hash; return (hash & 0x7FFF) def DJBHash(key): hash = 5381 for i in range(len(key)): hash = ((hash << 5) + hash) + ord(key[i]) return (hash & 0x7FFF) def DEKHash(key): hash = len(key); for i in range(len(key)): hash = ((hash << 5) ^ (hash >> 27)) ^ ord(key[i]) return (hash & 0x7FFF) def APHash(key): hash = 0 for i in range(len(key)): if ((i & 1) == 0): hash ^= ((hash << 7) ^ ord(key[i]) ^ (hash >> 3)) else: hash ^= (~((hash << 11) ^ ord(key[i]) ^ (hash >> 5))) return (hash & 0x7FFF) print RSHash ('abcdefghijklmnopqrstuvwxyz1234567890') print JSHash ('abcdefghijklmnopqrstuvwxyz1234567890') print PJWHash ('abcdefghijklmnopqrstuvwxyz1234567890') print ELFHash ('abcdefghijklmnopqrstuvwxyz1234567890') print BKDRHash('abcdefghijklmnopqrstuvwxyz1234567890') print SDBMHash('abcdefghijklmnopqrstuvwxyz1234567890') print DJBHash ('abcdefghijklmnopqrstuvwxyz1234567890') print DEKHash ('abcdefghijklmnopqrstuvwxyz1234567890') print APHash ('abcdefghijklmnopqrstuvwxyz1234567890') Arash Partow Be one who knows what they don't know, Instead of being one who knows not what they don't know, Thinking they know everything about all things. http://www.partow.net -- http://mail.python.org/mailman/listinfo/python-list
Re: General Hash Functions In Python
Hi Paul, For different data types different hash functions work better/worse aka fewer or more collisions. I believe the more choice people have and also the more ways people see how a particular thing can be done, then the easier it will be for them to come up with their own specific efficient solutions. That said, I believe at least one (most likely more) of the hash functions in the group above will most always work better (ala less collisions) than the standard default hash available in the built-in dict for any random set of strings. Please feel free to prove me wrong :) Arash Partow Be one who knows what they don't know, Instead of being one who knows not what they don't know, Thinking they know everything about all things. http://www.partow.net Paul Rubin wrote: > "Arash Partow" <[EMAIL PROTECTED]> writes: > > I've ported various hash functions to python if anyone is interested: > > Are these useful for any particular interoperability purposes? > Otherwise I'd say just one such hash is plenty, or maybe don't even > bother, since Python's built-in dicts are sufficient for most > applications that call for hashing, and the cryptographic hashes are > available for when you really care about uniformity of the hash > output. > > > for i in range(len(key)): > > hash = hash * a + ord(key[i]) > > a = a * b > > As a stylistic issue, I'd write this: > > for c in key: > hash = hash * a + ord(c) > a *= b > > and similarly for the other similar loops. -- http://mail.python.org/mailman/listinfo/python-list
Re: General Hash Functions In Python
Trial and error - how else? :) I believe finding a perfect hash function for every kind and combination of data is a very very time consuming operation. Also there is the common case where the data is online (ie: stateless) that said it doesn't mean you can't make assumptions about the kind of data. Arash Partow Be one who knows what they don't know, Instead of being one who knows not what they don't know, Thinking they know everything about all things. http://www.partow.net Paul Rubin wrote: > "Arash Partow" <[EMAIL PROTECTED]> writes: > > For different data types different hash functions work > > better/worse aka fewer or more collisions. > > But you give no indication of which of those hashes works best for > what kind of data. How is the user supposed to figure out which one > to choose? -- http://mail.python.org/mailman/listinfo/python-list
Re: General Hash Functions In Python
That is true, but I'm not about to do something that might potentially prove my point wrong... :) Arash Partow Be one who knows what they don't know, Instead of being one who knows not what they don't know, Thinking they know everything about all things. http://www.partow.net Robert Kern wrote: > > You have it backwards. You are the one making the positive assertion. You get > to > prove yourself correct. > > -- > Robert Kern > > "I have come to believe that the whole world is an enigma, a harmless enigma > that is made terrible by our own mad attempt to interpret it as though it > had > an underlying truth." >-- Umberto Eco -- http://mail.python.org/mailman/listinfo/python-list
Re: General Hash Functions In Python
John Machin wrote: > Who is likely to bother? In timbot we trust. Have you read the comments > at the start of Objects/dictobject.c? > No I haven't probably wont be anytime soon, as far as time, well people interested, as is how started my original port, would be more than willing to try/assess the routines for sets of strings that they wish to hash etc.. this site may help explain plus has some code snippets that may help you understand what I mean. http://www.partow.net/programming/hashfunctions/index.html > [Aside: it's not "in the built-in dict". Any type which wants its > instances to be hashable defines its own hash method, one that suits the > type.] > > This belief would be based on: > (a) actual testing by you > or (b) a refereed academic paper which did such tests on hash functions > (including the Python "standard default hash") > or (c) ...??? > Experience has shown me that the belief than one "default" way of hashing is generally not the optimal way for the overwhelming majority of data, but then again > What is the Python "standard default hash", anyway? It is written in C. > Wouldn't it have been a good idea to provide a Python function for it, > so that people who are going to "come up with their own specific > efficient solutions" had something to compare with? Or perhaps they are > intended to "port" your functions back into C? > The above link provides a very simple hash test framework, the "standard default hash" can be placed in there and tested amongst the other algorithms. > A few more questions: > > Why have you truncated the answers to 31 bits? > algorithms were adapted from unsigned int, i should have removed that final truncation in python. > Have you tested that these functions produce the same output (apart from > the 31-bit thing) as the originals that you worked from? The reason for > asking is that Python unlike C doesn't lose any bits in the << > operation; if this is followed by a >> you may be shifting some unwanted > 1-bits back in again. > Some of them do others don't (not really important unless you are trying to be compatible with other implementations which this is not), I am aware of python not truncating/wrapping of values under various operations, I believe its a nice little side effect from python which gives more bits to play with as long as you don't truncate them as I have. > Talking about not losing bits: For your 36-byte example input, the > SDBMHash (with its << 16) is up to about 566 bits before you truncate it > to 31. A little over the top, perhaps. Maybe not the fastest way of > doing it. > Possibly, do you have a better solution I'm very keen to learn... > What is the purpose of the calls to long() in PJWHash? > trying to cast to long, looking at it now its rather superfluous. > And the $64K question: What is the quintessential difference between > PJWHash and ELFHash? > Nothing, elf is essentially pjw, its just optimised for 32-bit systems in that the calculation for th's etc are static where has pjw required sizeof to calc the th's which i couldn't find a way of doing, so i fudged it in the hope that maybe sometime in the future a work around of sorts could be developed. Again this was just a simple posting, I hoped to maybe gets some comments and may present some ideas to people, I don't expect anyone to drop everything and start using these, just thought it might be something interesting for this NG. btw I'm not a very good python programmer ATM. Arash Partow Be one who knows what they don't know, Instead of being one who knows not what they don't know, Thinking they know everything about all things. http://www.partow.net -- http://mail.python.org/mailman/listinfo/python-list
Re: General Hash Functions In Python
I would like to but I think my lack of language and time will be a barrier. Also I don't believe there is much material there to warrant a technical write up. Arash Partow Be one who knows what they don't know, Instead of being one who knows not what they don't know, Thinking they know everything about all things. http://www.partow.net Stefan Behnel wrote: > Arash Partow wrote: > > I've ported various hash functions to python if anyone is interested: > > [snip] > > Ok, so if you think they are useful, what about writing up an article for the > Python Cookbook that describes their usage and specific > advantages/disadvantages? > > http://aspn.activestate.com/ASPN/Python/Cookbook/ > > Stefan -- http://mail.python.org/mailman/listinfo/python-list
Re: General Hash Functions In Python
John Machin wrote: > Perhaps you should, if you profess an interest in hashed lookup -- it > gives some interesting commentary on the second aspect: collision > handling. What matters is the *total* time to return an answer. > Time or algorithm complexity is merely one aspect of a hash function design. there are many others. > > as far as time, well > > people interested, as is how started my original port, would be more > > than willing to try/assess the routines for sets of strings that they > > wish to hash etc. > > > this site may help explain plus has some code > > snippets that may help you understand what I mean. > > > > http://www.partow.net/programming/hashfunctions/index.html > > That JSHAsh allegedly written by "Justin Sobel": by coincidence, there's > a Melbourne academic named Justin Zobel who has done something amazingly > similar: > > http://goanna.cs.rmit.edu.au/~hugh/zhw-ipl.html > Same guy, he was a lecturer during my uni days. As far as his surname that is another issue altogether. > Searching for "Justin Sobel" did lead me to a Russian website which > apart from repeating your typo/reado/whatevero did propose (and test) > some more hash functions. > > http://vak.ru/doku.php/proj/hash/efficiency-en > > In particular look at the "rot13" function which was right up near the > front as far as number of collisions goes, and which would appear (my > guess based on reading the source) to be very fast (with the right > compiler (e.g. gcc 4)). > I've already spoken to the guy that did those measurements, there are some flaws in the way he represents data which could lead one to make inaccurate assumptions about the "quality" of the hash functions. One of the many issues that i took up with him is that he only used 1 set of data instead of having multiple sets and aggregating the results. Whether or not he decides to re-do is analysis is another issue altogether. TAOCP has a nice section on how potential analysis can be done. > I would have thought it important especially in the case of well-known > functions whose properties have been discussed in the literature that > you should not publish a version that gives a different answer, without > noting that fact prominently. > True, the c versions work fine, i guess the python versions require a bit more work. feel free to re-post modified versions :) > You can't avoid using Python longs if you want to simulate unsigned > 32-bit arithmetic. However judicious truncation can be used to stop the > longs becoming longer and slower. General rules: > 1. Avoid exceeding 32 bits where possible. E.g. instead of > hash <<= 8 > do > hash = (hash & 0xFF) << 8 > 2. Where unavoidable (e.g. hash *= constant), use hash &= 0x to > chop back to 32 bits, once per iteration. > > That is something I thought about, but I also considered, as I mentioned before, the extra bits. The more bits you have to avalanche with - (in a very general and oversimplified way) the better your hash function "can" be. > Thanks to Josh Bloch ([EMAIL PROTECTED]) who also informed me about > another fault that is found in Aho, Sethi and Ullman's book: The line > with h ^= (g >> 28) now replaces the original h ^= (g >> 24). According > to a correspondence of Josh Bloch with Ravi Sethi this correction will > be made in the next edition of the book. Including these two changes > this hash function is now comparable to Vo's, Torek's and WAIS's hash > functions. > """ > > (1) Whoops! (2) Vo? Torek? WAIS? Could these be possible additions to > your website? > Indeed! I had read about WAIS a long time ago, I'll be putting them up very soon, thanks for the input. > http://www.math.columbia.edu/~bayer/annote/root/root.html > """ > Peter J. Weinberger hash function; see e.g. 21st Century Compilers, by > Alfred V. Aho, Ravi Sethi, Monica Lam, Jeffrey D. Ullman, ISBN 0321131436. > > Hash unsigned X into H, using the temporary variable G. G and H are > unsigned variables; X may be an expression. G is nonzero e.g. 92% of > time, so a conditional expression would be slower. As noted by Josh > Bloch, 28 is the correct replacement for the frequent misprint 24. > > #define HASH(G,H,X) ( H <<= 4, H += (X), G = H & 0xf000, H ^= G >> > 28, H ^= G ) > """ I'm planning on adding various integer hash functions as well, just haven't had the time. the above seems like one so I'll give it a go. Arash Partow Be one who knows what they don't know, Instead of being one who knows not what they don't know, Thinking they know everything about all things. http://www.partow.net -- http://mail.python.org/mailman/listinfo/python-list