On Sun, Apr 8, 2012 at 1:55 PM, Stefan Plantikow <stefan.planti...@googlemail.com> wrote: > > Hi, > > > Am Sonntag, 8. April 2012 um 22:32 schrieb Brian Anderson: > >> > Hashing algorithms (hopscotch, bloom filters) could greatly benefit from >> > having access to the llvm bit manipulation intrinsics (ctpop, ctlz, cttz, >> > bswap). I think the general plan was to access these using some form of >> > inline llvm asm. However in the absence of that I wonder wether we should >> > just have support for those directly in core or std for all the integer >> > types (quite some languages do that). >> >> >> We used to have an 'llvm' ABI for native mods that was intended to give >> access to the llvm intrinsics. We could add that back, or just add them >> as rust intrinsics ('rust-intrinsic' ABI) as needed. >> > > That would be nice. For hash table algorithms, ctpop/ctlz/cttz will be really > useful (should also speedup the sudoku benchmark :). And swap is quite > helpful for dealing with utf8 byte order swapping. How would one call these > via the rust intrinsics? I am not deep enough into the llvm-rust-bits. > > >> >> > Feedback/Suggestions? >> > > Is going for hopscotch a good idea? Ah well, will try in any case ;)
One of the most surprisingly awesome hash table algorithms (to me at least) is open addressing based on "robin hood hashing". It's been around for ages, but almost nobody knows about it. It's *such* a simple (i.e. fast!) tweak to the regular algorithm, and it makes all the difference. I implemented the basic ops in C++ and it was something like two orders of magnitude faster than the built in unordered_set in Visual Studio for insertions and one order of magnitude for lookups. Basically, you do the "normal" open addressing where if there's a collision you just linearly walk until you find an empty slot. However, there's a twist. For each filled element you check its "displacement" (i.e. distance from the hash bucket it "wants" to be in to its current location). If the displacement for the value you're trying to place is larger than the displacement of the value already stored at the location then you swap place and keep going with the other value. This simple change causes variance of displacements to drop way down and means you can get 90+% occupancy with just a few probes per query. (Turns out this makes deletions and queries simpler too, because the latter exits based on this displacement invariant (instead of exiting based on finding an empty slot), which means the former can be done by just marking the slot as empty.) You need some way to efficiently check the "displacement" of a stored element. Maybe you store a small number (3 bits is enough) that contains the actual displacement, but a simpler strategy is to just cache the hash value with the element. You usually do this anyway because calling the hash function is sometimes expensive so you want a cheap "early out" when checking equality (especially for things like strings). Anyway, I like this quite a bit better than hopscotch hashing because it's just *so* damn simple. With my suggestion above you get one DWORD of overhead per element (for the cached hash value). No pointers. One cache miss per lookup, typically, and all the code is very simple logic and runs really fast. -- Sebastian Sylvan _______________________________________________ Rust-dev mailing list Rust-dev@mozilla.org https://mail.mozilla.org/listinfo/rust-dev