TLDR: hey, what about using extendible hashing for bitcask keydirs? Constant-time lookups with two disk seeks end-to-end, much larger keyspaces than currently supportable, but without the total rehashing cost. Also avoids the O(log N) insertion/search/deletion costs of b-trees.

At length:

I've been thinking a lot recently about how to do quick lookups of keys where the space is much larger than memory--say, a few billion 32-byte keys. Similarly, bitcask is going to need to store more keys than can fit in an in-memory hashtable at some point.

One possibility is constructing bytewise (or multi-byte-wise) tries from the keys. These have the advantage of being orderable (hmm, range queries? faster bucket listing?), reasonably short, and supporting O(log n) operations. You could cache the initial levels of the trie in memory and drop to disk for the leaves. An adaptive caching algorithm could also be used to maintain frequently accessed leaf nodes in memory. (The FS cache may actually provide acceptable results as well.) It also takes advantage of the relatively low entropy of most Riak keys, and similar keys could be fast to access if they reside in nearby pages.
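
A minimal in-memory sketch of that trie idea, in Python (names and structure are mine, purely illustrative -- a real version would page the leaf levels to disk and cache the hot ones, as above):

    class TrieNode:
        def __init__(self):
            self.children = {}   # next key byte -> child TrieNode
            self.value = None    # set iff a key terminates here

    def trie_insert(root, key, value):
        node = root
        for byte in key:         # one level per byte of the key
            node = node.children.setdefault(byte, TrieNode())
        node.value = value

    def trie_lookup(root, key):
        node = root
        for byte in key:
            node = node.children.get(byte)
            if node is None:
                return None      # no such key
        return node.value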

The major disadvantage is that trees can involve a lot of O(log N) churn for insertions, which... theoretically... sucks on disk. Obviously there are ways to make it perform well because ReiserFS and most DB indexes make use of them, but... maybe there are alternatives.

Ideally we want constant-time operations, but hash tables usually come with awkward rehashing pauses or insane space requirements. O(N) rehashing can block other operations, which blows latency through the roof when disks are involved. Not a good property for a k/v store.

So I started doodling some hybrid tree-hash structures, browsing through NIST's data structures list, and lo and behold, there is actually a structure which combines some of the advantages of tries but behaves well on disk media!

http://www.smckearney.com/adb/notes/lecture.extendible.hashing.pdf

You store values on disk in buckets which are small multiples of the page size. Finding a value involves choosing the right bucket, reading it from disk, and linearly searching it for the value.

To choose the bucket you use an index table which specifies the on-disk address of the right bucket for your value. Here's the catch: you keep the index table small. In fact, it's a bitwise trie (or, even faster, a flattened hash table) keyed on the least significant bits of the hash of the key. As buckets fill up, you split them in half and (possibly) increase the depth of the index, which only means doubling a pointer array. Hence growth/shrinking is incremental and only operates on one bucket at a time.
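
That's the whole trick. Here's a minimal in-memory sketch of it in Python -- assume the flat-directory variant rather than the trie, and that on disk the directory slots would hold bucket addresses instead of object references; all names are mine:

    BUCKET_CAPACITY = 4      # entries per bucket; on disk, one page's worth

    class Bucket:
        def __init__(self, local_depth):
            self.local_depth = local_depth
            self.items = {}  # key -> value

    class ExtendibleHash:
        def __init__(self):
            self.global_depth = 1
            self.directory = [Bucket(1), Bucket(1)]

        def _bucket_for(self, key):
            # Index by the least significant global_depth bits of the hash.
            return self.directory[hash(key) & ((1 << self.global_depth) - 1)]

        def get(self, key):
            return self._bucket_for(key).items.get(key)

        def put(self, key, value):
            bucket = self._bucket_for(key)
            if key in bucket.items or len(bucket.items) < BUCKET_CAPACITY:
                bucket.items[key] = value
            else:
                self._split(bucket)
                self.put(key, value)   # retry; splits again if still full

        def _split(self, bucket):
            if bucket.local_depth == self.global_depth:
                # Index too shallow: double the directory. This is just
                # pointer copies -- no bucket is read or written.
                self.directory += self.directory
                self.global_depth += 1
            bucket.local_depth += 1
            sibling = Bucket(bucket.local_depth)
            bit = 1 << (bucket.local_depth - 1)
            # Entries whose newly significant hash bit is 1 move over.
            for k in list(bucket.items):
                if hash(k) & bit:
                    sibling.items[k] = bucket.items.pop(k)
            # Repoint the directory slots that should now see the sibling.
            for i in range(len(self.directory)):
                if self.directory[i] is bucket and i & bit:
                    self.directory[i] = sibling

Note that a split touches exactly one bucket (plus its new sibling), and even a directory doubling is a memcpy of pointers, so there's never an O(N) rehash. (Deletion/merging is omitted, but it's the mirror image.)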

In the case of bitcask, where values can be variable length, it probably makes sense to store just the file ID/offset in each bucket entry, and take the hit of a second seek in exchange for fixed-size entries and faster, more predictable searching over each bucket.
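
As a sketch of what a bucket entry might look like (a hypothetical layout, not bitcask's actual format): a 64-bit key hash plus file ID, value size, and file offset gives fixed 24-byte entries, one page-sized read per bucket, and the second seek goes to the data file:

    import os, struct

    PAGE_SIZE = 4096
    ENTRY = struct.Struct("<QIIQ")  # key hash, file id, value size, offset
    PER_BUCKET = PAGE_SIZE // ENTRY.size   # 170 entries per 4 KB bucket

    def bucket_lookup(fd, bucket_no, key_hash):
        # Seek #1: read the whole bucket in a single page-sized pread.
        page = os.pread(fd, PAGE_SIZE, bucket_no * PAGE_SIZE)
        for i in range(PER_BUCKET):
            h, file_id, size, offset = ENTRY.unpack_from(page, i * ENTRY.size)
            if h == key_hash:       # (a real bucket would tag empty slots)
                # Seek #2: the caller reads `size` bytes at `offset` in
                # data file `file_id` to fetch the actual value.
                return file_id, size, offset
        return None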

The downside is that the index is still an in-memory hash table, so it only multiplies the number of storable keys by (values per bucket). Perhaps dropping the index to disk as well and taking advantage of the FS cache over it could work?
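
Back of the envelope, using the 24-byte entries from the sketch above: ~170 keys per 4 KB bucket, so a fully split 2^24-slot directory (128 MB of 8-byte pointers in RAM) addresses up to 170 * 2^24, or about 2.8 billion keys -- the scale I was worrying about at the top.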

--Kyle Kingsbury
