TLDR: hey, what about using extendible hashing for bitcask keydirs? Constant-time lookups with two disk seeks end-to-end, much larger keyspaces than currently supportable, but without the total rehashing cost. Also avoids the O(log N) insertion/search/deletion costs of b-trees.

At length:

I've been thinking a lot recently about how to do quick lookups of keys where the space is much larger than memory--say, a few billion 32-byte keys. Similarly, bitcask is going to need to store more keys than can fit in an in-memory hashtable at some point.

One possibility is constructing bytewise (or multi-byte-wise) tries from the keys. These have the advantage of being orderable (hmm, range queries? faster bucket listing?), reasonably short, and supporting O(log n) operations. You could cache the initial levels of the trie in memory and drop to disk for the leaves. An adaptive caching algorithm could also be used to maintain frequently accessed leaf nodes in memory. (The FS cache may actually provide acceptable results as well.) It also takes advantage of the relatively low entropy of most Riak keys, and similar keys could be fast to access if they reside in nearby pages.
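
A minimal in-memory sketch of that trie idea, in Python (names and structure are mine, purely illustrative -- a real version would page the leaf levels to disk and cache the hot ones, as above):

    class TrieNode:
        def __init__(self):
            self.children = {}   # next key byte -> child TrieNode
            self.value = None    # set iff a key terminates here

    def trie_insert(root, key, value):
        node = root
        for byte in key:         # one level per byte of the key
            node = node.children.setdefault(byte, TrieNode())
        node.value = value

    def trie_lookup(root, key):
        node = root
        for byte in key:
            node = node.children.get(byte)
            if node is None:
                return None      # no such key
        return node.value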

The major disadvantage is that trees can involve a lot of O(log N) churn for insertions, which... theoretically... sucks on disk. Obviously there are ways to make it perform well because ReiserFS and most DB indexes make use of them, but... maybe there are alternatives.

Ideally we want constant-time operations, but hash tables usually come with awkward rehashing pauses or insane space requirements. O(N) rehashing can block other operations, which blows latency through the roof when disks are involved. Not a good property for a k/v store.

So I started doodling some hybrid tree-hash structures, browsing through NIST's data structures list, and lo and behold, there is actually a structure which combines some of the advantages of tries but behaves well on disk media!

http://www.smckearney.com/adb/notes/lecture.extendible.hashing.pdf

You store values on disk in buckets which are small multiples of the page size. Finding a value involves choosing the right bucket, reading it from disk, and linearly searching it for the value.

To choose the bucket you use an index table which specifies the on-disk address of the right bucket for your value. Here's the catch: you keep the index table small. In fact, it's a bitwise trie (or, even faster, a flattened hash table) keyed on the least significant bits of the hash of the key. As buckets fill up, you split them in half and (possibly) increase the depth of the index, which only means doubling a pointer array. Hence growth/shrinking is incremental and only operates on one bucket at a time.
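
That's the whole trick. Here's a minimal in-memory sketch of it in Python -- assume the flat-directory variant rather than the trie, and that on disk the directory slots would hold bucket addresses instead of object references; all names are mine:

    BUCKET_CAPACITY = 4      # entries per bucket; on disk, one page's worth

    class Bucket:
        def __init__(self, local_depth):
            self.local_depth = local_depth
            self.items = {}  # key -> value

    class ExtendibleHash:
        def __init__(self):
            self.global_depth = 1
            self.directory = [Bucket(1), Bucket(1)]

        def _bucket_for(self, key):
            # Index by the least significant global_depth bits of the hash.
            return self.directory[hash(key) & ((1 << self.global_depth) - 1)]

        def get(self, key):
            return self._bucket_for(key).items.get(key)

        def put(self, key, value):
            bucket = self._bucket_for(key)
            if key in bucket.items or len(bucket.items) < BUCKET_CAPACITY:
                bucket.items[key] = value
            else:
                self._split(bucket)
                self.put(key, value)   # retry; splits again if still full

        def _split(self, bucket):
            if bucket.local_depth == self.global_depth:
                # Index too shallow: double the directory. This is just
                # pointer copies -- no bucket is read or written.
                self.directory += self.directory
                self.global_depth += 1
            bucket.local_depth += 1
            sibling = Bucket(bucket.local_depth)
            bit = 1 << (bucket.local_depth - 1)
            # Entries whose newly significant hash bit is 1 move over.
            for k in list(bucket.items):
                if hash(k) & bit:
                    sibling.items[k] = bucket.items.pop(k)
            # Repoint the directory slots that should now see the sibling.
            for i in range(len(self.directory)):
                if self.directory[i] is bucket and i & bit:
                    self.directory[i] = sibling

Note that a split touches exactly one bucket (plus its new sibling), and even a directory doubling is a memcpy of pointers, so there's never an O(N) rehash. (Deletion/merging is omitted, but it's the mirror image.)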

In the case of bitcask, where values can be variable length, it probably makes sense to store just the file ID/offset in each bucket entry, and take the hit of a second seek in exchange for fixed-size entries and faster, more predictable searching over each bucket.
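
As a sketch of what a bucket entry might look like (a hypothetical layout, not bitcask's actual format): a 64-bit key hash plus file ID, value size, and file offset gives fixed 24-byte entries, one page-sized read per bucket, and the second seek goes to the data file:

    import os, struct

    PAGE_SIZE = 4096
    ENTRY = struct.Struct("<QIIQ")  # key hash, file id, value size, offset
    PER_BUCKET = PAGE_SIZE // ENTRY.size   # 170 entries per 4 KB bucket

    def bucket_lookup(fd, bucket_no, key_hash):
        # Seek #1: read the whole bucket in a single page-sized pread.
        page = os.pread(fd, PAGE_SIZE, bucket_no * PAGE_SIZE)
        for i in range(PER_BUCKET):
            h, file_id, size, offset = ENTRY.unpack_from(page, i * ENTRY.size)
            if h == key_hash:       # (a real bucket would tag empty slots)
                # Seek #2: the caller reads `size` bytes at `offset` in
                # data file `file_id` to fetch the actual value.
                return file_id, size, offset
        return None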

The downside is that the index is still an in-memory hash table, so it only multiplies the number of storable keys by (values per bucket). Perhaps dropping the index to disk as well and taking advantage of the FS cache over it could work?
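
Back of the envelope, using the 24-byte entries from the sketch above: ~170 keys per 4 KB bucket, so a fully split 2^24-slot directory (128 MB of 8-byte pointers in RAM) addresses up to 170 * 2^24, or about 2.8 billion keys -- the scale I was worrying about at the top.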

--Kyle Kingsbury
