We just had an enlightening experience that I'd like to share regarding 
"choosing keys" ...

In one of the systems we're currently prototyping, we're moving a searchable 
audit log to Riak, and we want to size it to handle a sustained throughput of 
500 writes/second (N=3), keeping the data around for 5 years (that's ~45 billion 
keys x N=3).  The problem is that at this rate there is no chance that we can 
keep all the keys in memory, so bitcask was not really an option.  Well, 
yes ... bitcask could handle it by just adding extra machines, but in our case 
it is also OK if GETs are relatively slow.

Enter innostore.  It's slower than bitcask, but it lets you define how much 
memory to allocate for the in-memory representation of keys/values, which lets 
us put much more data onto our hardware.
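
For reference, the knob we're tuning is innostore's buffer pool.  A rough 
app.config excerpt (the paths are made up, and you should double-check the 
parameter names against the innostore docs):

    {innostore, [
        {data_home_dir,      "/var/lib/riak/innodb"},  %% data files
        {log_group_home_dir, "/var/lib/riak/innodb"},  %% InnoDB log files
        {buffer_pool_size,   268435456}  %% RAM for cached pages, in bytes (256 MB here)
    ]}.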

We were using 160-bit random keys (easily safe enough to avoid the "birthday 
problem"), but throughput was not good enough: it fell dramatically to ~200 
writes/sec on our 4-node Mac mini dev cluster once we ran out of memory after 
only ~10M key/values.
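
For what it's worth, the birthday bound really is a non-issue at this scale 
(my back-of-the-envelope arithmetic, not a measured number): with n ~ 4.5e10 
keys of 160 random bits,

    P(any collision)  <~  n^2 / 2^161  ~  2e21 / 2.9e48  ~  7e-28

so collisions were never the problem; memory was.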

Innostore uses a B-tree, and we realized that it was really suffering from the 
random keys, because each insert then needs to do I/O on a random node of the 
B-tree.

So we changed the keys to be "<<timestamp>>:<<random-bits>>", i.e., so that 
successive writes have keys that are lexicographically close.  The random bits 
are there to make the chance of a collision small enough.
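
In Erlang terms the key construction looks roughly like this (a sketch, not 
our production code; the 80-bit suffix, hex encoding and zero-padded decimal 
timestamp are just illustrative choices):

    %% Build a key of the form <<"Timestamp:RandomHex">>, so that keys written
    %% close together in time also sort close together lexicographically.
    make_key() ->
        {Mega, Sec, Micro} = erlang:now(),
        Timestamp = (Mega * 1000000 + Sec) * 1000000 + Micro,
        Rand = crypto:rand_bytes(10),   %% 80 random bits against collisions
        iolist_to_binary(
          [io_lib:format("~17.10.0b", [Timestamp]), ":",
           [io_lib:format("~2.16.0b", [B]) || B <- binary_to_list(Rand)]]).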

Using such keys causes the underlying B-tree to only write to a few nodes at a 
time, and ideally innostore only needs to keep in memory the tree nodes on the 
path from the root down to the node currently being appended to.
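
For completeness, writing an audit entry with such a key from the Erlang 
protocol buffers client would look something like this (again just a sketch; 
the bucket name, host/port and payload are made up):

    Entry = <<"{\"user\":\"alice\",\"action\":\"login\"}">>,
    {ok, Pid} = riakc_pb_socket:start_link("127.0.0.1", 8087),
    Obj = riakc_obj:new(<<"audit">>, make_key(), Entry),
    ok = riakc_pb_socket:put(Pid, Obj).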

Now our little play-cluster of 4 Mac minis can handle a sustained 2000 
writes/sec, or 10x the throughput of our first setup.  All ... just by choosing 
some better keys.

Cheers,

Kresten
