> 
> Unfortunately, even if additional nodes yield linear performance
> gains, the m/r overhead seems very large -- if I'm getting 1.5 seconds
> to process 1,000 items on one node, it seems apparent that I should
> get roughly 1.5 seconds to process 3,000 items on 3 nodes, which
> still is awfully slow.
> 
> Do you know how Riak compares to HBase, MongoDB or Cassandra for large
> dataset processing and analysis with m/r, when talking hundreds of
> millions, or even billions of keys? It would seem that key traversal
> performance would prevent Riak from competing in that space. Maybe
> you could do something with Riak Search, but I'm not sure if it would
> be comparable.

To be fair, you can't do a microbenchmark and then try to extrapolate it to 
large datasets; things change at scale. Also, key-listing has been a known 
limitation of Riak for a long time, and one we have been quite vocal about. 
There have been improvements recently, but it's still an O(N) computation where 
N is the total number of keys stored in the cluster. Therefore, it's important 
to structure your data such that you limit the use of key lists. Compare 
performance after you have done that, and run your benchmark on something other 
than a single node (4 or more in a cluster is best), with a dataset that 
approximates the target size.
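
As a rough sketch of what limiting key lists can look like in practice, here 
are two MapReduce job bodies of the kind you would POST as JSON to Riak's HTTP 
/mapred endpoint. The bucket name "orders", the date-style keys, and the helper 
mapred_job are hypothetical, made up for illustration; Riak.mapValuesJson is, 
as far as I know, one of the built-in JavaScript map functions. The first job 
names only a bucket, so the cluster has to list keys before it can map anything. 
The second names its bucket/key pairs explicitly, so no key listing is needed.

```python
import json


def mapred_job(inputs):
    """Build a MapReduce job body (hypothetical helper, nothing is sent)."""
    return {
        "inputs": inputs,
        "query": [
            # Single map phase using a built-in JavaScript function.
            {"map": {"language": "javascript", "name": "Riak.mapValuesJson"}}
        ],
    }


# Whole-bucket input: forces a key listing, which is O(N) in the total
# number of keys stored in the cluster.
full_scan = mapred_job("orders")

# Explicit [bucket, key] inputs: the job touches only the keys it names.
# This assumes keys are predictable enough to compute up front.
targeted = mapred_job([["orders", "2011-03-01"], ["orders", "2011-03-02"]])

print(json.dumps(targeted, indent=2))
```

Whether the second form is available to you depends on being able to derive 
the keys from something you already know (a date, a user id, a link from 
another object), which is why data layout matters so much here.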

Sean Cribbs <s...@basho.com>
Developer Advocate
Basho Technologies, Inc.
http://basho.com/


_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
