I'm experimenting with a test dataset to gauge whether Riak is suitable for a particular app. My real dataset has millions of records, but I'm testing with just a thousand items, and unfortunately, I am getting horrible performance -- so horrible it can't possibly be right. What am I doing wrong?
My environment:

* Riak 0.14 with default config
* Sean Cribbs's Ruby client (riak-client)
* Mac OS X Snow Leopard
* Ruby 1.9.2
* Erlang R14B01 from MacPorts

I am testing with a single node on my MacBook, which should be enough for just a thousand key/value pairs. The tests run against an initially empty database, from a single Ruby app, and each test has been run at least 10 times consecutively to eliminate outliers and let the caches warm up. Here are some numbers:

* 9.6 seconds to store 1,000 items. They are loaded from a text file as JSON data; parsing/processing accounts for about 0.8 s of that, the rest is Riak. In JSON form the items total 570 KB, and the resulting Bitcask data directory is 3.9 MB.
* 0.3 seconds to list all keys in the bucket [1].
* 1.8 seconds to list all keys and then fetch each object [2].
* 1.5 seconds to run a very simple map/reduce query [3].

Here's something else that is weird. I repeated the steps above on a new, empty bucket, again with just 1,000 items, but this time after loading 1.5 million items into a separate, empty bucket. The numbers now look very odd:

* 4.5 seconds to list all keys.
* 6.5 seconds to list + fetch.
* 5.1 seconds to run the map/reduce query.

Why are operations on the small bucket suddenly worse in the presence of a separate, large bucket? Surely the key spaces are completely separate? Even listing keys or querying an *empty* bucket takes several seconds in this scenario.

So: are these timings appropriate for such a tiny dataset, and if not, what could I be doing wrong? I'm new to Riak, and the map/reduce query may not be optimally expressed, so perhaps that part could be fixed. Even so, storage and key-listing performance seems off by perhaps an order of magnitude. I have confirmed the problem on an Amazon EC2 instance running Ubuntu Maverick, where performance was in fact considerably worse.

[1] Just looping over bucket.keys.

[2] Basically:

    bucket.keys { |keys| keys.each { |key| bucket.get(key) } }

[3] The query code follows.
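(For context on the 9.6-second store figure, the load step is essentially the sketch below. The helper names, the inline sample lines, and the choice of "path" as the key are illustrative, not my exact script; the store calls follow the riak-client API.)

```ruby
require 'json'

# Parse one JSON item per line into Ruby hashes (pure Ruby; no Riak involved).
def parse_items(lines)
  lines.map { |line| JSON.parse(line) }
end

# Store each item under a key derived from its "path" field.
# NOTE: keying by "path" is an assumption for this sketch. The calls use
# the riak-client API (bucket.new / data= / store) and need a running
# Riak node, so this method is defined here but not invoked.
def store_items(bucket, items)
  items.each do |item|
    obj = bucket.new(item["path"])
    obj.data = item
    obj.store
  end
end

items = parse_items(['{"path":"/a","n":1}', '{"path":"/b","n":2}'])
# items.first => {"path"=>"/a", "n"=>1}
```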
Each stored item is a JSON hash from which a key ("path") is mapped; the reduce phase then aggregates the counts of each path.

    mr = Riak::MapReduce.new(client)
    mr.add("test")
    mr.map <<-end, :keep => false
      function(v) {
        var paths = [];
        var entry = Riak.mapValuesJson(v)[0];
        var out = {};
        out[entry.path] = 1;
        paths.push(out);
        return paths;
      }
    end
    mr.reduce <<-end.strip, :keep => true
      function(values) {
        var result = {};
        for (var i = 0; i < values.length; i++) {
          var table = values[i];
          for (var k in table) {
            var count = table[k];
            if (result[k]) {
              result[k] += count;
            } else {
              result[k] = count;
            }
          }
        }
        return [result];
      }
    end
    results = mr.run

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
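P.S. In case the JavaScript obscures the intent: the aggregation the query computes is equivalent to this plain-Ruby sketch (`count_paths` is just an illustrative name, applied to the decoded JSON items):

```ruby
# Plain-Ruby equivalent of the map + reduce phases above: given the
# decoded JSON items, count how many times each "path" value occurs.
def count_paths(items)
  items.each_with_object(Hash.new(0)) do |item, counts|
    counts[item["path"]] += 1
  end
end

count_paths([{"path" => "/a"}, {"path" => "/a"}, {"path" => "/b"}])
# => {"/a" => 2, "/b" => 1}
```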