Re: Performance issues with small dataset

Alexander Sicular Wed, 12 Jan 2011 18:08:30 -0800

Item number one: if you are using stock Riak then you are also using the stock n_val (number of replicas) of 3. This means that your 1,000 k/v writes are actually 3,000 items written to disk.
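
To put numbers on the replica point, here is a minimal sketch of the arithmetic. The helper name replica_writes is made up for illustration; the figures (1,000 writes, n_val of 3) are the ones from this thread.

```ruby
# Hypothetical helper: how many objects actually reach disk for a given
# number of client writes, assuming every write is stored n_val times.
def replica_writes(client_writes, n_val = 3)
  client_writes * n_val
end

puts replica_writes(1_000)     # 3000 with the stock n_val of 3
puts replica_writes(1_000, 1)  # 1000 if the bucket kept a single replica
```

On a single test node all replicas land on the same disk, so the whole multiplier is paid locally.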

Next, Riak traverses the entire key space when doing an m/r over a bucket. You must either explicitly provide bucket/key pairs to an m/r or explore the new key filtering provided in the 0.14 release. Listing keys is a very costly operation and only increases in cost as your number of keys grows.
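
As a rough sketch of the two alternatives, the snippet below builds the "inputs" part of an m/r job as plain Ruby data and prints it as JSON. The helper names and the keys item1/item2 are assumptions for illustration, and the shapes follow my understanding of the JSON job format (a list of [bucket, key] pairs, or a bucket plus "key_filters"); check them against the docs for your release.

```ruby
require 'json'

# Hypothetical helper: explicit [bucket, key] pairs as m/r inputs.
def explicit_inputs(bucket, keys)
  keys.map { |key| [bucket, key] }
end

# Hypothetical helper: whole-bucket input narrowed by key filters.
def filtered_inputs(bucket, filters)
  { "bucket" => bucket, "key_filters" => filters }
end

# A map phase that uses the built-in named function from the original post.
MAP_PHASE = {
  "map" => { "language" => "javascript", "name" => "Riak.mapValuesJson", "keep" => true }
}

# item1/item2 and the "item" prefix are made-up keys for this example.
JOB_BY_KEYS = {
  "inputs" => explicit_inputs("test", %w[item1 item2]),
  "query"  => [MAP_PHASE]
}
JOB_BY_FILTER = {
  "inputs" => filtered_inputs("test", [["starts_with", "item"]]),
  "query"  => [MAP_PHASE]
}

puts JSON.generate(JOB_BY_KEYS)
puts JSON.generate(JOB_BY_FILTER)
```

Passing only the bucket name as input is what triggers the full key listing; both forms above are ways to avoid that.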

There are more caveats but I'll end with three. For any performance-critical system you must use the protocol buffers interface and you must juggle connections. Additionally, anonymous JavaScript functions have a penalty associated with them. Lastly, you should also move from JavaScript m/r functions to Erlang. There is a performance impedance when pushing JSON from the native Erlang interface into the JavaScript VM.
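
For illustration, here is how the three kinds of map phase differ in the JSON job description, written as plain Ruby data. This is a sketch, not output from a real cluster: the anonymous function body is made up, while Riak.mapValuesJson and riak_kv_mapreduce:map_object_value are built-ins as far as I know.

```ruby
require 'json'

# Anonymous JavaScript: the source is shipped with every query.
ANONYMOUS_JS = {
  "map" => { "language" => "javascript",
             "source"   => "function(v) { return [v.key]; }" }
}

# Named JavaScript: refers to a function already loaded in the JS VM.
NAMED_JS = {
  "map" => { "language" => "javascript", "name" => "Riak.mapValuesJson" }
}

# Erlang: module/function pair, no JSON hand-off to the JavaScript VM.
ERLANG = {
  "map" => { "language" => "erlang",
             "module"   => "riak_kv_mapreduce",
             "function" => "map_object_value" }
}

[ANONYMOUS_JS, NAMED_JS, ERLANG].each { |phase| puts JSON.generate(phase) }
```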

Riak has many benefits but bleeding-edge single-node performance is not one of them. Predictable, scalable units of performance per node throughout a cluster is.


Best,
Alexander

@siculars on twitter
http://siculars.posterous.com

Sent from my iPhone

On Jan 12, 2011, at 20:33, Alexander Staubo <li...@purefiction.net> wrote:

I'm experimenting with a test dataset to gauge whether Riak is
suitable for a particular app. My real dataset has millions of
records, but I'm testing with just a thousand items, and
unfortunately, I am getting horrible performance -- so horrible it
can't possibly be right. What am I doing wrong?

My environment:

* Riak 0.14 with default config
* Sean Cribbs's Ruby client
* MacOS X Snow Leopard
* Ruby 1.9.2
* Erlang R14B01 from MacPorts

I am testing with a single node on my MacBook, which should be enough
for just a thousand key/value-pairs. These tests are run on an
initially empty database, from a single Ruby app. Each test has been
run at least 10 times consecutively to eliminate outliers and ensure
optimal cache fill.

Here are some numbers:

* 9.6 seconds to store 1,000 items. They are loaded from a text file
as JSON data. Parsing/processing overhead is about 0.8 s, the rest is
Riak. In JSON format, the items total 570 KB. The resultant Bitcask
data directory is 3.9 MB.
* 0.3 seconds to list all keys in the bucket [1].
* 1.8 seconds to list all keys and then fetch each object [2].
* 1.5 seconds to run a very simple map/reduce query [3].

Here's something else that is weird. I repeated the steps above on a
new, empty bucket, again using just 1,000 items, but after loading 1.5
million items into a separate, empty bucket. The numbers now are very
odd:

* 4.5 seconds to list all keys.
* 6.5 seconds to list + fetch.
* 5.1 seconds to run map/reduce query.

Why are operations on the small bucket suddenly worse in the presence
of a separate, large bucket? Surely the key spaces are completely
separate? Even listing keys or querying on an *empty* bucket is taking
several seconds in this scenario.

So are these timings appropriate for such a tiny dataset, and if not,
what could I be doing wrong? I'm new to Riak and I'm not sure if the
map/reduce-query is optimally expressed, so maybe that could be fixed.
Even so, storage and key-querying performance seems off by perhaps an
order of magnitude.

I have confirmed the performance issue on an Amazon EC2 instance
running Ubuntu Maverick, where performance was in fact considerably
worse.

[1] Just looping over bucket.keys.

[2] Basically: bucket.keys { |keys| keys.each { |key| bucket.get(key) } }


[3] Here's the query code. Each stored item is a JSON hash from which
a key ("path") is mapped, then reduced to aggregate the counts of each
path.

     mr = Riak::MapReduce.new(client)
     mr.add("test")
     mr.map <<-end, :keep => false
       function(v) {
         var paths = [];
         var entry = Riak.mapValuesJson(v)[0];
         var out = {};
         out[entry.path] = 1;
         paths.push(out);
         return paths;
       }
     end
     mr.reduce <<-end.strip, :keep => true
       function(values) {
         var result = {};
         for (var i = 0; i < values.length; i++) {
           var table = values[i];
           for (var k in table) {
             var count = table[k];
             if (result[k]) {
               result[k] += count;
             } else {
               result[k] = count;
             }
           }
         }
         return [result];
       }
     end
     results = mr.run
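
To make the intent of the query concrete, here is a small local sketch in plain Ruby of what the map and reduce steps compute, with no Riak involved. The method names and the three sample documents are made up for illustration.

```ruby
require 'json'

# Mirrors the map function: one {path => 1} table per stored item.
def map_path(json_value)
  entry = JSON.parse(json_value)
  [{ entry["path"] => 1 }]
end

# Mirrors the reduce function: merge the tables, summing counts per path.
def reduce_counts(values)
  result = Hash.new(0)
  values.each do |table|
    table.each { |path, count| result[path] += count }
  end
  [result]
end

# Made-up sample items standing in for the stored JSON objects.
docs = ['{"path": "/a"}', '{"path": "/b"}', '{"path": "/a"}']
mapped = docs.flat_map { |doc| map_path(doc) }
p reduce_counts(mapped)

# Feeding a reduce result back in with more mapped values still gives the
# right totals, which matters because reduce may be applied more than once.
p reduce_counts(reduce_counts(mapped) + map_path('{"path": "/b"}'))
```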

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

