I'm experimenting with a test dataset to gauge whether Riak is
suitable for a particular app. My real dataset has millions of
records, but I'm testing with just a thousand items, and
unfortunately, I am getting horrible performance -- so horrible it
can't possibly be right. What am I doing wrong?

My environment:

* Riak 0.14 with default config
* Sean Cribbs's Ruby client (riak-client)
* Mac OS X Snow Leopard
* Ruby 1.9.2
* Erlang R14B01 from MacPorts

I am testing with a single node on my MacBook, which should be plenty
for just a thousand key/value pairs. The tests run against an
initially empty database, from a single Ruby process. Each test has
been run at least 10 times consecutively to rule out outliers and
ensure the caches are warm.

Here are some numbers:

* 9.6 seconds to store 1,000 items, loaded from a text file as JSON.
Parsing/processing overhead accounts for about 0.8 s; the rest is
Riak. In JSON form the items total 570 KB; the resulting Bitcask
data directory is 3.9 MB.
* 0.3 seconds to list all keys in the bucket [1].
* 1.8 seconds to list all keys and then fetch each object [2].
* 1.5 seconds to run a very simple map/reduce query [3].
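For reference, my load/store loop is roughly the following sketch. The
one-JSON-object-per-line file format and the line-number key scheme are
assumptions for illustration; with riak-client, `Bucket#new` builds an
RObject that is persisted with `store`:

```ruby
require 'json'

# Sketch of the load/store loop. Assumes one JSON object per line and
# uses the line number as the key -- both assumptions, adjust to taste.
def store_items(io, bucket)
  io.each_line.each_with_index do |line, i|
    item = JSON.parse(line)
    obj = bucket.new(i.to_s)   # riak-client: Bucket#new builds an RObject
    obj.data = item
    obj.store
  end
end

# Against a live node this would be called as:
#   client = Riak::Client.new
#   store_items(File.open("items.json"), client.bucket("test"))
```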

Here's something else that is weird. I repeated the steps above on a
new, empty bucket, again with just 1,000 items -- but this time after
loading 1.5 million items into a separate bucket. The numbers are now
much worse:

* 4.5 seconds to list all keys.
* 6.5 seconds to list + fetch.
* 5.1 seconds to run map/reduce query.

Why are operations on the small bucket suddenly worse in the presence
of a separate, large bucket? Surely the key spaces are completely
separate? Even listing keys or querying on an *empty* bucket is taking
several seconds in this scenario.

So: are these timings reasonable for such a tiny dataset, and if not,
what could I be doing wrong? I'm new to Riak, and I'm not sure the
map/reduce query is optimally expressed, so perhaps that could be
improved. Even so, storage and key-listing performance seems off by
perhaps an order of magnitude.

I have confirmed the performance issue on an Amazon EC2 instance
running Ubuntu Maverick, where performance was in fact considerably
worse.

[1] Just looping over bucket.keys.

[2] Basically: bucket.keys { |keys| keys.each { |key| bucket.get(key) } }

[3] Here's the query code. Each stored item is a JSON hash; the map
phase emits a one-entry {path: 1} object per item, and the reduce
phase aggregates the counts per path.
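In plain Ruby terms, with some sample data inlined, the query computes
the equivalent of:

```ruby
# What the map/reduce query computes, in plain Ruby: map each item to
# a one-entry {path => 1} hash, then merge (reduce) the partial
# tallies into a single count-per-path hash.
items = [{ "path" => "/a" }, { "path" => "/b" }, { "path" => "/a" }]

mapped = items.map { |entry| { entry["path"] => 1 } }   # map phase
result = mapped.reduce({}) do |acc, partial|            # reduce phase
  partial.each { |path, n| acc[path] = (acc[path] || 0) + n }
  acc
end
# result => { "/a" => 2, "/b" => 1 }
```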

      mr = Riak::MapReduce.new(client)
      mr.add("test")
      # Map phase: emit a {path: 1} object for each stored item.
      mr.map <<-end.strip, :keep => false
        function(v) {
          var entry = Riak.mapValuesJson(v)[0];
          var out = {};
          out[entry.path] = 1;
          return [out];
        }
      end
      # Reduce phase: merge the partial {path: count} tables.
      mr.reduce <<-end.strip, :keep => true
        function(values) {
          var result = {};
          for (var i = 0; i < values.length; i++) {
            var table = values[i];
            for (var k in table) {
              result[k] = (result[k] || 0) + table[k];
            }
          }
          return [result];
        }
      end
      results = mr.run

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
