Hi all,

We have been using Riak for a few months now (we started on 0.14.0 and recently upgraded to 0.14.2). Development of our app has been going well, and I am now integrating my code with a larger system. Testing of the overall read/write performance of our cluster looks good as well.
I am now starting to dive further into MapReduce queries, and unlike the regular reads and writes, which are very fast, MapReduce performance seems to get worse as our data set grows. The query I am using to test MapReduce speed and get a key count is:

    map = function (v) { return [1]; }
    reduce = Riak.reduceSum

That query takes 138 seconds on a bucket with 50,000 keys, and around 20 seconds on a bucket with 108 keys. Do these query times seem appropriate for MapReduce?

I'll try to give an overall picture of how we currently use Riak, and maybe someone can say whether the performance of our MapReduce operations is on par, or whether there are things I could tweak to bring the query times down a bit.

The system sends data to Riak at a fairly fast pace, and we need to keep all incoming data for 30 minutes so that we can examine the data and retrieve any individual key. After 30 minutes we can aggregate messages into groups to reduce the overall number of keys and amount of data. We currently have an 'incoming' bucket where keys are written at a rate of around 20 per second. An archiving thread periodically checks for keys older than 30 minutes; if it finds any, it removes them from the 'incoming' bucket and aggregates them into an 'archive' bucket for the given hour. As you can imagine, this causes the bitcask files to fragment and grow fairly large, but it seems like the best way to maintain some granularity in the data without being forced to keep every single data point that flows into the system. It also gives us a predictable growth rate for the 'archive' bucket even if the incoming rate increases beyond 20 per second.

One thing I was wondering about, and planning to try, is lowering the bitcask merge threshold and trigger values to keep the files a little smaller, which might help MapReduce performance some.
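For reference, the knobs I mean are the bitcask settings in app.config, something along these lines (the values here are just illustrative guesses to show the direction of the change, not tested recommendations):

```erlang
%% app.config, bitcask section -- values are illustrative only
{bitcask, [
    {data_root, "/var/lib/riak/bitcask"},
    {max_file_size, 1073741824},           %% smaller files, more frequent merges
    {frag_merge_trigger, 40},              %% merge when a file is 40% fragmented
    {dead_bytes_merge_trigger, 268435456}, %% ...or holds 256 MB of dead data
    {frag_threshold, 20},                  %% include files >= 20% fragmented
    {dead_bytes_threshold, 67108864}       %% ...or with >= 64 MB of dead data
]}
```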
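To be concrete about the key-count query above, here is roughly the job as it goes over the HTTP interface (POSTed to /mapred). This is just a small Python sketch that builds the JSON payload; the bucket name 'incoming' and the endpoint details are examples, not anything special about our setup:

```python
import json

# Rough sketch of the key-count MapReduce job as submitted over the
# HTTP API (POSTed to http://<node>:8098/mapred with
# Content-Type: application/json). Bucket name is just an example.
def key_count_job(bucket):
    return {
        # Full-bucket input: Riak has to list every key in the bucket
        # before the map phase even starts, which is the expensive part.
        "inputs": bucket,
        "query": [
            {"map": {"language": "javascript",
                     "source": "function (v) { return [1]; }"}},
            {"reduce": {"language": "javascript",
                        "name": "Riak.reduceSum"}},
        ],
    }

print(json.dumps(key_count_job("incoming")))
```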
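In rough Python terms, the archiving pass does something like the following. This is only a sketch: the dicts stand in for the actual 'incoming' and 'archive' buckets, and names like archive_pass and hour_bucket_key are made up for illustration:

```python
import time

ARCHIVE_AFTER = 30 * 60  # keep raw keys for 30 minutes

def hour_bucket_key(ts):
    # Made-up helper: the archive slot for the hour containing ts.
    return time.strftime("%Y-%m-%dT%H:00", time.gmtime(ts))

def archive_pass(incoming, archive, now):
    # incoming: key -> (timestamp, value); archive: hour-slot -> [values].
    # Both dicts stand in for Riak buckets in this sketch.
    for key, (ts, value) in list(incoming.items()):
        if now - ts > ARCHIVE_AFTER:
            slot = hour_bucket_key(ts)
            archive.setdefault(slot, []).append(value)  # aggregate per hour
            # Deleting leaves dead entries in the bitcask files until a
            # merge runs, which is where the fragmentation comes from.
            del incoming[key]
```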
Or should that even matter, since I'm only looking through keys and those should be in memory anyway?

We currently have a 4-node cluster running Ubuntu 11.04 x64. Each Riak node has 8 GB of memory, and 'free -m' on the nodes reports around 4000 MB used and 4000 MB free on average. A Basho Bench run using the riakc_pb.config with get=1, update=2, and put=3 for 5 minutes looks good; here is the graph: http://tinypic.com/r/30tm977/7

So, does anyone have a similarly sized system where they use MapReduce? Or can anyone recommend performance tweaks that would help accelerate these queries?

Thank you,
-ryan
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com