Hi Dan,

Thanks for the explanation. I hadn't looked at the MapReduce wiki page
for a while, so I missed that point about link walking. Also, thanks for
the clarification that the reduce phase is performed on the initiating
node. I'm not using the reduce functionality yet, but it's an important
fact to consider.
Thank you again.

On Sun, Jun 27, 2010 at 8:28 PM, Dan Reverri <d...@basho.com> wrote:
> Hi Dmitry,
>
> Regarding map reduce query performance in a cluster, map phases are run in
> parallel so adding more machines to the cluster means more map functions can
> be run simultaneously. Reduce phases currently only run on the node that
> initiated the query so additional machines will not affect the performance
> of reduce phases. More information regarding map reduce is available on the
> wiki:
> https://wiki.basho.com/display/RIAK/MapReduce
>
> This portion of the wiki discusses how Riak spreads map reduce queries:
> https://wiki.basho.com/display/RIAK/MapReduce#MapReduce-HowRiakSpreadsProcessing
>
> Regarding links, they are a special type of map phase. A link phase collects
> the bucket/key values defined in the "Link" header of an object and passes
> them to the next phase. Link phases are accessible to the Erlang clients by
> defining them in a map reduce query. For example, the following can be run
> from "riak console":
>
> %% get a local client
> {ok,C} = riak:local_client().
> %% define a metadata dict with links
> MD = dict:store(<<"Links">>,
>     [{{<<"bucket">>,<<"key2">>},<<"tag">>}],dict:new()).
> RO = riak_object:new(<<"bucket">>,<<"key1">>,<<"value">>).
> RO1 = riak_object:update_metadata(RO, MD).
> %% save the object
> C:put(RO1,1).
> %% run a link phase in a map reduce query
> C:mapred([{<<"bucket">>,<<"key1">>}],[{link, <<"bucket">>, '_', true}]).
>
> The map reduce query returns: {ok,[[<<"bucket">>,<<"key2">>,<<"tag">>]]}
>
> * Note, the above query used '_' for the tag portion of the query to
> indicate that any tag is acceptable.
>
> Thank you,
> Dan
>
> Daniel Reverri
> Developer Advocate
> Basho Technologies, Inc.
> d...@basho.com
>
>
> On Sun, Jun 27, 2010 at 6:36 AM, Dmitry Demeshchuk <demeshc...@gmail.com>
> wrote:
>>
>> On Fri, Jun 25, 2010 at 8:25 PM, Ryan Tilder <rtil...@basho.com> wrote:
>> > Hi, Dmitry.
>> > There are some gaps in the information you included here that might
>> > help clarify what's going on so I'm going to just rattle off some
>> > questions for clarification.
>> > Is your test driver only making requests of a single EC2 instance? Or
>> > are you querying all 7 nodes directly in some sort of load
>> > distribution? If you aren't querying all 7 nodes directly, then you
>> > will likely see performance on par with a cluster with only a single
>> > "physical" node.
>>
>> I tried both ways: querying only one node and querying all the nodes.
>> The results were approximately the same. But as far as I understand,
>> for map-reduce queries that's the expected result, isn't it?
>>
>> > Are you certain that the 7 nodes are communicating with each other? The
>> > output of the "riak-admin status" command should list the nodes in the
>> > "ring_members" field.
>>
>> Yes, sure.
>>
>> > Are the "documents" a separate key with Riak's built in links to the
>> > "entities" or are they keys with a data blob that refer to the
>> > entities?[1]
>> > If the latter, have you
>> > read http://blog.basho.com/2010/02/24/link-walking-by-example/ ?
>>
>> To simplify, the document data (I mean, the value in the Riak database)
>> had a structure like this:
>>
>> [{entities, [123, 456, 745, 2352, 235 | ...]}].
>>
>> I actually used timestamps in microseconds for IDs but that doesn't
>> really matter.
>> And, regarding your last question, documents and entities were stored
>> in different buckets.
>>
>> What about links? Should they give better speed in that case? Also,
>> neither the Erlang native API (I mean the riak_client module) nor the
>> Erlang PBC client seems to have link-walking functions like the REST API.
>>
>>
>> > It's also important for me to note that EC2 instances do not necessarily
>> > have the same characteristics of actual physical hardware when it comes
>> > to preventing resource contention.
>> > Since EC2 instances are virtualized, you have no idea what other load
>> > the physical host of a given instance may be under. As a result it is
>> > possible to have a Riak instance running on the same hardware as
>> > another IO and CPU intensive instance without your knowledge, impeding
>> > each other to a certain degree. We've had a number of users complain of
>> > performance problems with Riak clusters running on EC2 at various
>> > times. From my personal and anecdotal experience, EC2 seems to be
>> > pretty heavily oversubscribed much of the time which leads to
>> > intermittent performance issues for all kinds of applications.
>> > All of that is just a long winded way of saying: don't expect shared
>> > virtualized resources to provide the same performance as dedicated
>> > physical hardware. But you should still see at least somewhat better
>> > performance than you're seeing now if your testing harness is testing
>> > properly.
>>
>> Sure, I understand that. But I expected at least a bit better performance.
>>
>> Anyway, the day before yesterday I ran some tests using basho_bench.
>> These tests cheered me up a bit :)
>> Here's the link to the results:
>>
>> http://demmonoid.livejournal.com/4098.html
>>
>> Please let me know if you want me to add or correct any links to your
>> resources or add any more information about the tests.
>>
>> > --Ryan
>> > 1. I'm not certain if you're saying that the documents are stored in a
>> > separate bucket from the entities in the same Riak cluster or a separate
>> > Riak cluster entirely.
>> > On Fri, Jun 25, 2010 at 12:02 AM, Dmitry Demeshchuk
>> > <demeshc...@gmail.com>
>> > wrote:
>> >>
>> >> Greetings.
>> >>
>> >> I tried running Riak with bitcask backend on 7 Amazon EC2 standard
>> >> large instances (7.5 GB RAM, 4 EC2 CPU units) and performed some
>> >> tests.
>> >> For comparison, I built up the following Riak clusters:
>> >>
>> >> 7 physical nodes ring
>> >> 1 physical node ring (on one of the 7 instances, but I ran the tests
>> >> separately so the rings wouldn't interfere with each other)
>> >> 1 physical node ring on an extra large instance (15 GB RAM, 8 EC2 CPU
>> >> units)
>> >>
>> >> and ran a couple of tests with putting and getting data using the Riak
>> >> native Erlang API (not PBC).
>> >>
>> >> I had 2 buckets. The first one, called "entities", held small values
>> >> (about 1 KB on average) but a lot of them (several million). The
>> >> second one, called "documents", held lists of keys from the first
>> >> one. So, every document consists of a lot of entities (I used 100
>> >> and 1000 for my tests), and the approximate size of every document
>> >> was either 100 KB or 1 MB.
>> >>
>> >> So, I performed tests of putting documents and entities into the
>> >> database and then obtaining them. I tried to perform reads and writes
>> >> using 10 and 100 concurrent Erlang processes (well, 100 was generally
>> >> too much as I ran out of CPU), first from only one machine and then
>> >> from 2 and 3 machines at the same time (for the 7-node ring). Of
>> >> course, the entities were obtained using map-reduce.
>> >>
>> >> The first weird thing was that even with 10 concurrent reads and
>> >> writes the performance didn't differ across the three clusters. Okay,
>> >> 1 large and 1 extra large node don't differ so much, but the 7 nodes
>> >> should have given me some performance gain, shouldn't they?
>> >>
>> >> The second thing was that the average read time for one document with
>> >> 1000 entities was about 5 seconds, and again, the number of machines
>> >> in the cluster didn't affect the result. I guess I just ran into the
>> >> performance limit of the instance that sent all the map-reduce
>> >> requests and then collected the replies, because when I ran tests on
>> >> the other 2 instances, all three had the same performance.
>> >>
>> >> The other strange thing was that during data writes the nodes were
>> >> not IO-loaded most of the time. If it had been a single-stream write,
>> >> that would be obvious. But there were 10 and then 20 and 30
>> >> simultaneous writing processes!
>> >>
>> >>
>> >> Unfortunately I cannot provide the detailed results now, they are
>> >> pretty messed up. I'm going to use basho_bench to make good graphs and
>> >> tables of these tests.
>> >>
>> >> Any advice for future tests or any explanations for such strange
>> >> performance?
>> >>
>> >> Thank you in advance and sorry for the slightly messy e-mail.
>> >>
>> >> --
>> >> Best regards,
>> >> Dmitry Demeshchuk
>> >>
>> >> _______________________________________________
>> >> riak-users mailing list
>> >> riak-users@lists.basho.com
>> >> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

--
Best regards,
Dmitry Demeshchuk