Re: Riak and Amazon EC2
On Fri, Jun 25, 2010 at 8:25 PM, Ryan Tilder wrote:
> Hi, Dmitry. There are some gaps in the information you included here that
> might help clarify what's going on so I'm going to just rattle off some
> questions for clarification.
>
> Is your test driver only making requests of a single EC2 instance? Or are
> you querying all 7 nodes directly in some sort of load distribution? If you
> aren't querying all 7 nodes directly, then you will likely see performance
> on par with a cluster with only a single "physical" node.

I tried both ways: querying only one node and querying all the nodes.
The results were approximately the same. But as far as I understand,
for map-reduce queries that's the expected result, isn't it?

> Are you certain that the 7 nodes are communicating with each other? The
> output of the "riak-admin status" command should list the nodes in the
> "ring_members" field.

Yes, sure.

> Are the "documents" a separate key with Riak's built in links to the
> "entities" or are they keys with a data blob that refer to the entities?[1]
> If the latter, have you
> read http://blog.basho.com/2010/02/24/link-walking-by-example/ ?

To simplify, the document data (I mean, the value in the Riak database)
had a structure like this:

[{entities, [123, 456, 745, 2352, 235 | ...]}].

I actually used timestamps in microseconds for Ids but that doesn't
really matter.
And, regarding your last question, documents and entities were stored
in different buckets.

What about links? Should they give better speed in that case? Also,
neither the Erlang native API (I mean the riak_client module) nor the
Erlang PBC seems to have link-walking functions like the REST API.

> It's also important for me to note that EC2 instances do not necessarily
> have the same characteristics as actual physical hardware when it comes to
> preventing resource contention. Since EC2 instances are virtualized, you
> have no idea what other load the physical host of a given instance may be
> under. As a result it is possible to have a Riak instance running on the
> same hardware as another IO and CPU intensive instance without your
> knowledge, impeding each other to a certain degree. We've had a number of
> users complain of performance problems with Riak clusters running on EC2 at
> various times. From my personal and anecdotal experience, EC2 seems to be
> pretty heavily oversubscribed much of the time which leads to intermittent
> performance issues for all kinds of applications.
>
> All of that is just a long winded way of saying: don't expect shared
> virtualized resources to provide the same performance as dedicated physical
> hardware. But you should still see at least somewhat better performance
> than you're seeing now if your testing harness is testing properly.

Sure, I understand that. But I expected at least a bit better performance.

Anyway, the day before yesterday I ran some tests using basho_bench.
These tests cheered me up a bit :)
Here's the link to the results:

http://demmonoid.livejournal.com/4098.html

Please let me know if you want me to add or correct any links to your
resources or add any more information about the tests.

> --Ryan
>
> 1. I'm not certain if you're saying that the documents are stored in a
> separate bucket from the entities in the same Riak cluster or a separate
> Riak cluster entirely.
>
> On Fri, Jun 25, 2010 at 12:02 AM, Dmitry Demeshchuk wrote:
>>
>> Greetings.
>>
>> I tried running Riak with the bitcask backend on 7 Amazon EC2 standard
>> large instances (7.5 GB RAM, 4 EC2 CPU units) and performed some
>> tests.
>> For comparison, I built up the following Riak clusters:
>>
>> 7 physical nodes ring
>> 1 physical node ring (on one of the 7 instances, but I ran the tests
>> separately so the rings won't mess with each other)
>> 1 physical node ring on an extra large instance (15 GB RAM, 8 EC2 CPU
>> units)
>>
>> and ran a couple of tests with putting and getting data using the Riak
>> native Erlang API (not PBC).
>>
>> I had 2 buckets, the first one having small (on average about 1KB)
>> values, but a lot of them (several million), called "entities",
>> and the second one having lists of keys from the first database,
>> called "documents". So, every document consists of a lot of entities
>> (I used 100 and 1000 for my tests). So, the approximate size of every
>> document was either 100KB or 1MB.
>>
>> So, I performed tests of putting documents and entities into the database
>> and then obtaining them. I tried to perform reads and writes using 10
>> and 100 concurrent Erlang processes (well, 100 was generally too much
>> as I ran out of CPU), first from only one machine and then from 2 and
>> 3 machines at the same time (for the 7-node ring). Of course, the
>> entities were obtained using map-reduce.
>>
>> The first weird thing was that even with 10 concurrent reads and
>> writes the performance didn't differ for all three clusters. Okay, 1
>> large and 1 extra large nodes don't differ so much but the 7 nodes
>> should have given me
A small note about basho_bench
Greetings.

I noticed that the pictures that R generates from basho_bench results
have one inconvenience. All the graphs with separate request latencies
(i.e. the second and third lines), at least all the ones I generated
myself or saw on the Internet, always have a lot of free space above the
curves. At the same time, small values such as the median and the mean
often sit at the very bottom of the graph, so it's sometimes hard to
read off the approximate value of a curve.

I don't know Rscript and ggplot at all, so I'm not sure whether adjusting
the zoom extents is easy in that case. But anyway, it would be a great
thing to have.

Thank you.

--
Best regards,
Dmitry Demeshchuk

___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
Re: Riak and Amazon EC2
Hi Dmitry,

Regarding map reduce query performance in a cluster, map phases are run in
parallel so adding more machines to the cluster means more map functions can
be run simultaneously. Reduce phases currently only run on the node that
initiated the query so additional machines will not affect the performance
of reduce phases. More information regarding map reduce is available on the
wiki:
https://wiki.basho.com/display/RIAK/MapReduce

This portion of the wiki discusses how Riak spreads map reduce queries:
https://wiki.basho.com/display/RIAK/MapReduce#MapReduce-HowRiakSpreadsProcessing

Regarding links, they are a special type of map phase. A link phase collects
the bucket/key values defined in the "Link" header of an object and passes
them to the next phase. Link phases are accessible to the Erlang clients by
defining them in a map reduce query. For example, the following can be run
from "riak console":

    %% get a local client
    {ok,C} = riak:local_client().

    %% define a metadata dict with links
    MD = dict:store(<<"Links">>,
                    [{{<<"bucket">>,<<"key2">>},<<"tag">>}],
                    dict:new()).
    RO = riak_object:new(<<"bucket">>,<<"key1">>,<<"value">>).
    RO1 = riak_object:update_metadata(RO, MD).

    %% save the object
    C:put(RO1,1).

    %% run a link phase in a map reduce query
    C:mapred([{<<"bucket">>,<<"key1">>}],[{link, <<"bucket">>, '_', true}]).

The map reduce query returns: {ok,[[<<"bucket">>,<<"key2">>,<<"tag">>]]}

* Note, the above query used '_' for the tag portion of the query to
indicate that any tag is acceptable.

Thank you,
Dan

Daniel Reverri
Developer Advocate
Basho Technologies, Inc.
d...@basho.com
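
Editor's sketch: the documents described earlier in the thread store entity ids as a plain list, [{entities, [...]}], while the link phase in the example above reads links from the <<"Links">> entry of an object's metadata. The following is a minimal illustration of converting one form into the other, not code from the thread. The module name doc_links, the function links_metadata/3, and the bucket and tag arguments are hypothetical names chosen for illustration. The code uses only the Erlang standard library and does not talk to Riak.

```erlang
-module(doc_links).
-export([links_metadata/3]).

%% Build a metadata dict whose <<"Links">> entry points at every entity
%% listed in a document value of the form [{entities, [Id, ...]}].
%% EntityBucket and Tag are illustrative parameters (assumptions), not
%% names taken from the thread.
links_metadata(DocValue, EntityBucket, Tag) ->
    Ids = proplists:get_value(entities, DocValue, []),
    Links = [{{EntityBucket, id_to_key(Id)}, Tag} || Id <- Ids],
    dict:store(<<"Links">>, Links, dict:new()).

%% The ids in the thread are integers (timestamps); the keys in the
%% example above are binaries, so convert each id to a binary key.
id_to_key(Id) when is_integer(Id) ->
    list_to_binary(integer_to_list(Id)).
```

For example, doc_links:links_metadata([{entities, [123, 456]}], <<"entities">>, <<"entity">>) yields a dict whose <<"Links">> entry holds one {{Bucket, Key}, Tag} tuple per entity id. Such a dict has the same shape as the MD value passed to riak_object:update_metadata/2 in the example above.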
Linked associations in Ripple
Hey Ruby+Riak users,

I've just pushed to ripple:master an initial implementation of
inter-Document associations via links, and I'm looking for feedback. The
implementation is much simpler than I expected it to be, but I'm aware
there are a number of special cases and expected behaviors that I haven't
covered (failing specs would be appreciated).

Here's what does work:

* Singular (One) associations, e.g. one :avatar
* Multiple (Many) associations, e.g. many :tasks
* Assignment of associated documents that have keys or have been saved
  already
* Loading of associated documents via link-walking

Known problems/missing features:

* Associated documents aren't saved automatically when assigned
* No validation that the assignment isn't given garbage (i.e. wrong type)
* No verification of the number of extant links in a "one" association
  (although new ones will get a single link)
* You can create linked associations on EmbeddedDocument classes.
  (don't do that!)
* No :through/:via associations
* Spec coverage is pretty low (2 examples).

I invite you to clone/update to the latest, try it out and give me
feedback. While this is gelling, I'll be adding by-key associations (easy)
and by-bucket associations (tricky).

Sean Cribbs
Developer Advocate
Basho Technologies, Inc.
http://basho.com/
Re: Riak and Amazon EC2
Hi Dan,

Thanks for the explanation. I hadn't looked at the MapReduce wiki page
for a while, so I missed that point about link walking.

Also, thanks for the clarification that the reduce phase is performed on
the initiating node. Though I haven't used the reduce functionality so
far, it's an important fact to consider.

Thank you again.
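
Editor's sketch: the point discussed in this thread, that map phases run in parallel while the reduce phase runs only on the node that initiated the query, can be illustrated with a toy model in plain Erlang. This is not how Riak is implemented. Each list in Partitions merely stands in for the data held by one node, and the module name mr_sketch and the function mapred/3 are hypothetical names chosen for illustration.

```erlang
-module(mr_sketch).
-export([mapred/3]).

%% Toy model only: one worker process per partition runs the map
%% function in parallel; the calling process (the "initiating node")
%% collects the results and runs the reduce function by itself.
mapred(Partitions, MapFun, ReduceFun) ->
    Coordinator = self(),
    Refs = [begin
                Ref = make_ref(),
                spawn(fun() ->
                              Coordinator ! {Ref, lists:map(MapFun, Part)}
                      end),
                Ref
            end || Part <- Partitions],
    %% Selective receive on each Ref keeps results in partition order.
    Mapped = lists:append([receive {Ref, Res} -> Res end || Ref <- Refs]),
    %% The reduce step is not distributed: adding partitions adds map
    %% workers but this call still runs in a single process.
    ReduceFun(Mapped).
```

For example, mr_sketch:mapred([[1,2],[3,4],[5]], fun(X) -> X * X end, fun lists:sum/1) squares the elements in three parallel workers and sums them in the caller, returning 55. Adding more partitions spreads the map work further, but the reduce cost on the caller stays the same.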