Re: Riak and Amazon EC2

Dan Reverri Sun, 27 Jun 2010 09:28:38 -0700

Hi Dmitry,

Regarding map reduce query performance in a cluster, map phases are run in
parallel so adding more machines to the cluster means more map functions can
be run simultaneously. Reduce phases currently only run on the node that
initiated the query so additional machines will not affect the performance
of reduce phases. More information regarding map reduce is available on the
wiki:
https://wiki.basho.com/display/RIAK/MapReduce


This portion of the wiki discusses how Riak spreads map reduce queries:
https://wiki.basho.com/display/RIAK/MapReduce#MapReduce-HowRiakSpreadsProcessing
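
To make the map/reduce split above concrete, here is a minimal sketch of a
two-phase query that can be tried from "riak console". The bucket and key
names are made up for illustration, and it assumes the riak_kv_mapreduce
module in your Riak version exports map_object_value and reduce_sort; check
your installation before relying on it.

```erlang
%% get a local client
{ok, C} = riak:local_client().

%% inputs are {Bucket, Key} pairs; these names are hypothetical
Inputs = [{<<"documents">>, <<"doc1">>},
          {<<"documents">>, <<"doc2">>}].

%% phase spec: {Type, FunTerm, Arg, Keep}
%% map phase: runs in parallel on the nodes that hold each input
%% reduce phase: runs only on the node that initiated the query
Query = [{map,    {modfun, riak_kv_mapreduce, map_object_value}, none, false},
         {reduce, {modfun, riak_kv_mapreduce, reduce_sort},      none, true}].

%% only the reduce phase result is returned (its last element is true)
C:mapred(Inputs, Query).
```

With a query shaped like this, adding nodes can speed up the map phase, but
the reduce phase still does all of its work on the initiating node.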


Regarding links, they are a special type of map phase. A link phase collects
the bucket/key values defined in the "Link" header of an object and passes
them to the next phase. Link phases are accessible to the Erlang clients by
defining them in a map reduce query. For example, the following can be run
from "riak console":

%% get a local client
{ok,C} = riak:local_client().

%% define a metadata dict with links
MD = dict:store(<<"Links">>,
                [{{<<"bucket">>,<<"key2">>},<<"tag">>}],
                dict:new()).
RO = riak_object:new(<<"bucket">>,<<"key1">>,<<"value">>).
RO1 = riak_object:update_metadata(RO, MD).

%% save the object
C:put(RO1,1).

%% run a link phase in a map reduce query
C:mapred([{<<"bucket">>,<<"key1">>}],[{link, <<"bucket">>, '_', true}]).

The map reduce query returns: {ok,[[<<"bucket">>,<<"key2">>,<<"tag">>]]}

* Note, the above query used '_' for the tag portion of the query to
indicate that any tag is acceptable.
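
As a follow-up sketch, a link phase can also match one specific tag and feed
its output into a map phase that fetches the linked objects' values. This
continues the console session above (it reuses C). The value stored under
key2 is made up for illustration, and the use of
riak_kv_mapreduce:map_object_value assumes your Riak version ships it.

```erlang
%% store the link target so the map phase has an object to read
RO2 = riak_object:new(<<"bucket">>,<<"key2">>,<<"linked value">>).
C:put(RO2,1).

%% follow only links tagged <<"tag">>, then map over the linked objects;
%% the link phase result is dropped (false), the map result is kept (true)
C:mapred([{<<"bucket">>,<<"key1">>}],
         [{link, <<"bucket">>, <<"tag">>, false},
          {map, {modfun, riak_kv_mapreduce, map_object_value}, none, true}]).
```

This is roughly what link walking does over the REST API, expressed as a map
reduce query for the Erlang client.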


Thank you,
Dan

Daniel Reverri
Developer Advocate
Basho Technologies, Inc.
d...@basho.com


On Sun, Jun 27, 2010 at 6:36 AM, Dmitry Demeshchuk <demeshc...@gmail.com> wrote:

> On Fri, Jun 25, 2010 at 8:25 PM, Ryan Tilder <rtil...@basho.com> wrote:
> > Hi, Dmitry.  There are some gaps in the information you included here that
> > might help clarify what's going on so I'm going to just rattle off some
> > questions for clarification.
> > Is your test driver only making requests of a single EC2 instance?  Or are
> > you querying all 7 nodes directly in some sort of load distribution?  If you
> > aren't querying all 7 nodes directly, then you will likely see performance
> > on par with a cluster with only a single "physical" node.
>
> I tried both ways: querying only one node and querying all the nodes.
> The results were approximately the same. But as far as I understand,
> for map-reduce queries it's an expected result, isn't it?
>
> > Are you certain that the 7 nodes are communicating with each other?  The
> > output of the "riak-admin status" command should list the nodes in the
> > "ring_members" field.
>
> Yes, sure.
>
> > Are the "documents" a separate key with Riak's built in links to the
> > "entities" or are they keys with a data blob that refer to the entities?[1]
> > If the latter, have you read
> > http://blog.basho.com/2010/02/24/link-walking-by-example/ ?
>
> To simplify, the document data (I mean, the value in the Riak database) had
> a structure like this:
>
> [{entities, [123, 456, 745, 2352, 235 | ...]}].
>
> I actually used timestamps in microseconds for Ids but that doesn't
> really matter.
> And, regarding your last question, documents and entities were stored
> in different buckets.
>
> What about links? Should they give better speed in that case? Also,
> neither the native Erlang API (I mean the riak_client module) nor the
> Erlang PBC client seems to have link-walking functions like the REST API.
>
>
> > It's also important for me to note that EC2 instances do not necessarily
> > have the same characteristics of actual physical hardware when it comes to
> > preventing resource contention.  Since EC2 instances are virtualized, you
> > have no idea what other load the physical host of a given instance may be
> > under.  As a result it is possible to have a Riak instance running on the
> > same hardware as another IO and CPU intensive instance without your
> > knowledge, impeding each other to a certain degree.  We've had a number of
> > users complain of performance problems with Riak clusters running on EC2 at
> > various times.  From my personal and anecdotal experience, EC2 seems to be
> > pretty heavily oversubscribed much of the time which leads to intermittent
> > performance issues for all kinds of applications.
> > All of that is just a long winded way of saying: don't expect shared
> > virtualized resources to provide the same performance as dedicated physical
> > hardware.  But you should still see at least somewhat better performance
> > than you're seeing now if your testing harness is testing properly.
>
> Sure, I understand that. But I expected at least a bit better performance.
>
> Anyway, the day before yesterday I ran some tests using basho_bench.
> These tests cheered me up a bit :)
> Here's the link to the results:
>
> http://demmonoid.livejournal.com/4098.html
>
> Please let me know if you want me to add or correct any links to your
> resources or add any more information about the tests.
>
> > --Ryan
> > 1. I'm not certain if you're saying that the documents are stored in a
> > separate bucket from the entities in the same Riak cluster or a separate
> > Riak cluster entirely.
> > On Fri, Jun 25, 2010 at 12:02 AM, Dmitry Demeshchuk <demeshc...@gmail.com>
> > wrote:
> >>
> >> Greetings.
> >>
> >> I tried running Riak with bitcask backend on 7 Amazon EC2 standard
> >> large instances (7.5 GB RAM, 4 EC2 CPU units) and performed some
> >> tests.
> >> For comparison, I built up the following Riak clusters:
> >>
> >> 7 physical nodes ring
> >> 1 physical node ring (on one of the 7 instances, but I ran the tests
> >> separately so the rings won't mess with each other)
> >> 1 physical node ring on an extra large instance (15 GB RAM, 8 EC2 CPU
> >> units)
> >>
> >> and ran a couple of tests with putting and getting data using Riak
> >> native Erlang API (not PBC).
> >>
> >> I had 2 buckets. The first one, called "entities", held small values
> >> (about 1KB on average), but a lot of them (several million). The
> >> second one, called "documents", held lists of keys from the first
> >> bucket. So, every document consists of a lot of entities
> >> (I used 100 and 1000 for my tests), and the approximate size of every
> >> document was either 100KB or 1MB.
> >>
> >> So, I performed tests of putting documents and entities to database
> >> and then obtaining them. I tried to perform reads and writes using 10
> >> and 100 concurrent Erlang processes (well, 100 was generally too much
> >> as I ran out of CPU), first from only one machine and then from 2 and
> >> 3 machines at the same time (for the 7-nodes ring). Of course, the
> >> entities were obtained using map-reduce.
> >>
> >> The first weird thing was that even with 10 concurrent reads and
> >> writes the performance didn't differ between the three clusters. Okay,
> >> 1 large and 1 extra large node don't differ so much, but the 7 nodes
> >> should have given me some performance gain, shouldn't they?
> >>
> >> The second thing was that the average read time for one document with
> >> 1000 entities was about 5 seconds, and again, the number of machines
> >> in the cluster didn't affect the result. I guess I just stumbled upon
> >> the performance of the instance that sent all the map-reduce requests
> >> and then collected the replies because when I ran tests on the other 2
> >> instances, all three had the same performance.
> >>
> >> The other strange thing was that during data writes the nodes were
> >> not IO-loaded most of the time. If it were a single-stream write, that
> >> would be understandable. But there were 10 and then 20 and 30
> >> simultaneous writing processes!
> >>
> >>
> >> Unfortunately I cannot provide the detailed results now, they are
> >> pretty messed up. I'm going to use basho_bench to make good graphs and
> >> tables of these tests.
> >>
> >> Any advice for future tests or any explanations for such strange
> >> performance?
> >>
> >> Thank you in advance and sorry for a little messed up e-mail.
> >>
> >> --
> >> Best regards,
> >> Dmitry Demeshchuk
> >>
> >> _______________________________________________
> >> riak-users mailing list
> >> riak-users@lists.basho.com
> >> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> >
> >
>
>
>
> --
> Best regards,
> Dmitry Demeshchuk
