I didn't want you to think that you've been forgotten, but I've been swamped getting ready to head out of the country for 2 weeks on a company trip. You're in good hands with the list, though.
---
Jeremiah Peschka - Founder, Brent Ozar Unlimited
MCITP: SQL Server 2008, MVP
Cloudera Certified Developer for Apache Hadoop

On Tue, Feb 26, 2013 at 4:36 PM, Kevin Burton <rkevinbur...@charter.net> wrote:

> I got the same error on an AWS instance (m1.xlarge) when using the
> MapReduce version of list all keys.
>
> Query failed with Riak returned an error. Code '0'. Message:
> {"phase":0,"error":"[preflist_exhausted]","input":"{ok,{r_object,<<\"buyseasons-products\">>,<<\"00113023\">>,[{r_content,{dict,4,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[[<<\"content-type\">>,97,112,112,108,105,99,97,116,105,111,110,47,106,115,111,110],[<<\"X-Riak-VTag\">>,49,54,99,97,90,90,72,106,56,50,100,77,85,75,66,114,76,50,109,88,89,109]],[[<<\"index\">>,{<<\"active_bin\">>,<<\"InactiveDiscontinued\">>},{<<\"definition_bin\">>,<<\"Costume\">>},{<<\"department_bin\">>,<<\"Adult Costumes\">>},{<<\"...\">>,...}]],...}}},...}],...},...}","type":"forward_preflist","stack":"[]"}
> – CommunicationError
>
> The ‘Department’ MapReduce with an AWS instance also returned three
> ‘failed’ phases.
>
> Kevin
>
> From: riak-users [mailto:riak-users-boun...@lists.basho.com] On Behalf Of Jeremiah Peschka
> Sent: Tuesday, February 26, 2013 3:18 PM
> To: riak-users
> Subject: Re: MapReduce performance problem
>
> Responses inline.
>
> ---
> Jeremiah Peschka - Founder, Brent Ozar Unlimited
> MCITP: SQL Server 2008, MVP
> Cloudera Certified Developer for Apache Hadoop
>
> On Tue, Feb 26, 2013 at 12:26 PM, Kevin Burton <rkevinbur...@charter.net> wrote:
>
> Right. I know it is not ideal. I have been able to split the VM’s into
> groups. So 2 of the 4 are running on separate hardware. Anything more I
> just get the response ‘get real’.
> That being said I want to get the maximum
> performance of the limited resources that I have. I have a separate
> question for the group in trying to get basho_bench up and running (I get a
> long string of errors). What do you need to know more about my environment
> to “understand” it? I am new so I am probably asking the wrong questions so
> please tell me what you are missing that might help diagnose the problem.
>
> For troubleshooting any environment it's good to know relevant hardware
> details about CPU speed, core count, amount of RAM, disks, network cards,
> etc. A working basho_bench benchmark would help, too, because it will
> provide an indicator of how your environment will perform with Riak as
> opposed to how your business logic performs, as implemented, with Riak.
>
> For virtualization, it's also important to know how many other guests are
> on the host, whether there are any CPU, memory, or network reservations in
> place, and which version of virtualization you're running. Virtualization
> makes performance tuning more complex, but not impossible.
>
> I agree. I will use ListAllKeysFromIndex to get a list of keys for now.
> The only reason that I included the m/r code is because of the error. If I
> get another m/r job with similar output I need to know how to diagnose the
> problem. I was using JavaScript m/r because I kind of understand
> JavaScript. Is there a separate task to do Erlang m/r jobs?
>
> Erlang phases can be added to a MapReduceQuery using MapErlang and
> ReduceErlang.
>
> I assume that I will need to know Erlang. Any recommendations on how best
> to know what I need to know about Erlang to write a m/r job? But before I
> do that wouldn’t it be prudent to know that the source of the problem is
> indeed JavaScript? How would I pinpoint that?
>
> I think this has been answered on list.
> I'd search
> http://riak.markmail.org
>
> Someone from Basho can probably handle this better than my handwaving that
> using JavaScript involves an interpreter, type marshaling between Erlang
> and JavaScript, and won't multi-thread like Erlang will.
>
> These two m/r jobs are basically an example of using m/r that would be
> typical for our application. Just for sheer maintenance we wouldn’t want
> to go down the path of maintaining a counter for all the fields that we
> have. There could be departments, categories, celebrations, . . . Basically
> a lot of them. For all intents it is an ad hoc query. If that is one of
> the limitations then we will have to note it and see if coping with this
> limitation is too onerous.
>
> MR queries are going to scan all of your data on disk. If you have 5 nodes
> that can read at ~100 MB/s and you have 100GB of data, how long will it
> take for your ad hoc query to run?
>
> Riak Search/Lucene/Yokozuna will be better options for ad hoc workloads
> than MapReducing across the cluster.
>
> From: riak-users [mailto:riak-users-boun...@lists.basho.com] On Behalf Of Jeremiah Peschka
> Sent: Tuesday, February 26, 2013 1:33 PM
> To: riak-users
> Subject: Re: MapReduce performance problem
>
> Before you go troubleshooting performance problems, I'd focus on getting
> results out of basho_bench and getting a good understanding of your
> environment. If you're running 4 guests with 1 vCPU each on the same VM
> host with all guests sharing a single pool of disks, no amount of tuning
> will solve that problem. Without an understanding of the operating
> environment, we can't do much more than point at general best practices and
> say "these might help you, not sure, though."
>
> As far as your specifics - for the first query, if you're attempting to
> get a list of keys, I still recommend using ListAllKeysFromIndex(string
> Bucket).
> This will be pushed out as part of the IRiakClient interface in
> the next day or two, and I'm sure the gods of OOP won't kill you for using
> an actual RiakClient object instead of an IRiakClient interface between now
> and then. Sending those results back directly from riak_kv is going to be
> far faster than messing around with a JavaScript MapReduce job.
>
> Always keep in mind that MR jobs are not going to be the most efficient
> way to perform any kind of ad hoc querying - they're great for large scale
> data transformations but if you really want performance, you'll want to
> write Erlang MR jobs.
>
> If you need to maintain counts per department, a better approach will be
> persisting counters and maintaining those counts via some kind of
> caching/pre-aggregation mechanism, most likely outside of Riak because of
> eventual consistency guarantees. Alex Siculars will eventually show up and
> start chanting "use redis"; you'll be resistant at first, but his arguments
> make a lot of sense. Riak does some things very well, maintaining
> consistent counters isn't one of them... yet.
>
> ---
> Jeremiah Peschka - Founder, Brent Ozar Unlimited
> MCITP: SQL Server 2008, MVP
> Cloudera Certified Developer for Apache Hadoop
>
> On Tue, Feb 26, 2013 at 10:52 AM, Kevin Burton <rkevinbur...@charter.net> wrote:
>
> I have a simple CorrugatedIron client that makes the following request:
>
>     IRiakClient riakClient = cluster.CreateClient();
>     RiakBinIndexRangeInput bucketKeyInput =
>         new RiakBinIndexRangeInput(productBucketName, "$key", "00000000", "99999999");
>     RiakMapReduceQuery query = new RiakMapReduceQuery()
>         .Inputs(bucketKeyInput)
>         .MapJs(m => m.Name("Riak.mapValuesJson").Keep(true));
>     RiakResult<RiakMapReduceResult> result = riakClient.MapReduce(query);
>
> So as you can see this is a very basic range m/r query.
> But the result
> comes back as:
>
>     Riak returned an error. Code '0'. Message: timeout
>     CommunicationError
>
> Another type of m/r query I have
>
>     IRiakClient riakClient = cluster.CreateClient();
>     var query = new RiakMapReduceQuery()
>         .Inputs(productBucketName)
>         .MapJs(m => m.Source(@"function(v,d,a) {" +
>             "var p = JSON.parse(v.values[0].data);" +
>             "var r = [];" +
>             "d = escape(p.Department);" +
>             "if(d != '') {" +
>             "var o = {};" +
>             "o[d] = 1;" +
>             "r.push(o);" +
>             "}" +
>             "return r;" +
>             "}"))
>         .ReduceJs(m => m.Source(@"function(v,d,a) {" +
>             "var r = {};" +
>             "for(var i in v) {" +
>             " for(var w in v[i]) {" +
>             " if(w in r) r[w] += v[i][w];" +
>             " else r[w] = v[i][w];" +
>             " }" +
>             "}" +
>             "return [r];" +
>             "}")
>             .Keep(true));
>
> This returns but it takes far too long. I have about 60,000 items in my
> bucket and this takes about 50-60 seconds to execute. The results seem
> valid. For these types of m/r jobs what can I do on the server (or client)
> to help diagnose the problem? I have basic tools like iostat and top to
> give me data but some pointers on using the output of these tools might
> help.
>
> _______________________________________________
> riak-users mailing list
> riak-users@lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
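
One way to sanity-check the JavaScript phases quoted above, separately from the performance question, is to run them under Node.js against hand-made inputs. The sketch below is only that: it assumes the object handed to the map function has the shape { values: [{ data: "<json string>" }] } that the original function reads, and the fakeObject helper, the function names, and the sample departments are made up for illustration. The phase bodies are the ones from the thread, except that d is declared locally in the map phase. Since Riak may call a reduce function more than once, feeding it the output of earlier reduce calls, the sketch also checks that reducing partial results gives the same totals.

```javascript
// Map phase from the thread: emit {<escaped department>: 1} per object.
function mapDepartment(v, keyData, arg) {
  var p = JSON.parse(v.values[0].data);
  var r = [];
  var d = escape(p.Department);
  if (d != '') {
    var o = {};
    o[d] = 1;
    r.push(o);
  }
  return r;
}

// Reduce phase from the thread: sum the per-department counts.
function reduceCounts(v, arg) {
  var r = {};
  for (var i in v) {
    for (var w in v[i]) {
      if (w in r) r[w] += v[i][w];
      else r[w] = v[i][w];
    }
  }
  return [r];
}

// Hypothetical helper: builds an object shaped the way the map phase reads it.
function fakeObject(department) {
  return { values: [{ data: JSON.stringify({ Department: department }) }] };
}

// Made-up sample data, not taken from the real bucket.
var inputs = [
  fakeObject('Adult Costumes'),
  fakeObject('Adult Costumes'),
  fakeObject('Props')
];

// Run the map phase over every input and flatten the results.
var mapped = [];
inputs.forEach(function (obj) {
  mapped = mapped.concat(mapDepartment(obj, null, null));
});

// Single reduce over everything.
var result = reduceCounts(mapped, null);

// Re-reduce: reduce two partial batches, then reduce their outputs together.
var partial = reduceCounts(mapped.slice(0, 2), null)
  .concat(reduceCounts(mapped.slice(2), null));
var rereduced = reduceCounts(partial, null);

console.log(JSON.stringify(result));    // [{"Adult%20Costumes":2,"Props":1}]
console.log(JSON.stringify(rereduced)); // same totals as the single reduce
```

This only exercises the logic of the two functions; it says nothing about why the job takes 50-60 seconds on the cluster.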
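
For the scan-time question raised in the thread (5 nodes reading at ~100 MB/s, 100GB of data), a back-of-the-envelope estimate can be worked out as below. The assumptions are not from the thread: decimal units (1 GB = 1000 MB), the 100GB being the total on disk and spread evenly across the nodes, and every node reading in parallel at the full rate with no other overhead. The scanSeconds function name is made up for illustration. Real MapReduce jobs add interpreter and network costs on top, so this is a lower bound at best.

```javascript
// Rough lower bound on a full scan, under the assumptions stated above.
function scanSeconds(totalGb, nodes, mbPerSecPerNode) {
  var totalMb = totalGb * 1000;                   // decimal units assumed
  var clusterMbPerSec = nodes * mbPerSecPerNode;  // perfect parallelism assumed
  return totalMb / clusterMbPerSec;
}

var seconds = scanSeconds(100, 5, 100);
console.log(seconds + ' s (~' + (seconds / 60).toFixed(1) + ' min)'); // 200 s (~3.3 min)
```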