Michael, These are some interesting observations. Thank you for digging into this more. At first glance it appears that it has something to do with the PB client/server. If you are feeling adventurous might I suggest using fprof to analyze where the time is being spent. I wrote up a gist on how you can do that here:
https://gist.github.com/2287383 I ran this analysis myself on a much smaller dataset. A quick glance at the results shows most of the time being spent in suspend, encode/decode (which I suspected), and queuing data in riak_pipe. I'm really curious to see what your analysis looks like. If you can produce one I'm happy to help analyze. Thanks again for your detailed analysis thus far. -Ryan On Mon, Apr 2, 2012 at 2:33 PM, Michael Radford <m...@blorf.com> wrote: > I modified the pb client code to print the elapsed time when it > receives the first map-reduce result (in the function > wait_for_mapred), and tried to run a more apples-to-apples comparison > of the same map-reduce job via the local client and the pb client > connecting to 127.0.0.1. > > In a nutshell, the majority of the extra delay from the pb client > occurs before it receives the first result (e.g., 6 out of 7 seconds). > (It's still troubling that it takes another second to receive the rest > of the results when reading from localhost, but maybe I can avoid that > by calling a different reduce phase that somehow packages my results > into a single message?) > > Next, it seems like I should try instrumenting the server side of the > protobufs interface with timing...any pointers on the best place(s) to > do that would be appreciated. > > Here is the function I used with the pb client to localhost, and the > results: > TP = > fun (Bucket, Query) -> > {ok, Pid} = riakc_pb_socket:start_link("127.0.0.1", 8087), > Inputs = {modfun, riak_search, mapred_search, [Bucket, Query]}, > Phases = > [ {reduce, > {modfun, riak_kv_mapreduce, reduce_identity}, > [{reduce_phase_only_1, true}], > true} ], > {MuSec, {ok, [{0, Results}]}} = > timer:tc (fun () -> riakc_pb_socket:mapred(Pid, Inputs, Phases) end), > riakc_pb_socket:stop (Pid), > io:format(user, "~b results, ~f sec~n", [length(Results), > MuSec/1000000]) > end. > > 3> TP(Bucket,<<"full_text:flower">>). > first M-R result in 5848490 musec > 14378 results, 7.235226 sec > ok > 4> TP(Bucket,<<"full_text:flower">>). > first M-R result in 5978044 musec > 14378 results, 7.423933 sec > ok > 5> TP(Bucket,<<"full_text:flower">>). > first M-R result in 1219112 musec > 14378 results, 2.563079 sec > ok > 6> TP(Bucket,<<"full_text:flower">>). > first M-R result in 6017087 musec > 14378 results, 7.326352 sec > ok > 7> TP(Bucket,<<"full_text:green">>). > first M-R result in 9546654 musec > 24046 results, 13.561474 sec > ok > 8> TP(Bucket,<<"full_text:green">>). > first M-R result in 1878109 musec > 24047 results, 5.188646 sec > ok > 9> > 9> TP(Bucket,<<"full_text:green">>). > first M-R result in 9724651 musec > 24047 results, 12.692941 sec > ok > 10> TP(Bucket,<<"full_text:green">>). > first M-R result in 9627592 musec > 24047 results, 12.769256 sec > ok > > And here is the function I used for the local client, and the results: > TL = > fun (Bucket, Query) -> > {ok, Client} = riak:local_client(), > Inputs = {modfun, riak_search, mapred_search, [Bucket, Query]}, > Phases = > [ {reduce, > {modfun, riak_kv_mapreduce, reduce_identity}, > [{reduce_phase_only_1, true}], > true} ], > {MuSec, {ok, Results}} = > timer:tc(fun () -> Client:mapred(Inputs, Phases) end), > io:format(user, "~b results, ~f sec~n", [length(Results), > MuSec/1000000]) > end. > > (riak@10.174.223.37)27> TL(Bucket,<<"full_text:flower">>). > 14340 results, 0.106422 sec > ok > (riak@10.174.223.37)28> TL(Bucket,<<"full_text:flower">>). > 14340 results, 0.102255 sec > ok > (riak@10.174.223.37)29> TL(Bucket,<<"full_text:flower">>). > 14340 results, 0.099877 sec > ok > (riak@10.174.223.37)30> TL(Bucket,<<"full_text:blue">>). > 25017 results, 0.210381 sec > ok > (riak@10.174.223.37)31> TL(Bucket,<<"full_text:blue">>). > 25017 results, 0.200112 sec > ok > (riak@10.174.223.37)32> TL(Bucket,<<"full_text:blue">>). > 25018 results, 0.207553 sec > ok > (riak@10.174.223.37)33> TL(Bucket,<<"full_text:green">>). > 23945 results, 0.173061 sec > ok > (riak@10.174.223.37)34> TL(Bucket,<<"full_text:green">>). > 23945 results, 0.191595 sec > ok > (riak@10.174.223.37)35> TL(Bucket,<<"full_text:green">>). > 23945 results, 0.177821 sec > ok > > Mike > > On Mon, Apr 2, 2012 at 8:46 AM, Michael Radford <m...@blorf.com> wrote: > > Hi Ryan, > > > > This is getting interesting: the same queries when executed using > > local clients from 'riak attach' are taking only 100-250 ms. > > > > However, I just tried the same test I was running remotely on Saturday > > on one of the machines in the Riak cluster, using the protobufs client > > to connect to 127.0.0.1, and it's still taking 6-7 seconds per query. > > (Still with an occasional dip down to 2-3 seconds). > > > > These machines are on Amazon EC2, so I have little control over the > > network layout. But that includes the 4 Riak boxes, so if their > > inter-node communication was suffering from similar issues I would > > have expected to see it affecting other map-reduce queries. (We're > > running lots of multi-key lookups via map-reduce that return thousands > > of objects in a few ms, much larger than the keys returned from these > > searches.) > > > > And again, there is a huge difference between the same query using the > > local client (200 ms), and using the protobufs client from the exact > > same Riak machine connecting to localhost (6-7 sec but occasionally > > 2-3 sec). > > > > Mike > > > > On Mon, Apr 2, 2012 at 7:44 AM, Ryan Zezeski <rzeze...@basho.com> wrote: > >> Hi Michael, you'll find my responses inline... > >> > >> On Sat, Mar 31, 2012 at 5:04 PM, Michael Radford <m...@blorf.com> > wrote: > >>> > >>> I'm seeing very slow performance from Riak search even when querying > >>> single terms, and I'd appreciate any advice on how to get insight into > >>> where the time is going. > >>> > >>> Right now, I'm using this function to time queries with the Erlang pb > >>> client: > >>> > >>> TS = > >>> fun (Pid, Bucket, Query) -> > >>> T0 = now(), > >>> {ok, Results} = riakc_pb_socket:search(Pid, Bucket, Query), > >>> MuSec = timer:now_diff(now(), T0), > >>> io:format(user, "~b results, ~f sec~n", [length(Results), > >>> MuSec/1000000]) > >>> end. > >> > >> > >> Just an FYI, you should checkout `timer:tc`. > >>> > >>> > >>> The bucket I'm querying currently has ~300k keys total (each 16 > >>> bytes). (The whole cluster has maybe 1.5M entries in about a dozen > >>> buckets. It's running 1.0.2, 4 nodes on 4 8-core c1.xlarge EC2 > >>> instances.) > >>> > >>> For single-term queries that return 10k+ keys, I'm routinely seeing > >>> times above 6 seconds to run the above function. Intermittently, > >>> however, I'll see the same queries take only 2 seconds: > >>> > >>> > TS(Pid,Bucket,<<"full_text:flower">>). > >>> 12574 results, 6.094149 sec > >>> ok > >>> > TS(Pid,Bucket,<<"full_text:flower">>). > >>> 12574 results, 1.938894 sec > >>> ok > >>> > TS(Pid,Bucket,<<"full_text:flower">>). > >>> 12574 results, 1.981492 sec > >>> ok > >>> > TS(Pid,Bucket,<<"full_text:flower">>). > >>> 12574 results, 6.120589 sec > >>> ok > >>> > >>> > TS(Pid,Bucket,<<"full_text:red">>). > >>> 13461 results, 6.572473 sec > >>> ok > >>> > TS(Pid,Bucket,<<"full_text:red">>). > >>> 13461 results, 6.626049 sec > >>> ok > >>> > TS(Pid,Bucket,<<"full_text:red">>). > >>> 13461 results, 2.155847 sec > >>> ok > >>> > >>> Queries with fewer results are still slow, but not quite as slow as 6 > >>> seconds: > >>> > >>> > TS(Pid,Bucket,<<"full_text:ring">>). > >>> 6446 results, 2.831806 sec > >>> ok > >>> > TS(Pid,Bucket,<<"full_text:ring">>). > >>> 6446 results, 3.037162 sec > >>> ok > >>> > TS(Pid,Bucket,<<"full_text:ring">>). > >>> 6447 results, 0.780944 sec > >>> ok > >>> > >>> And queries with no matches only take a few milliseconds: > >>> > >>> > TS(Pid,Bucket,<<"full_text:blorf">>). > >>> 0 results, 0.003269 sec > >>> ok > >>> > >>> During the slow queries, none of the 4 machines seems to be fully > >>> taxing even one cpu, or doing almost any disk i/o. > >> > >> > >> What does intra/inter network look like? > >> > >>> > >>> > >>> As far as I can tell from looking at the riak_kv/riak_search source, > >>> my query should only be hitting the index and streaming back the keys, > >>> not trying to read every document from disk or sort by score. Is that > >>> correct? > >> > >> > >> It will not read the documents at all but it will sort on score. > Currently > >> there is no way to disable sorting. > >> > >>> > >>> > >>> Assuming that's the case, I can't imagine why it takes so long to look > >>> up 10k keys from the index for a single term, or why the times seem to > >>> be bimodal (which seems like a big clue). Any pointers welcome! > >> > >> > >> Where is your client sitting in regards to your cluster? Is it in the > local > >> network? Could you try attaching to one of your riak nodes, running the > >> query there and compare results? > >> > >> e.g. > >> > >> riak attach > >> > >>> search:search(Bucket, Query). > >> > >> -Ryan > >> >
_______________________________________________ riak-users mailing list riak-users@lists.basho.com http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com