Spaces in the search string
Hi List,

We have a solr index where we store something like:

<<"{\"key_s\":\"ID\",\"body_s\":\"some test string\"}">>

Then we try to do a riakc_pb_socket:search with the pattern:

<<"body_s:*test str*">>

The request will fail with an error message telling us to check the logs, and in there we find:

2016-09-05 13:37:29.271 [error] <0.12067.10>@yz_pb_search:maybe_process:107
{solr_error,{400,"http://localhost:10014/internal_solr/crm_db.campaign_index/select",
<<"{\"error\":{\"msg\":\"no field name specified in query and no default specified via 'df' param\",\"code\":400}}\n">>}}
[{yz_solr,search,3,[{file,"src/yz_solr.erl"},{line,284}]},
 {yz_pb_search,maybe_process,3,[{file,"src/yz_pb_search.erl"},{line,78}]},
 {riak_api_pb_server,process_message,4,[{file,"src/riak_api_pb_server.erl"},{line,388}]},
 {riak_api_pb_server,connected,2,[{file,"src/riak_api_pb_server.erl"},{line,226}]},
 {riak_api_pb_server,decode_buffer,2,[{file,"src/riak_api_pb_server.erl"},{line,364}]},
 {gen_fsm,handle_msg,7,[{file,"gen_fsm.erl"},{line,505}]},
 {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]

Through experimentation I've figured out that it doesn't like the space: it seems to treat the part of the search string after the space as a new key to search for. Which seems fair enough.

Anyone know of a work-around? Or am I formatting my request incorrectly?

Thanks in advance.
//Sean.
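A hedged reading of the failure mode, based on standard Lucene query-parser behaviour rather than anything confirmed in the logs: unescaped whitespace splits the query into separate clauses, and the second clause carries no field name, which is exactly what the "no default specified via 'df' param" error complains about.

    body_s:*test str*
      -> clause 1: body_s:*test   (wildcard term on body_s)
      -> clause 2: str*           (no field name, so Solr falls back to the
                                   'df' parameter, which is unset, hence the 400)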
Re: Spaces in the search string
Hi Jason,

Thanks for the kick, I just needed to look closer! Yes, I had tried escaping, but one of my utility functions for dynamically building the search string had been stripping it out again. D'oh!

Curiously, just escaping the space doesn't work as in the example in the stackoverflow post. Putting the search term in an inner string and escaping its quotes both feels more natural and does work, so I'm going with something more like:

409> riakc_pb_socket:search(Pid, <<"test_index">>, <<"name_s:\"we rt\" AND age_i:0">>, []).
{ok,{search_results,[],0.0,0}}
410> riakc_pb_socket:search(Pid, <<"test_index">>, <<"name_s:we\ rt AND age_i:0">>, []).
{error,<<"Query unsuccessful check the logs.">>}

Cheers,
//Sean.

On Tue, Sep 6, 2016 at 2:48 PM, Jason Voegele wrote:
> Hi Sean,
>
> Have you tried escaping the space in your query?
>
> http://stackoverflow.com/questions/10023133/solr-wildcard-query-with-whitespace
>
> [snip]
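One possible explanation for the failed escape in the 410> attempt above; this is an observation about Erlang string literals, not something confirmed in the thread. Backslash-space is not a recognised escape sequence in Erlang source, so the scanner drops the backslash before the binary is ever built, and Solr receives a plain space:

    1> <<"name_s:we\ rt">>.
    <<"name_s:we rt">>

To send a literal backslash, it has to be doubled in the source:

    2> riakc_pb_socket:search(Pid, <<"test_index">>, <<"name_s:we\\ rt AND age_i:0">>, []).

Whether the escaped form then matches is still down to the Solr side, as the stackoverflow post discusses.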
Re: Spaces in the search string
Hi again!

Apologies for the premature post earlier. I thought I had a solution when I didn't get the error, but when I got around to plugging it into my application it's still not doing everything that I need.

I've narrowed it down to this minimal testcase. First set up the index & insert the data:

{ok,Pid} = riakc_pb_socket:start("127.0.0.1", 10017).
ok = riakc_pb_socket:create_search_index(Pid, <<"test_index">>, <<"_yz_default">>, []).
ok = riakc_pb_socket:set_search_index(Pid, <<"test_bucket">>, <<"test_index">>).
RO = riakc_obj:new(<<"test_bucket">>, <<"test_key">>, <<"{\"name_s\":\"my test name\",\"age_i\":2}">>, "application/json").
ok = riakc_pb_socket:put(Pid, RO).

Now I can get the hit when searching for a partial name with wildcards & no escapes or spaces:

521> riakc_pb_socket:search(Pid, <<"test_index">>, <<"name_s:*test* AND age_i:2">>, []).
{ok,{search_results,[{<<"test_index">>,
                      [{<<"score">>,<<"1.227760798549e+00">>},
                       {<<"_yz_rb">>,<<"test_bucket">>},
                       {<<"_yz_rt">>,<<"default">>},
                       {<<"_yz_rk">>,<<"test_key">>},
                       {<<"_yz_id">>,<<"1*default*test_bucket*test_key*57">>},
                       {<<"name_s">>,<<"my test name">>},
                       {<<"age_i">>,<<"2">>}]}],
     1.2277607917785645,1}}

And I can get the hit when I search for the full name with spaces & the escaped quotes:

522> riakc_pb_socket:search(Pid, <<"test_index">>, <<"name_s:\"my test name\" AND age_i:2">>, []).
{ok,{search_results,[{<<"test_index">>,
                      [{<<"score">>,<<"1.007369608719e+00">>},
                       {<<"_yz_rb">>,<<"test_bucket">>},
                       {<<"_yz_rt">>,<<"default">>},
                       {<<"_yz_rk">>,<<"test_key">>},
                       {<<"_yz_id">>,<<"1*default*test_bucket*test_key*58">>},
                       {<<"name_s">>,<<"my test name">>},
                       {<<"age_i">>,<<"2">>}]}],
     1.0073696374893188,1}}

But how can I search for a partial name with spaces?

523> riakc_pb_socket:search(Pid, <<"test_index">>, <<"name_s:\"*y test na*\" AND age_i:2">>, []).
{ok,{search_results,[],0.0,0}}

I get the feeling that I'm missing something really obvious but can't see it. Any more pointers appreciated!

//Sean.

On Wed, Sep 7, 2016 at 10:11 AM, sean mcevoy wrote:
> [snip]
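A hedged note on why the 523> query returns nothing; this is standard Lucene behaviour rather than something confirmed in the thread: the standard query parser does not expand wildcards inside a quoted phrase, so "*y test na*" is treated as literal text, not as a pattern. With the default schema, a *_s field is an untokenized string, which is why the spaceless wildcard in 521> can match the whole value. One experiment worth trying is backslash-escaping the spaces so the whole value stays a single wildcard term (backslashes doubled so they survive the Erlang scanner); whether this matches is version- and analysis-dependent:

    riakc_pb_socket:search(Pid, <<"test_index">>, <<"name_s:*y\\ test\\ na* AND age_i:2">>, []).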
Re: Spaces in the search string
Hi Alexander,

Unfortunately it didn't shake out with any satisfaction. I'm sure there's an easy answer, and I hope I'll get back to search for it some day. But for now me & my pragmatic overlords have gone for a work-around solution that avoids the problem.

//Sean.

On Wed, Sep 7, 2016 at 2:06 PM, Alexander Sicular wrote:
> Hi Sean, Familiarize yourself with the default schema[0], if that is what
> you're using. Also check details around this specific type of search
> around the web[1].
>
> Let us know how it shakes out,
> -Alexander
>
> [0] https://raw.githubusercontent.com/basho/yokozuna/develop/priv/default_schema.xml
> [1] http://stackoverflow.com/questions/10023133/solr-wildcard-query-with-whitespace
>
> On Wednesday, September 7, 2016, sean mcevoy wrote:
>> [snip]
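The thread doesn't say which work-around was adopted, but one common approach to this problem is to index an extra, normalised copy of the field with the spaces stripped out, so that a spaceless wildcard query can match it. A minimal sketch against the earlier testcase; the name_nospace_s field is hypothetical:

    RO = riakc_obj:new(<<"test_bucket">>, <<"test_key">>,
           <<"{\"name_s\":\"my test name\",\"name_nospace_s\":\"mytestname\",\"age_i\":2}">>,
           "application/json").
    ok = riakc_pb_socket:put(Pid, RO).
    %% Strip spaces from the user's search term too, then:
    riakc_pb_socket:search(Pid, <<"test_index">>, <<"name_nospace_s:*ytestna* AND age_i:2">>, []).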
Solr search performance
Hi All,

We have an index with ~548,000 entries, ~14,000 of which match one of our queries. We read these in a paginated search and the first page (of 100 hits) returns quickly in ~70ms. This response time seems to increase exponentially as we walk through the pages:
the 4th page takes ~200ms,
the 8th page takes ~1200ms,
the 12th page takes ~2100ms,
the 16th page takes ~6100ms,
the 20th page takes ~24000ms.

And by the time we're searching for the 22nd page it regularly times out at the default 60 seconds.

I have a good understanding of riak KV internals but know absolutely nothing of Lucene, which I think is what's most relevant here. If anyone in the know can point me towards any relevant resource or can explain what's happening I'd be much obliged :-) As I would also be if anyone with experience of using Riak/Lucene can tell me:
- Is 500K a crazy number of entries to put into one index?
- Is 14K a crazy number of entries to expect to be returned?
- Are there any methods we can use to make the search time more constant across the full search? I read one blog post on inlining but it was a bit old & not very obvious how to implement using riakc_pb_socket calls.

And out of curiosity, do we not traverse the full range of hits for each page? I naively thought that because I'm sorting the returned values we'd have to get them all first and then sort, but the response times suggest otherwise. Does Lucene store the data sorted by each field just in case a query asks for it? Or what other magic is going on?

For the technical details, we use the "_yz_default" schema and all the fields stored are strings:
- entry_id_s: unique within the DB; the aim of the query is to gather a list of these
- type_s: has one of 2 values
- sub_category_id_s: in the query described above all 14K hits will match on this; in the DB of ~500K entries there are ~43K different values for this field, with each category typically having 2-6 sub-categories
- category_id_s: not matched in this query; in the DB of ~500K entries there are ~13K different values for this field
- status_s: has one of 2 values; in the query described above all hits will have the value "active"
- user_id_s: unique within the DB but not matched in this query
- first_name_s: almost unique within the DB; this query will sort by this field
- last_name_s: almost unique within the DB; this query will sort by this field

The search query looks like:
<<"sub_category_id_s:test_1 AND status_s:active AND type_s:sub_category">>

Our options parameter has the sort directive:
{sort, <<"first_name_s asc, last_name_s asc">>}

The query was run on a 5-node cluster with an n_val of 3.

Thanks in advance for any pointers!
//Sean.
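For reference, a hedged sketch of the kind of page request described above; the index name is illustrative, while start, rows and sort are standard riakc search options. The slow-down pattern is consistent with Solr's offset paging: to serve a deep page each shard must collect and sort start+rows candidate documents before the coordinating node merges them, so the work grows with the offset even though only 100 rows come back.

    Page = 20, Rows = 100.
    riakc_pb_socket:search(Pid, <<"our_index">>,
        <<"sub_category_id_s:test_1 AND status_s:active AND type_s:sub_category">>,
        [{start, (Page - 1) * Rows},
         {rows, Rows},
         {sort, <<"first_name_s asc, last_name_s asc">>}]).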
Re: Solr search performance
Hi Fred,

Thanks for the pointer! 'cursorMark' is a lot more performant alright, though apparently it doesn't suit our use case. I've written a loop function using OTP's httpc that reads each page, gets the cursorMark and repeats, and it returns all 147 pages with consistent times in the 40-60ms bracket, which is an excellent improvement!

I would have been asking about the effort involved in making the protocol buffers client support this, but instead our GUI guys insist that they need to request a page number, as sometimes they want to start in the middle of a set of data. So I'm almost back to square one.

Can you shed any light on the internal workings of SOLR that produce the slow-down in my original question? I'm hoping I can find a way to restructure my index data without having to change the higher-level APIs that I support.

Cheers,
//Sean.

On Mon, Sep 19, 2016 at 10:00 PM, Fred Dushin wrote:
> All great questions, Sean.
>
> A few things. First off, for result sets that are that large, you are
> probably going to want to use Solr cursor marks [1], which are supported
> in the current version of Solr we ship. Riak allows queries using cursor
> marks through the HTTP interface. At present, it does not support cursors
> using the protobuf API, due to some internal limitations of the
> server-side protobuf library, but we do hope to fix that in the future.
>
> Secondly, we have found sorting with distributed queries to be far more
> performant using Solr 4.10.4. Currently released versions of Riak use
> Solr 4.7, but as you can see on github [2], Solr 4.10.4 support has been
> merged into the develop-2.2 branch, and is in the pipeline for release. I
> can't say when the next version of Riak is that will ship with this
> version because of indeterminacy around bug triage, but it should not be
> too long.
>
> I would start to look at using cursor marks and measure their relative
> performance in your scenario. My guess is that you should see some
> improvement there.
>
> -Fred
>
> [1] https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results
> [2] https://github.com/basho/yokozuna/commit/f64e19cef107d982082f5b95ed598da96fb419b0
>
> On Sep 19, 2016, at 4:48 PM, sean mcevoy wrote:
> [snip]
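A hedged sketch of the httpc loop described above: host, port, index and query are illustrative, and parse_solr_page is a hypothetical helper for whatever JSON library is to hand. Cursor paging requires a sort on a unique field (here _yz_id), starts with cursorMark=*, and is finished when Solr returns the same cursor that was sent:

    fetch_all_pages() ->
        {ok, _} = application:ensure_all_started(inets),
        fetch_pages("*", []).

    fetch_pages(Cursor, Acc) ->
        Url = "http://127.0.0.1:8098/search/query/test_index"
              "?wt=json&q=*:*&rows=100&sort=_yz_id%20asc"
              "&cursorMark=" ++ http_uri:encode(Cursor),  %% cursors can contain '+' & '/'
        {ok, {{_, 200, _}, _Hdrs, Body}} = httpc:request(Url),
        {Docs, Next} = parse_solr_page(Body),  %% hypothetical: pull "docs" & "nextCursorMark"
        case Next of
            Cursor -> lists:reverse([Docs | Acc]);   %% cursor unchanged: last page
            _      -> fetch_pages(Next, [Docs | Acc])
        end.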
Doc typo
Hi Basho guys,

What's your procedure on reporting documentation bugs?

Just found a typo on this page:
https://docs.basho.com/riak/ts/1.4.0/using/creating-activating/

The command under the heading "Creating a table with riak-admin" reads:
"CREATE TABLE GeoCheckin (id, SINT64 NOT NULL, region VAR..."
which throws an error about a comma, while this version works:
"CREATE TABLE GeoCheckin (id SINT64 NOT NULL, region VAR..."

//Sean.
Re: Doc typo
Cheers Luca, easy when you know how ;-) PR has been made.

//Sean.

On Tue, Nov 15, 2016 at 9:31 AM, Luca Favatella <luca.favate...@erlang-solutions.com> wrote:
> On 15 November 2016 at 09:17, sean mcevoy wrote:
> [...]
>> What's your procedure on reporting documentation bugs?
>
> Hi Sean,
>
> I understand the source of the docs is at
> https://github.com/basho/basho_docs and the usual pull requests workflow
> applies.
>
> Regards
> Luca
Re: Start up problem talking to Riak
Hi David,

I vaguely remember the same problem from a previous setup I did, a while ago now. IIRC, the originally configured IP gets written to disk on the initial start, and then the next start fails due to the mismatch. Try deleting your data directory and restarting, so this will be like the initial startup again.

//Sean.

On Thu, Feb 16, 2017 at 10:31 AM, AWS wrote:
> OK, I took the machine down intending to write my own quick and easy key
> value db (I have done this before) but I have told my uni that I am using
> Riak so I thought that I ought to have another go.
>
> I have reinstalled Ubuntu and Riak. The computer now has an internal IP
> (192.168.1.94) rather than an external fixed IP. I started up Riak and
> got a pong. I then tried to connect with my software that already works
> with the Riak I have running on Amazon AWS, so I know that the software
> works - just a request for a list of buckets.
>
> I got a "Connection refused" error. I checked and 8098 was closed. I
> edited riak.conf as advised from 127.0.0.1 to 0.0.0.0 but now Riak won't
> start (Riak failed to start in 15 seconds).
>
> I really want to get this working and, given I have an assignment due in
> a few days, sooner rather than later. It is working fine on AWS but it is
> such a faff getting onto that using ssh as I am never sure of my key to
> use so I can't check the config there.
>
> This can't be hard, can it?
>
> Please help.
>
> David
>
> ----- Original Message -----
> From: "Alex Moore"
> To: "Alexander Sicular"
> Cc: "AWS", "riak-users@lists.basho.com"
> Subject: Re: Start up problem talking to Riak
> Date: 02/13/2017 15:45:28 (Mon)
>
> Yeah, what Alex said. You can't see it with your application because it's
> currently bound to the localhost loopback address, but it's bad to just
> expose everything publicly.
>
> 1. Where is this cluster running? (AWS or local dev cluster?)
> 2. What are you trying to connect to Riak with? Is it one of our clients
> or just raw HTTP requests?
>
> Thanks,
> Alex
>
> On Mon, Feb 13, 2017 at 10:33 AM, Alexander Sicular wrote:
>> Please don't do that. Don't point the internet at your database. Have
>> them communicate amongst each other on internal IPs and route the public
>> through a proxy / middleware.
>>
>> -Alexander
>>
>>> On Feb 13, 2017, at 04:00, AWS wrote:
>>>
>>> I know that this isn't directly a Riak issue but I am sure that some of
>>> you have met this before and can maybe help me. I am used to Macs and
>>> Windows but have now set up an Ubuntu 14.04LTS server on my home
>>> network. I have 5 fixed IP addresses so the server has its own external
>>> address. I have opened port 8098 on my router to point at the server
>>> and checked that ufw isn't running. I have tested with it running ufw
>>> and with 'allow 8098' applied. I still cannot connect to Riak. On the
>>> same computer I get a pong back to a ping so Riak seems to be OK.
>>>
>>> I have a Riak server running on AWS and had trouble setting that up
>>> until I, eventually, opened all ports.
>>>
>>> Can anyone please suggest some steps that I might take? I need this
>>> running for an Open University course that I am studying. My AWS free
>>> server runs out before the course finishes so I have to get this up and
>>> running soon.
>>> Thanks in advance.
>>> David
Solr search response time spikes
Hi List,

We have a standard riak cluster with 5 nodes and at the minute the traffic levels are fairly low. Each of our application nodes has 25 client connections, 5 to each riak node, which get selected in a round robin.

Our application-level requests involve multiple riak requests, so our traffic tends to make requests in small bursts. Everything works fine for KV gets, puts & deletes but we're seeing timeouts & weird response time spikes on solr search operations.

In the past 36 hours (the only period I have riak stats for) I see one response time of 38.8 seconds, 3 hours earlier a response time of 20.8 seconds, and the third biggest spike is an acceptable 3.5 seconds. See below all search_query stats for the minute of the 38 sec sample.

In the application request we made 5 riak search requests to the same index in parallel, which happens for each request of this type and normally doesn't have an issue. But in this case all 5 timed out, and one timed out again on retry with the other 4 succeeding.

Anyone ever seen anything like this before? Is there any known deadlock in solr that I might hit if I make the same request on another connection before the first has completed? This is what we do when our riak client times out after 2 seconds and immediately retries (the pattern is sketched after the stats below).

Any advice or pointers welcomed.

Thanks,
//Sean.

Riak node 1
search_query_throughput_one: 14
search_query_throughput_count: 259
search_query_latency_min: 2776
search_query_latency_median: 69411
search_query_latency_mean: 4900973
search_query_latency_max: 38887902
search_query_latency_999: 38887902
search_query_latency_99: 38887902
search_query_latency_95: 2046215
search_query_fail_one: 0
search_query_fail_count: 0

Riak node 2
search_query_throughput_one: 22
search_query_throughput_count: 564
search_query_latency_min: 4006
search_query_latency_median: 8800
search_query_latency_mean: 11834
search_query_latency_max: 25509
search_query_latency_999: 25509
search_query_latency_99: 25509
search_query_latency_95: 24035
search_query_fail_one: 0
search_query_fail_count: 0

Riak node 3
search_query_throughput_one: 6
search_query_throughput_count: 298
search_query_latency_min: 3200
search_query_latency_median: 15391
search_query_latency_mean: 18062
search_query_latency_max: 31759
search_query_latency_999: 31759
search_query_latency_99: 31759
search_query_latency_95: 31759
search_query_fail_one: 0
search_query_fail_count: 0

Riak node 4
search_query_throughput_one: 8
search_query_throughput_count: 334
search_query_latency_min: 2404
search_query_latency_median: 7230
search_query_latency_mean: 10211
search_query_latency_max: 22502
search_query_latency_999: 22502
search_query_latency_99: 22502
search_query_latency_95: 22502
search_query_fail_one: 0
search_query_fail_count: 0

Riak node 5
search_query_throughput_one: 0
search_query_throughput_count: 0
search_query_latency_min: 0
search_query_latency_median: 0
search_query_latency_mean: 0
search_query_latency_max: 0
search_query_latency_999: 0
search_query_latency_99: 0
search_query_latency_95: 0
search_query_fail_one: 0
search_query_fail_count: 0
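A hedged sketch of the timeout-and-retry behaviour described above, assuming riakc's search/5 variant that takes a per-request timeout as its last argument; next_connection/1 is a hypothetical round-robin pool helper:

    %% 2 s request timeout, one immediate retry on the next connection.
    search_with_retry(Pool, Index, Query, Opts) ->
        case riakc_pb_socket:search(next_connection(Pool), Index, Query, Opts, 2000) of
            {ok, Results} -> {ok, Results};
            {error, _}    ->
                riakc_pb_socket:search(next_connection(Pool), Index, Query, Opts, 2000)
        end.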
Re: Solr search response time spikes
Hi Fred,

Thanks for taking the time!

Yes, I noticed that unbalance yesterday when writing. I looked into it after sending and found our config is corrupt, with one node omitted and another in there twice. But with such low traffic levels, and the spikes being on the non-favoured node, I'm not currently ranking that as a likely factor.

Another interesting case from last night. This sample was taken at 2017-6-23 06:04:09:

Riak node 1
"search_query_throughput_one": 27
"search_query_latency_max": 10417

Riak node 2
"search_query_throughput_one": 49
"search_query_latency_max": 8952

Riak node 3
"search_query_throughput_one": 18
"search_query_throughput_count": 2507
"search_query_latency_min": 1757
"search_query_latency_median": 14775
"search_query_latency_mean": 5628361
"search_query_latency_max": 18298854
"search_query_latency_999": 18298854
"search_query_latency_99": 18298854
"search_query_latency_95": 16539782

Riak node 4
"search_query_throughput_one": 25
"search_query_latency_max": 10217

Brushing up my maths and focussing on node 3: from the 99 & 95% figures we can tell the 2 slowest response times were 18,298 & 16,539 ms, 34,837 ms in total. And from the request count for the minute & the mean we can tell that in total these 18 requests spent a total of 101,310 ms being processed. From the median & min we know the 9 quickest took between 18 & 265 ms in total. This leaves in the region of 66 sec for the other 7 requests, enough for all 7 to have timed out.

Cross-referencing with our application logs I can see:

On application node 1 at 2017-06-23 06:03:17 we had 3 search request timeouts to index A with 3 different filters, one field of which, let's call it field X, had the same value. We immediately retried these and at 2017-06-23 06:03:19 2 of those timed out again and were retried again. They all succeeded on this retry, so this suggests that the same requests sent to other riak nodes were fine, but to this riak node at this time were a problem.

On application node 2 at:
2017-06-23 06:03:27
2017-06-23 06:03:29
2017-06-23 06:03:31
2017-06-23 06:03:33
we had 4 more timeouts on search requests to index A. These requests had 2 different filters, but in both cases field X had the same value as in the previous example.

So these application logs show 9 riak timeouts, which must correlate with the riak stats. I can't definitively say that no other search requests went to this riak node between 06:03:15 & 06:03:33, but the circumstantial evidence is that it had a problem for 18 seconds, which is quite a big window.

The index that all these requests were directed at currently has 490K entries with 8 different fields defined in each. The corresponding riak bucket has allow_mult = false, if that's relevant.

We see a similar pattern on our test system. I'm going to set up a test to repeatedly do searches and see if I can trigger this consistently. Will let ye know if anything interesting comes out of it.

I know it's relatively new to the product, but do we know if riak solr is used much in production systems? I assume no one else has seen these spikes?

//Sean.

On Thu, Jun 22, 2017 at 9:40 PM, Fred Dushin wrote:
> It's pretty strange that you are seeing no search latency measurements on
> node 5. Are you sure your round robining is working? Are you favoring
> node 1?
>
> In general, I don't think which node you hit for query should make a
> difference, but I'd have to stare at the code some to be sure. In
> essence, all the node that services the query does is convert the query
> into a sharded Solr query based on a coverage plan, which changes every
> minute or so, and then runs the sharded query on the local Solr node. The
> Solr node then distributes the query to the rest of the nodes in the
> cluster, but that's all Solr comms -- Riak is out of the picture, by
> then.
>
> Now, if you have a lot of sharded queries accumulating on one node, that
> might make a difference to Solr. I am not a Solr expert, and I don't even
> play one on TV. But maybe the fact that you are not hitting node 5 is
> relevant for that reason?
>
> Can you do more analysis on your client, to make sure you are not
> favoring node 1?
>
> -Fred
>
> On Jun 22, 2017, at 10:20 AM, sean mcevoy wrote:
> [snip]
Re: Riak Intermittent Read Failures
Hi Mark,

I've observed timeouts too, but always on search operations; you might have seen my thread "Solr search response time spikes".

I'm getting stats by polling this every minute:
http://docs.basho.com/riak/kv/2.2.3/developing/api/http/status/
The 99 & 100% response times are most interesting for debugging our problems.

What client & timeout value are you using? I'm using the erlang client, where the default timeout is 60 seconds, but I've overridden that and am using 2 seconds.

Interestingly, over the weekend I've started to see a few put & get timeouts on the application side, but the longest 100% response time is just under a second, which points to a network delay.

I'd start by polling these stats and then examining when you get an application-side timeout. Maybe check the size stats too. If you can catch which key the operation timed out on, it'd be worth checking the object size & sibling count for it. If nothing else this would eliminate the possibility that it's unique to a particular object.

//Sean.

On Sat, Jun 24, 2017 at 12:57 PM, markrthomas wrote:
> Hello
>
> I'm getting intermittent read failures in my cluster, i.e. timeout.
>
> Sometimes an object returns immediately.
> Other times, nothing at all and I get a read-timeout.
>
> Any ideas on where I start debugging this issue?
>
> Thanks
> Mark
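A hedged sketch of the once-a-minute stats polling described above, using OTP's httpc against the HTTP status endpoint; host and port are illustrative, and picking fields such as search_query_latency_999 out of the JSON body is left to whatever JSON library is available:

    poll_stats(Url) ->
        {ok, _} = application:ensure_all_started(inets),
        {ok, {{_, 200, _}, _Hdrs, Body}} = httpc:request(Url),
        io:format("~s~n", [Body]),  %% or parse & record the latency fields
        timer:sleep(60000),
        poll_stats(Url).

    %% e.g. poll_stats("http://127.0.0.1:8098/stats").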
Re: Solr search response time spikes
Hi List, Fred,

After a week of going cross-eyed looking at stats & trying to engineer a test case to make this happen in the test env, I think I've made a breakthrough.

We have a low but steady level of riak traffic, but our application-level actions that result in solr reads are actually fairly infrequent. And when one of these actions occurs it results in multiple parallel reads to our solr indexes.

What I've observed is that our timeouts are most easily reproduced after a period of inactivity. And once I see a timeout after 2 seconds, I kick off multiple other reads to random keys and observe that some return instantly while others can take several seconds, but then all return at the same time. It's almost as if some shards in the java VM have gone to sleep due to inactivity, and we see a cluster of timeouts when we try to read from them.

I'm setting up a "pinger" script in our prod env to keep these awake and see if our observed timeout rate reduces (a sketch follows below).

If this is actually our problem, are there any JVM config options we can use to keep the index active all the time?

//Sean.

On Fri, Jun 23, 2017 at 1:48 PM, sean mcevoy wrote:
> [snip]
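A hedged sketch of the "pinger" mentioned above; the index list, the probe query and the 30-second interval are all assumptions about the setup rather than a recommendation:

    ping_loop(Pid, Indexes) ->
        [riakc_pb_socket:search(Pid, Index, <<"_yz_rk:warmup_probe">>, [{rows, 1}])
         || Index <- Indexes],
        timer:sleep(30000),
        ping_loop(Pid, Indexes).

    %% e.g. ping_loop(Pid, [<<"index_a">>, <<"index_b">>]).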
Node Recovery Questions
Hi All,

A few questions on the procedure here to recover a failed node:
http://docs.basho.com/riak/kv/2.2.3/using/repair-recovery/failed-node/

We lost a production riak server when AWS decided to delete a node, and we plan on doing this procedure to replace it with a newly built node. A practice run in our QA environment has brought up some questions.

- How can I tell when everything has synched up? I thought I could just monitor the handoffs, but these completed within 5 minutes of committing the cluster changes, while the data directories continued to grow rapidly in size for at least an hour. I assume that this was data being synched to the new node, but how can I tell when it has completed from the user level? Or is it left up to AAE to sync the data?

- The size of the bitcask directory on the 4 original nodes is ~10GB; on the new node the size of this directory climbed to 1GB within an hour but hasn't moved much in the 4 days since. I know bitcask entries still exist until the periodic compaction, but can it be right that it's hanging on to 90% of the disk space it's using for dead data?

- Not directly related to the recovery procedure, but while one node of a five-node cluster is down, how is the extra load distributed within the cluster? It will still keep 3 copies of each entry, right? Are the copies that would have been on the missing node all stored on the next node in the ring, or distributed all around the cluster?

Thanks in advance,
//Sean.
Re: Node Recovery Questions
Hi Martin,

Thanks for taking the time.

Yes, by "size of the bitcask directory" I mean I did a "du -h --max-depth=1 bitcask", so I think that would cover all the vnodes. We don't use any other backends.

Those answers are helpful. I will get back to this in a few days and see what I can determine about where our data physically lies. Might have more questions then.

Cheers,
//Sean.

On Wed, Aug 8, 2018 at 6:05 PM, Martin Sumner wrote:
> Based on a quick read of the code, compaction in bitcask is performed
> only on "readable" files, and the current active file for writing is
> excluded from that list. With default settings, that active file can grow
> to 2GB. So it is possible that if objects had been replaced/deleted many
> times within the active file, that space will not be recovered if all the
> replacements amount to < 2GB per vnode. So at these small data sizes you
> may get a relatively significant discrepancy between an old and recovered
> node in terms of disk space usage.
>
> On 8 August 2018 at 17:37, Martin Sumner wrote:
>> Sean,
>>
>> Some partial answers to your questions.
>>
>> I don't believe force-replace itself will sync anything up - it just
>> reassigns ownership (hence handoff happens very quickly).
>>
>> Read repair would synchronise a portion of the data. So if 10% of your
>> data is read regularly, this might explain some of what you see.
>>
>> AAE should also repair your data. But if nothing has happened for 4
>> days, then that doesn't seem to be the case. It would be worth checking
>> the aae-status page
>> (http://docs.basho.com/riak/kv/2.2.3/using/admin/riak-admin/#aae-status)
>> to confirm things are happening.
>>
>> I don't know if there are any minimum levels of data before bitcask
>> will perform compaction. There's nothing obvious in the code that
>> wouldn't be triggered way before 90%. I don't know if it will merge on
>> the active file (the one currently being written to), but that is 2GB
>> max size (configured through bitcask.max_file_size).
>>
>> When you say the size of the bitcask directory - is this the size
>> shared across all vnodes on the node? I guess if each vnode has a single
>> file <2GB, and there are multiple vnodes - something unexpected might
>> happen here? If bitcask does indeed not merge the file active for
>> writing.
>>
>> In terms of distribution around the cluster, if you have an n_val of 3
>> you should normally expect to see a relatively even distribution of the
>> data on failure (certainly not it all going to one). Worst case scenario
>> is that 3 nodes get all the load from that one failed node.
>>
>> When a vnode is inaccessible, 3 (assuming n=3) fallback vnodes are
>> selected to handle the load for that 1 vnode (as that vnode would
>> normally be in 3 preflists, and commonly a different node will be asked
>> to start a vnode for each preflist).
>>
>> I will try and dig later into bitcask merge/compaction code, to see if
>> I spot anything else.
>>
>> Martin
Re: Node Recovery Questions
Hi Martin, List,

Just an update to let ye know how things went and what we learned.

We did the force-replace procedure to bring the new node into the cluster in place of the old one. I attached to the riak erlang shell and with a little hacking was able to get all the bitcask handles and then do a bitcask:fold/3 to count keys. This showed that only a small percentage of all keys were present on the new node, even after the handoffs and transfers had completed.

Following the instructions at the bottom of this page:
https://docs.basho.com/riak/kv/2.2.0/using/repair-recovery/repairs/
I attached to the erlang shell again and ran these commands (replacing the IP with our actual IP) to force repairs on all vnodes:

{ok, Ring} = riak_core_ring_manager:get_my_ring().
Partitions = [P || {P, 'dev1@127.0.0.1'} <- riak_core_ring:all_owners(Ring)].
[riak_kv_vnode:repair(P) || P <- Partitions].

The progress was most easily monitored with:

riak-admin handoff summary

and once complete the new node had the expected number of keys.

Counting the keys is more than a bit hacky and occasionally caused a seg fault if there was background traffic, so I don't recommend it in general. But it did allow us to verify where the data was in our test env, and then we could trust the procedure without counting keys in production.

Monitoring the size of the bitcask directory is a lot lower resolution, but it is at least safe. The results were similar in test & production, so it was sufficient to verify the above procedure.

So in short: when replacing a node, the force-replace procedure doesn't actually cause data to be synched to the new node. The above erlang shell commands do force a sync.

Thanks for the support!
//Sean.

On Thu, Aug 9, 2018 at 11:25 PM sean mcevoy wrote:
> [snip]
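For reference, a hedged sketch of the key-counting hack described above; the bitcask data path is an assumption about the install, and as noted this occasionally seg-faulted under background traffic, so treat it as a test-environment tool only:

    %% Run from riak attach on the node being checked; one directory per vnode.
    Dirs = filelib:wildcard("/var/lib/riak/bitcask/*").
    Counts = [begin
                Ref = bitcask:open(Dir),   %% read-only handle by default
                N = bitcask:fold(Ref, fun(_K, _V, Acc) -> Acc + 1 end, 0),
                ok = bitcask:close(Ref),
                {Dir, N}
              end || Dir <- Dirs].
    lists:sum([N || {_, N} <- Counts]).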