Hi All,

We have an index with ~548,000 entries, ~14,000 of which match one of our queries. We read these in a paginated search, and the first page (of 100 hits) returns quickly, in ~70ms. The response time seems to increase exponentially as we walk through the pages:

- the 4th page takes ~200ms
- the 8th page takes ~1200ms
- the 12th page takes ~2100ms
- the 16th page takes ~6100ms
- the 20th page takes ~24000ms

By the time we're searching for the 22nd page, it regularly times out at the default 60 seconds.

I have a good understanding of Riak KV internals but know absolutely nothing about Lucene, which I think is what's most relevant here. If anyone in the know can point me towards a relevant resource or explain what's happening, I'd be much obliged :-)

I'd be equally grateful if anyone with experience of using Riak/Lucene can tell me:

- Is 500K a crazy number of entries to put into one index?
- Is 14K a crazy number of entries to expect to be returned?
- Are there any methods we can use to make the search time more constant across the full search? I read one blog post on inlining, but it was a bit old and it wasn't obvious how to implement it using riakc_pb_socket calls.

And out of curiosity, do we not traverse the full range of hits for each page? I naively thought that because I'm sorting the returned values we'd have to get them all first and then sort, but the response times suggest otherwise. Does Lucene store the data sorted by each field just in case a query asks for it? Or what other magic is going on?

For the technical details, we use the "_yz_default" schema and all the fields stored are strings:

- entry_id_s: unique within the DB; the aim of the query is to gather a list of these
- type_s: has one of 2 values
- sub_category_id_s: in the query described above, all 14K hits will match on this; in the DB of ~500K entries there are ~43K different values for this field, with each category typically having 2-6 sub categories
- category_id_s: not matched in this query; in the DB of ~500K entries there are ~13K different values for this field
- status_s: has one of 2 values; in the query described above, all hits will have the value "active"
- user_id_s: unique within the DB but not matched in this query
- first_name_s: almost unique within the DB; this query will sort by this field
- last_name_s: almost unique within the DB; this query will sort by this field

The search query looks like:

    <<"sub_category_id_s:test_1 AND status_s:active AND type_s:sub_category">>

Our options parameter has the sort directive:

    {sort, <<"first_name_s asc, last_name_s asc">>}

The query was run on a 5-node cluster with an n_val of 3.

Thanks in advance for any pointers!

//Sean.
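
P.S. In case it helps to see the shape of the calls, here is a minimal sketch of how a page could be requested with riakc_pb_socket:search/4. It is an illustration only, not our production code: the module name, the function names and the index name passed in are placeholders, and the page size of 100 is taken from the description above. The start and rows options are the standard paging options of the client.

```erlang
-module(paged_search_sketch).
-export([page_options/2, fetch_page/3]).

%% Page size used in the description above (100 hits per page).
-define(PAGE_SIZE, 100).

%% Build the search options for a 1-based page number.
%% The start offset grows with the page number, e.g. page 22 -> start 2100.
page_options(PageNo, PageSize) when PageNo >= 1, PageSize >= 1 ->
    [{start, (PageNo - 1) * PageSize},
     {rows, PageSize},
     {sort, <<"first_name_s asc, last_name_s asc">>}].

%% Pid is a connection obtained from riakc_pb_socket:start_link/2.
%% Index is the name of the search index (a placeholder here).
fetch_page(Pid, Index, PageNo) ->
    Query = <<"sub_category_id_s:test_1 AND status_s:active AND type_s:sub_category">>,
    riakc_pb_socket:search(Pid, Index, Query, page_options(PageNo, ?PAGE_SIZE)).
```

Walking the result set means calling fetch_page(Pid, Index, 1), then page 2, and so on, so the only things that change between calls are the page number and therefore the start offset.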
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com