Thank you for the detailed answer. I'll look into the FieldCacheRangeFilter. Is there another way of getting a page of results starting with some term that does not have similar issue? Basically I want to implement pagination of docs sorted by title, next page starting from last doc of the previous, and while, typically, getting the first page is fastest, in my case it ends up being slowest. Is it possible to disable counting hits and just return top 50 results that qualify the query?
Also, if the hit count is a direct influence on the query speed, how come some other queries that span a larger or similar set are faster? Say, a TermQuery for a term that happens to be the same for majority of docs and spans more docs than the above termrange. Aleksey On Tue, May 7, 2013 at 12:13 AM, Uwe Schindler <u...@thetaphi.de> wrote: > Hi, > > The problem is by design: Lucene is an inverted index, so lookups can only be > done by single terms and find the documents related to every single term. To > execute a range, the query first have to position the terms enum on the first > term and then iterate over all *terms* in the index (not documents) until the > last term is reached. If the number of terms in the field is large (because > you have many distinct values), this takes some time. For every term in the > enumeration that matches the range, Lucene has to look up all matching > documents in the posting list and report them as hits (using a bitset). The > latter (looking up the posting lists involves lots of work), so ranges with > thousands of terms will get slow. > > So the time depends: How many terms are in your term dictionary between the > lower bound and the higher bound of your range, not really the size of the > index (although this is quite often directly related). > > If you want faster range queries, use maybe NumericRangeQuery, because this > has some optimizations on the cost of a large index size. But if you are > stuck with text, you may also review FieldCacheRangeFilter (which only works > for untokenized fields, but I assume from your example "title" is not > tokenized). > > The order of results of a range query is in "index order", because there is > no TF-IDF ranking involved (all hits have the same score of 1). Index order > means the order in which they were indexed. > > ----- > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > > >> -----Original Message----- >> From: Aleksey [mailto:bitterc...@gmail.com] >> Sent: Tuesday, May 07, 2013 4:15 AM >> To: java-user@lucene.apache.org >> Subject: TermRangeQuery performance oddness >> >> Hi guys, >> >> If I run 2 term range queries: >> >> new TermRangeQuery("title", new BytesRef("A"), null, true, true); and new >> TermRangeQuery("title", new BytesRef("Z"), null, true, true); >> >> The one that starts with "Z" is several times faster (I make 1000 queries in >> a >> loop to measure). I understand that the first one has much larger hit number, >> but if the query is bounded to 50 results, why does that matter? >> At first I thought that it grabs all hits and sorts them, but then it >> doesn't seem >> to make any difference whether or not I pass sort by "title" parameter to the >> searcher. Results are either sorted or kind of random, but speed is the same. >> Why is that? >> >> Thank you in advance, >> >> Aleksey >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org