RE: TermRangeQuery performance oddness

Uwe Schindler Tue, 07 May 2013 00:14:28 -0700

Hi,

The problem is by design: Lucene is an inverted index, so lookups can only be 
done by single terms, each of which leads to the documents containing it. To 
execute a range, the query first has to position the terms enum on the first 
term and then iterate over all *terms* in the range (not documents) until the 
last term is reached. If the number of terms in the field is large (because you 
have many distinct values), this takes some time. For every term in the 
enumeration that matches the range, Lucene has to look up all matching 
documents in the posting list and report them as hits (using a bitset). The 
latter (looking up the posting lists) involves lots of work, so ranges with 
thousands of terms will get slow.
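
To make the cost model concrete, here is a minimal sketch in plain Java. It is 
not Lucene code: the TreeMap standing in for the term dictionary, the class 
name RangeScanSketch and the sample terms are all made up for illustration. It 
only mimics the steps described above (seek to the lower bound, walk the terms, 
mark each term's documents in a bitset).

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

public class RangeScanSketch {
    // Hypothetical toy index: sorted term -> posting list of document IDs.
    static TreeMap<String, int[]> sampleIndex() {
        TreeMap<String, int[]> terms = new TreeMap<>();
        terms.put("Apple", new int[] {0, 3});
        terms.put("Banana", new int[] {1});
        terms.put("Mango", new int[] {2, 3});
        terms.put("Zebra", new int[] {4});
        return terms;
    }

    // Open-ended range [lower, +inf): marks matching docs in 'hits' and
    // returns the number of terms that had to be visited.
    static int collect(TreeMap<String, int[]> terms, String lower, BitSet hits) {
        int visited = 0;
        // Seek to the first term >= lower, then walk the terms in sorted order.
        for (Map.Entry<String, int[]> e : terms.tailMap(lower, true).entrySet()) {
            visited++;
            // Every visited term costs one pass over its posting list.
            for (int doc : e.getValue()) {
                hits.set(doc);
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        TreeMap<String, int[]> terms = sampleIndex();
        for (String lower : new String[] {"A", "Z"}) {
            BitSet hits = new BitSet();
            int visited = collect(terms, lower, hits);
            System.out.println(lower + ": terms visited=" + visited
                    + ", docs matched=" + hits.cardinality());
        }
    }
}
```

In this toy model the work is done before any "top 50" cut-off could apply: 
the whole range is walked to fill the bitset, so the range starting at "A" 
touches more terms and postings than the one starting at "Z".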


So the time depends on how many terms in your term dictionary lie between the 
lower bound and the upper bound of your range, not really on the size of the 
index (although the two are quite often directly related).

If you want faster range queries, consider NumericRangeQuery, because it has 
some optimizations at the cost of a larger index. But if you are stuck with 
text, you may also look at FieldCacheRangeFilter (which only works for 
untokenized fields, but I assume from your example that "title" is not 
tokenized).
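
As a rough illustration of why a field-cache based filter needs an untokenized 
field, here is a simplified model in plain Java. It is an assumption about the 
general idea (one cached value per document, compared by its sorted ordinal), 
not Lucene's actual implementation; the class OrdinalRangeSketch and its data 
are hypothetical.

```java
import java.util.Arrays;
import java.util.BitSet;

public class OrdinalRangeSketch {
    // sortedValues: the distinct field values in sorted order.
    // docToOrd[doc]: index into sortedValues of that document's single value.
    // Returns the documents whose value is >= lower.
    static BitSet filter(String[] sortedValues, int[] docToOrd, String lower) {
        int pos = Arrays.binarySearch(sortedValues, lower);
        // Translate the bound into an ordinal once: first value >= lower.
        int lowerOrd = pos >= 0 ? pos : -(pos + 1);
        BitSet hits = new BitSet(docToOrd.length);
        for (int doc = 0; doc < docToOrd.length; doc++) {
            // One integer comparison per document, no posting lists involved.
            if (docToOrd[doc] >= lowerOrd) {
                hits.set(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] sortedValues = {"Apple", "Banana", "Mango", "Zebra"};
        int[] docToOrd = {0, 1, 2, 0, 3}; // exactly one value per document
        System.out.println(filter(sortedValues, docToOrd, "M"));
    }
}
```

The sketch depends on every document having exactly one value, which is why 
this kind of approach does not fit a tokenized field with several terms per 
document.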

The order of results of a range query is in "index order", because there is no 
TF-IDF ranking involved (all hits have the same score of 1). Index order means 
the order in which they were indexed.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -----Original Message-----
> From: Aleksey [mailto:bitterc...@gmail.com]
> Sent: Tuesday, May 07, 2013 4:15 AM
> To: java-user@lucene.apache.org
> Subject: TermRangeQuery performance oddness
> 
> Hi guys,
> 
> If I run 2 term range queries:
> 
> new TermRangeQuery("title", new BytesRef("A"), null, true, true); and new
> TermRangeQuery("title", new BytesRef("Z"), null, true, true);
> 
> The one that starts with "Z" is several times faster (I make 1000 queries in a
> loop to measure). I understand that the first one has much larger hit number,
> but if the query is bounded to 50 results, why does that matter?
> At first I thought that it grabs all hits and sorts them, but then it doesn't seem
> to make any difference whether or not I pass sort by "title" parameter to the
> searcher. Results are either sorted or kind of random, but speed is the same.
> Why is that?
> 
> Thank you in advance,
> 
> Aleksey
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org

