RE: Clarification regarding BlockTree implementation of IntersectTermsEnum

Uwe Schindler Mon, 01 Apr 2019 09:53:54 -0700

Hi,

In fact, I was also wondering why TermRangeQuery is now a subclass of 
AutomatonQuery, but this was changed for the reasons that Robert mentioned in 
his e-mail (see https://issues.apache.org/jira/browse/LUCENE-5879). For sure, the 
easiest approach is to seek to the first term of the range in the enum, then 
iterate over the terms and stop when you reach the last term of the range.
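
As a rough illustration of that seek-then-iterate approach, here is a minimal 
sketch in plain Java. It does not use Lucene's API: a java.util.TreeSet stands 
in for the terms dictionary, and the names RangeScanSketch and termsInRange are 
made up for this example. Note also that String.compareTo orders by UTF-16 code 
units, whereas Lucene orders terms by their bytes, so this is only an analogy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical stand-in: a sorted set plays the role of the terms dictionary.
final class RangeScanSketch {

    // Collects all terms in [lower, upper] by seeking once, then walking forward.
    static List<String> termsInRange(NavigableSet<String> terms, String lower, String upper) {
        List<String> result = new ArrayList<>();
        // "Seek": tailSet positions us at the first term >= lower.
        for (String term : terms.tailSet(lower, true)) {
            // Stop as soon as we move past the upper bound.
            if (term.compareTo(upper) > 0) {
                break;
            }
            result.add(term);
        }
        return result;
    }

    public static void main(String[] args) {
        NavigableSet<String> terms =
            new TreeSet<>(List.of("apple", "banana", "cherry", "date", "fig"));
        System.out.println(termsInRange(terms, "b", "d"));
    }
}
```

The loop does one seek and then touches only the terms inside the range, which 
is what makes this the simplest possible "source of terms".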


The problem with TermRangeQuery is actually not the iteration over the term 
index. The slowness comes from the fact that all terms between start and end 
have to be iterated, their postings fetched, and those postings merged 
together. Whether the "source of terms" is a simple linear iteration over all 
terms from start to end or the automaton intersection does not really matter 
for query execution. The change to prefer the automaton over a simple term 
iteration was made only to allow further optimizations; for more info see 
https://issues.apache.org/jira/browse/LUCENE-5879
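
To make the cost argument concrete, here is a toy sketch in plain Java. It is 
not Lucene code: the "index" is a hypothetical TreeMap from term to an array of 
document IDs, and PostingsMergeSketch, matchingDocs and postingsVisited are 
names invented for this example. It only shows that the work grows with the 
total number of postings in the range, however the terms were enumerated.

```java
import java.util.BitSet;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical toy index: term -> document IDs containing that term.
final class PostingsMergeSketch {

    // Merges (ORs) the postings of every term in [lower, upper] into one bit set.
    static BitSet matchingDocs(NavigableMap<String, int[]> index, String lower, String upper) {
        BitSet docs = new BitSet();
        for (int[] postings : index.subMap(lower, true, upper, true).values()) {
            for (int doc : postings) {
                docs.set(doc);
            }
        }
        return docs;
    }

    // Counts how many postings entries have to be read for the same range.
    static long postingsVisited(NavigableMap<String, int[]> index, String lower, String upper) {
        long visited = 0;
        for (int[] postings : index.subMap(lower, true, upper, true).values()) {
            visited += postings.length;
        }
        return visited;
    }

    public static void main(String[] args) {
        NavigableMap<String, int[]> index = new TreeMap<>();
        index.put("apple", new int[] {1, 4});
        index.put("banana", new int[] {2, 4, 9});
        index.put("cherry", new int[] {3});
        index.put("date", new int[] {4, 7});
        System.out.println(matchingDocs(index, "b", "d"));
        System.out.println(postingsVisited(index, "b", "d"));
    }
}
```

A wide range means many terms and many postings to read and merge, and that 
part stays the same no matter which enum produced the terms.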

Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Robert Muir <rcm...@gmail.com>
> Sent: Monday, April 1, 2019 6:30 PM
> To: java-user <java-user@lucene.apache.org>
> Subject: Re: Clarification regarding BlockTree implementation of
> IntersectTermsEnum
> 
> The regular TermsEnum is really designed for walking terms in linear order.
> It does have some ability to seek/leapfrog, but this means that paths in a
> query automaton that match no terms result in a wasted seek and CPU, because
> the API is designed to return the next term after the seek target regardless.
> 
> On the other hand, intersect() is for intersecting two automata: query
> and index. Presumably it can also remove more inefficiencies than just the
> wasted seeks for complex wildcards, fuzzies, and so on, since it can
> "see" the whole input as an automaton. So, for example, it might be able to
> work on blocks of terms at a time.
> 
> On Mon, Apr 1, 2019, 12:17 PM Stamatis Zampetakis <zabe...@gmail.com>
> wrote:
> 
> > Yes it is used.
> >
> > I think there are simpler and possibly more efficient ways to implement a
> > TermRangeQuery and that is why I am looking into this.
> > But I am also curious to understand what IntersectTermsEnum is supposed to
> > do.
> >
> > On Mon, Apr 1, 2019 at 5:34 PM, Robert Muir <rcm...@gmail.com>
> > wrote:
> >
> > > Is this IntersectTermsEnum really being used for term range query? Seems
> > > like using a standard TermsEnum, seeking to the start of the range, then
> > > calling next until the end would be easier.
> > >
> > > On Mon, Apr 1, 2019, 10:05 AM Stamatis Zampetakis <zabe...@gmail.com>
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > I am currently working on improving the performance of range queries on
> > > > strings. I've noticed that using TermRangeQuery with low-selectivity
> > > > queries is a very bad idea in terms of performance, but I cannot clearly
> > > > explain why, since it seems related to how the IntersectTermsEnum#next
> > > > method is implemented.
> > > >
> > > > The Javadoc of the class says that the terms index (the burst-trie
> > > > data structure) is not used by this implementation of TermsEnum. However,
> > > > when I look at the implementation of the next method, I get the
> > > > impression that this is not accurate. Aren't we using the trie structure
> > > > to skip parts of the data when the automaton states do not match?
> > > >
> > > > Can somebody provide a high-level intuition of what
> > > > IntersectTermsEnum#next does? Initially, I thought that it is traversing
> > > > the whole trie structure (skipping some branches when necessary), but I
> > > > may be wrong.
> > > >
> > > > Thanks in advance,
> > > > Stamatis
> > > >
> > >
> >


