Hi Robert - yeah this is a complex subject. I think there's room for some
exciting improvement though. There is some discussion in LUCENE-8675 that
is pointing out some potential API-level problems we may have to address in
order to make the most efficient use of the segment structure in a sorted
index. I think generally speaking we're trying to think of ways to search
partial segments. I don't have concrete proposals for API changes at this
point, but it's clear there are some hurdles to be grappled with. For
example, Adrien's point about BKD trees having a high up-front cost points
out one difficulty. If we want to search a single segment in multiple "work
units" (whether threaded or not), we want a way to share that up-front cost
without needing to pay it over again for each work unit. I think a similar
problem also occurs with some other query types (MultiTerm can produce a
bitset I believe?).

As far as the specific (prorated early termination) proposal here .. this
is something very specific and localized within TopFieldCollector that
doesn't require any public-facing API change or refactoring at all. It just
terminates a little earlier based on the segment distribution. Here's a PR
so you can see what this is: https://github.com/apache/lucene-solr/pull/564


On Mon, Feb 4, 2019 at 8:44 AM Robert Muir <rcm...@gmail.com> wrote:

> Regarding adding a threshold to TopFieldCollector, do you have ideas
> on what it would take to fix the relevant collector/indexsearcher APIs
> to make this kind of thing easier? (i know this is a doozie, but we
> should at least try to think about it, maybe make some progress)
>
> I can see where things become less efficient in this parallel+sorted
> case with large top N, but there are also many other "top k
> algorithms" that could be better for different use cases. in your
> case, if you throw out the parallel and just think about doing your
> sorted case segment-by-segment, the current code there may be
> inefficient too (not as bad, but still doesn't really take total
> advantage of sortedness). Maybe we improve that case by scoring some
> initial "range" of docs for each/some segments first, and then handle
> any "tail". With a simple google search I easily find many ideas for
> how this logic could work: exact and inexact, sorted and unsorted,
> distributed (parallel) and sequential.  So I think there are probably
> other improvements that could be done here, but worry about what the
> code might look like if we don't refactor it.
>
> On Sun, Feb 3, 2019 at 3:14 PM Michael McCandless
> <luc...@mikemccandless.com> wrote:
> >
> > On Sun, Feb 3, 2019 at 10:41 AM Michael Sokolov <msoko...@gmail.com>
> wrote:
> >
> >  > > In single-threaded mode we can check against minCompetitiveScore and
> > > terminate collection for each segment appropriately,
> > >
> > > > Does Lucene do this today by default?  That should be a nice
> > > optimization,
> > > and it'd be safe/correct.
> > >
> > > Yes, it does that today (in TopFieldCollector -- see
> > >
> > >
> https://github.com/msokolov/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/TopFieldCollector.java#L225
> > > )
> > >
> >
> > Ahh -- great, thanks for finding that.
> >
> >
> > > Re: our high cost of collection in static ranking phase -- that is
> true,
> > > Mike, but I do also see a nice improvement on the luceneutil benchmark
> > > (modified to have a sorted index and collect concurrently) using just a
> > > vanilla TopFieldCollector. I looked at some profiler output, and it
> just
> > > seems to be showing more time spent walking postings.
> > >
> >
> > Yeah, understood -- I think pro-rating the N collected per segment makes
> a
> > lot of sense.
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

Reply via email to