Hi Robert - yeah this is a complex subject. I think there's room for some exciting improvement though. There is some discussion in LUCENE-8675 that is pointing out some potential API-level problems we may have to address in order to make the most efficient use of the segment structure in a sorted index. I think generally speaking we're trying to think of ways to search partial segments. I don't have concrete proposals for API changes at this point, but it's clear there are some hurdles to be grappled with. For example, Adrien's point about BKD trees having a high up-front cost points out one difficulty. If we want to search a single segment in multiple "work units" (whether threaded or not), we want a way to share that up-front cost without needing to pay it over again for each work unit. I think a similar problem also occurs with some other query types (MultiTerm can produce a bitset I believe?).
As far as the specific (prorated early termination) proposal here .. this is something very specific and localized within TopFieldCollector that doesn't require any public-facing API change or refactoring at all. It just terminates a little earlier based on the segment distribution. Here's a PR so you can see what this is: https://github.com/apache/lucene-solr/pull/564 On Mon, Feb 4, 2019 at 8:44 AM Robert Muir <rcm...@gmail.com> wrote: > Regarding adding a threshold to TopFieldCollector, do you have ideas > on what it would take to fix the relevant collector/indexsearcher APIs > to make this kind of thing easier? (i know this is a doozie, but we > should at least try to think about it, maybe make some progress) > > I can see where things become less efficient in this parallel+sorted > case with large top N, but there are also many other "top k > algorithms" that could be better for different use cases. in your > case, if you throw out the parallel and just think about doing your > sorted case segment-by-segment, the current code there may be > inefficient too (not as bad, but still doesn't really take total > advantage of sortedness). Maybe we improve that case by scoring some > initial "range" of docs for each/some segments first, and then handle > any "tail". With a simple google search I easily find many ideas for > how this logic could work: exact and inexact, sorted and unsorted, > distributed (parallel) and sequential. So I think there are probably > other improvements that could be done here, but worry about what the > code might look like if we don't refactor it. > > On Sun, Feb 3, 2019 at 3:14 PM Michael McCandless > <luc...@mikemccandless.com> wrote: > > > > On Sun, Feb 3, 2019 at 10:41 AM Michael Sokolov <msoko...@gmail.com> > wrote: > > > > > > In single-threaded mode we can check against minCompetitiveScore and > > > terminate collection for each segment appropriately, > > > > > > > Does Lucene do this today by default? That should be a nice > > > optimization, > > > and it'd be safe/correct. > > > > > > Yes, it does that today (in TopFieldCollector -- see > > > > > > > https://github.com/msokolov/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/TopFieldCollector.java#L225 > > > ) > > > > > > > Ahh -- great, thanks for finding that. > > > > > > > Re: our high cost of collection in static ranking phase -- that is > true, > > > Mike, but I do also see a nice improvement on the luceneutil benchmark > > > (modified to have a sorted index and collect concurrently) using just a > > > vanilla TopFieldCollector. I looked at some profiler output, and it > just > > > seems to be showing more time spent walking postings. > > > > > > > Yeah, understood -- I think pro-rating the N collected per segment makes > a > > lot of sense. > > > > Mike McCandless > > > > http://blog.mikemccandless.com > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >