Hello Mike,

Thank you for the explanation!
I created a Jira issue: LUCENE-7861

Best regards,
Christoph

On 25.05.2017 at 16:11, Michael McCandless wrote:
Yes, there is a (hidden) assumption in TopDocs.merge that the hits it's
merging are logically non-overlapping, sequential slices of the index, but
in your case they are "interleaved".

TopDocs.merge doesn't otherwise trust the incoming docIDs to be from the
same docID space, even though in your case they are.

Maybe we could improve TopDocs.merge to optionally use the already global
docID for tie breaking?

Yes, please open an issue.  Maybe we just improve the javadocs as you
suggested, but the situation sure is trappy today.

Thanks,

Mike McCandless

http://blog.mikemccandless.com

On Wed, May 24, 2017 at 10:06 AM, Christoph Kaser <lucene_l...@iconparc.de>
wrote:

Hello everybody,

I have observed an unexpected behavior in Lucene, and I am unsure whether
this is a bug, or a missing warning in the documentation:

I am using the IndexSearcher with an ExecutorService in order to take
advantage of multiple CPU cores during the searches. I want to limit the
number of cores a single search can occupy, so I have overridden the
IndexSearcher method
     protected LeafSlice[] slices(List<LeafReaderContext> leaves)
to return a fixed number of slices (e.g. 4).

I tried to create slices that are about the same size by looping over the
leaves (ordered by size descending) and adding the current leaf to the
slice with the smallest number of documents.
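
A minimal sketch of the balancing step described above, in plain Java without any Lucene types. The class name SliceBalancer and the representation of leaves as an array of document counts are assumptions made for illustration; this is not the code from the original message.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SliceBalancer {
    /**
     * Greedy balancing: visit leaves from largest to smallest and put each one
     * into the slice that currently holds the fewest documents.
     * Returns, for each slice, the indices of the leaves assigned to it.
     */
    public static List<List<Integer>> assign(int[] leafDocCounts, int numSlices) {
        Integer[] order = new Integer[leafDocCounts.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Largest leaves first.
        Arrays.sort(order, Comparator.comparingInt((Integer i) -> leafDocCounts[i]).reversed());

        List<List<Integer>> slices = new ArrayList<>();
        for (int s = 0; s < numSlices; s++) {
            slices.add(new ArrayList<>());
        }
        long[] sliceDocs = new long[numSlices];

        for (int leaf : order) {
            // Find the slice with the smallest document count so far.
            int smallest = 0;
            for (int s = 1; s < numSlices; s++) {
                if (sliceDocs[s] < sliceDocs[smallest]) {
                    smallest = s;
                }
            }
            slices.get(smallest).add(leaf);
            sliceDocs[smallest] += leafDocCounts[leaf];
        }
        return slices;
    }
}
```

For leaf sizes {100, 90, 50, 40, 10} and two slices, this assigns leaves 0, 3 and 4 to the first slice and leaves 1 and 2 to the second, so the slices are not consecutive runs of the leaves list.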

This worked well, until I stumbled upon a query for which searchAfter
seemed to skip hits, so that the total number of hits obtained by multiple
calls to searchAfter was lower than TopDocs.totalHits.

The issue seems to be how searchAfter works vs how TopDocs.merge works:

searchAfter skips every document with a higher score than the "after"
document. In case of equal scores, it uses the document id and skips every
document with a <= document id (see PagingFieldCollector).

TopDocs.merge uses the score to determine which hits should be part of the
merged TopDocs. In case of equal scores, it uses the shard index (this
corresponds to the slices the IndexSearcher uses) to break ties (see
ScoreMergeSortQueue.lessThan)

So if the shards are non-contiguous (as they are in my case), searchAfter
uses a different way of sorting the documents than TopDocs.merge, and
therefore hits are skipped.
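
The mismatch can be reproduced with a small simulation in plain Java. This is a simplified model under stated assumptions, not Lucene code: the class TieBreakDemo, its Hit type and both methods are hypothetical. The merge order is modelled as score descending, then shard index, then doc id; the searchAfter filter is modelled as "lower score, or equal score and greater doc id".

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TieBreakDemo {
    /** Hypothetical stand-in for a scored hit: global doc id plus the slice that found it. */
    public static final class Hit {
        final double score;
        final int doc;
        final int shard;

        Hit(double score, int doc, int shard) {
            this.score = score;
            this.doc = doc;
            this.shard = shard;
        }
    }

    /** Builds hits that all share one score; docsPerShard[s] lists the doc ids in slice s. */
    public static List<Hit> equalScoreHits(int[][] docsPerShard) {
        List<Hit> hits = new ArrayList<>();
        for (int shard = 0; shard < docsPerShard.length; shard++) {
            for (int doc : docsPerShard[shard]) {
                hits.add(new Hit(1.0, doc, shard));
            }
        }
        return hits;
    }

    /** Pages through the hits and returns every doc id that appeared on some page. */
    public static List<Integer> pageThrough(List<Hit> hits, int pageSize) {
        // Modelled merge order: score descending, ties broken by shard index, then doc id.
        Comparator<Hit> mergeOrder = Comparator
            .comparingDouble((Hit h) -> -h.score)
            .thenComparingInt(h -> h.shard)
            .thenComparingInt(h -> h.doc);

        List<Integer> seen = new ArrayList<>();
        Hit after = null;
        while (true) {
            // Modelled searchAfter filter: keep only hits that sort after the "after"
            // hit by score, with ties broken by doc id.
            List<Hit> candidates = new ArrayList<>();
            for (Hit h : hits) {
                if (after == null
                        || h.score < after.score
                        || (h.score == after.score && h.doc > after.doc)) {
                    candidates.add(h);
                }
            }
            if (candidates.isEmpty()) {
                break;
            }
            candidates.sort(mergeOrder);
            List<Hit> page = candidates.subList(0, Math.min(pageSize, candidates.size()));
            for (Hit h : page) {
                seen.add(h.doc);
            }
            after = page.get(page.size() - 1);
        }
        return seen;
    }
}
```

With interleaved slices {0, 3} and {1, 2} and a page size of 2, the first page is [0, 3]; the next call then only accepts doc ids greater than 3, so docs 1 and 2 never appear. With contiguous slices {0, 1} and {2, 3}, all four docs are returned.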

Here are my questions:

* Are slices meant to be contiguous "sublists" of the passed leaves-list?
Or is my way of slicing meant to be supported?
* If my way of slicing is not supported, could you either add a warning to
the javadocs of the slices method or maybe even add a check for a legal
return value of slices()?
* Should I create a jira issue for this?

Sorry for the wall of text, I hope I explained the problem in an
understandable way!

Thank you and best regards
Christoph



