[ 
https://issues.apache.org/jira/browse/LUCENE-8829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863075#comment-16863075
 ] 

Atri Sharma commented on LUCENE-8829:
-------------------------------------

bq. I am not sure we have to. Can't a user initialize it ahead of time if 
necessary? I think if it's necessary to have this we can just iterate over it 
and set it from the outside.

The merge API today lets users indicate whether they want the shardIndex in 
hits to be recorded during the merging process. If they do not, the invariant 
is that the shard indices must already have been set externally by the user 
before calling merge. If the user neither asks for shard indices to be set nor 
sets them externally, we throw an error because shard indices are needed for 
tie breaking.
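
To make that contract concrete, here is a minimal sketch. It is not Lucene's 
code: Hit, merge and the -1 "unset" marker are hypothetical stand-ins I made 
up for illustration, and the real ScoreDoc/TopDocs classes differ in detail.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-ins for illustration only; NOT Lucene's ScoreDoc or TopDocs#merge.
class MergeContractSketch {
    static final class Hit {
        final int doc;
        final float score;
        int shardIndex; // assumption: -1 means "not set"

        Hit(int doc, float score, int shardIndex) {
            this.doc = doc;
            this.score = score;
            this.shardIndex = shardIndex;
        }
    }

    static List<Hit> merge(int topN, List<List<Hit>> shards, boolean setShardIndex) {
        List<Hit> all = new ArrayList<>();
        for (int shard = 0; shard < shards.size(); shard++) {
            for (Hit hit : shards.get(shard)) {
                if (setShardIndex) {
                    // Record the ordinal of the shard that produced the hit.
                    hit.shardIndex = shard;
                } else if (hit.shardIndex == -1) {
                    // The caller promised to set shard indices externally but did not.
                    throw new IllegalArgumentException(
                        "shardIndex is unset and setShardIndex is false");
                }
                all.add(hit);
            }
        }
        // Higher score first; equal scores are ordered by shardIndex.
        Comparator<Hit> byScoreDesc = (a, b) -> Float.compare(b.score, a.score);
        all.sort(byScoreDesc.thenComparingInt(h -> h.shardIndex));
        return new ArrayList<>(all.subList(0, Math.min(topN, all.size())));
    }
}
```

In this sketch the error path exists only because the tie breaker reads 
shardIndex; a tie breaker that never reads it would not need the check.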

After the introduction of the functional interface, that invariant no longer 
holds, i.e. we no longer need shard indices to always be present (in the case 
of tie breaking by docIDs). However, the user might still want us to record 
the shard indices during merge, so if we remove setShardIndex completely, we 
remove this functionality. I am not saying that we should not do it, I just 
want to make sure we call out all the tradeoffs. WDYT?
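
As a rough illustration of why the tie breaker matters, the sketch below 
merges the same three equal-scoring hits once from a single collector and once 
from two collectors. Everything in it is hypothetical (Hit, mergedDocs and the 
two comparators are names invented for this example, not Lucene APIs), and it 
assumes docIDs are comparable across collectors, as they are for slices of one 
searcher.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-ins for illustration only; NOT Lucene's ScoreDoc or TopDocs#merge.
class TieBreakSketch {
    static final class Hit {
        final int doc;
        final float score;
        final int shardIndex;

        Hit(int doc, float score, int shardIndex) {
            this.doc = doc;
            this.score = score;
            this.shardIndex = shardIndex;
        }
    }

    // Two pluggable tie breakers; Comparator is itself a functional interface.
    static final Comparator<Hit> BY_SHARD_INDEX = Comparator.comparingInt((Hit h) -> h.shardIndex);
    static final Comparator<Hit> BY_DOC_ID = Comparator.comparingInt((Hit h) -> h.doc);

    // docsPerCollector[c] holds the docIDs collected by collector c, in collection order.
    static List<Integer> mergedDocs(int[][] docsPerCollector, Comparator<Hit> tieBreaker) {
        List<Hit> all = new ArrayList<>();
        for (int c = 0; c < docsPerCollector.length; c++) {
            for (int doc : docsPerCollector[c]) {
                // Every hit gets the same score, so the tie breaker alone decides the order.
                all.add(new Hit(doc, 1.0f, c));
            }
        }
        Comparator<Hit> byScoreDesc = (a, b) -> Float.compare(b.score, a.score);
        all.sort(byScoreDesc.thenComparing(tieBreaker));
        List<Integer> docs = new ArrayList<>();
        for (Hit h : all) {
            docs.add(h.doc);
        }
        return docs;
    }
}
```

With one collector holding docs {1, 4, 9}, both tie breakers give 1, 4, 9. 
With two collectors holding {4, 9} and {1}, BY_SHARD_INDEX gives 4, 9, 1 while 
BY_DOC_ID still gives 1, 4, 9, and the docID variant never reads shardIndex.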

P.S.: I opened LUCENE-8857 to build upon your proposed approach and extend it 
to support user-specified custom tie breakers. Please share your thoughts.

> TopDocs#Merge is Tightly Coupled To Number Of Collectors Involved
> -----------------------------------------------------------------
>
>                 Key: LUCENE-8829
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8829
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Atri Sharma
>            Priority: Major
>         Attachments: LUCENE-8829.patch, LUCENE-8829.patch, LUCENE-8829.patch, 
> LUCENE-8829.patch
>
>
> While investigating LUCENE-8819, I found that the order of results from 
> TopDocs#merge depends indirectly on the number of collectors involved in the 
> merge. This is troubling because 1) the number of collectors involved in a 
> merge is cost based and depends directly on the number of slices created 
> for the parallel searcher case, and 2) the TopN hits code path invokes merge 
> with a single Collector. So running the same TopN query with a 
> single-threaded searcher and with a parallel searcher can return results in 
> a different order, which breaks the invariant that the same query yields 
> the same ordering.
>  
> This happens because of the subtle way TopDocs#merge sets shardIndex on each 
> ScoreDoc while populating the priority queue used for merging. shardIndex is 
> essentially set to the ordinal of the collector that generated the hit. This 
> means that shardIndex depends on the number of collectors, even for the same 
> set of hits.
>  
> When no sort order is specified, shardIndex is used for tie breaking when 
> scores are equal. This translates to different orders for the same hits when 
> their shardIndices differ.
>  
> I propose that we remove shardIndex from the default tie breaking mechanism 
> and replace it with docID. DocID order is the de facto order expected 
> during collection, so it might make sense to use the same criterion for tie 
> breaking when scores are equal.
>  
> CC: [~ivera]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
