cpoerschke commented on a change in pull request #151: URL: https://github.com/apache/solr/pull/151#discussion_r641971468
########## File path: solr/core/src/java/org/apache/solr/handler/component/QueryComponent.java ########## @@ -999,20 +1013,34 @@ protected void mergeIds(ResponseBuilder rb, ShardRequest sreq) { shardDoc.sortFieldValues = unmarshalledSortFieldValues; - queue.insertWithOverflow(shardDoc); + if(reRankQueue != null && docCounter++ <= reRankDocsSize) { + ShardDoc droppedShardDoc = reRankQueue.insertWithOverflow(shardDoc); + // FIXME: Only works if the original request does not sort by score Review comment: > The current solution only works if the original sort did not sort by score. ... > ... We have no access to the score before reRanking at this point and I currently see no possibility to retrieve it again. ... I had some more ideas around this, it's a semi-structured trail of thoughts, but sharing in that form in the hope that it might be useful and/or trigger further ideas and thoughts. ---- So as indicated by the `// FIXME: Only works if the original request does not sort by score` comment location, the problem happens when a document drops out of `reRankQueue` (where it's compared against other reranked documents via the reranked score) and joins `queue` (where it needs to be compared against other (reranked or unreranked) documents via the sort clause). One way to avoid this is to ensure dropping out never happens but that is a change to the overall reranking logic and so for now here let's assume that is not the logic we want ... OriginalScoreFeature via fl=[fv] providing an original score is too late timing wise since it happens during "get fields" after the "merge ids". ShardParams.DISTRIB_SINGLE_PASS "distrib.singlePass" could be used to bring timing forward but that would have other performance etc. implications. Implying "distrib.singlePass=true" and implying fl=[fv] and an OriginalScoreFeature that might not have been there in the model or feature store would also be messy both on shard and federator code level. All that aside, this a reranking problem generally i.e. not an LTR reranking problem specifically i.e. OriginalScoreFeature and fl=[fv] is too specific a solution. So let's leave the merging logic here alone for a moment and go to the shard code and determine where exactly the "original_score" information is lost. If we find where it's lost then we can preserve it there and pass it along via the code path that the "score" information takes and eventually it would reach the merging logic here, at least that's the theory. ---- The score changes during rescoring. In the case of LTR rescoring https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.8.2/solr/contrib/ltr/src/java/org/apache/solr/ltr/LTRRescorer.java#L121-L131 allocates `reranked` which is then filled from `firstPassResults` i.e. structurally this is where the information loss happens. For non-LTR rescoring something equivalent presumably happens elsewhere. For thinking purposes let's assume that we could add an `original_score` field to `ScoreDoc` itself i.e. at line 210 we could preserve the score: ``` scorer.getDocInfo().setOriginalDocScore(hit.score); + hit.original_score = hit.score; hit.score = scorer.score(); ``` ---- Next let's explore how a distributed query's scores get from the shard to the federator. (Shoutout to Hoss' "Lifecycle of a Solr Search Request" talk at this point e.g. http://people.apache.org/~hossman/rev2017/ slides.) Somewhat jumping into the middle of the code QueryComponent [L368](https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.8.2/solr/core/src/java/org/apache/solr/handler/component/QueryComponent.java#L368) and [L1512](https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.8.2/solr/core/src/java/org/apache/solr/handler/component/QueryComponent.java#L1512) ``` QueryResult result = new QueryResult(); ... rb.setResult(result); ``` `QueryResult` contains `DocListAndSet` which contains `DocList` and `DocSet`. `DocSlice` implements `DocList` and one of the places (there might be others) where `ScoreDoc` turns into `DocSlice` is at https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.8.2/solr/core/src/java/org/apache/solr/search/SolrIndexSearcher.java#L1623 ---- So for thinking purposes we added `original_score` field to `ScoreDoc` already and `DocSlice` contains an optional `scores` list and so let's imagine adding an optional `original_scores` list alongside it: https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.8.2/solr/core/src/java/org/apache/solr/search/DocSlice.java#L39 ``` public class DocSlice extends DocSetBase implements DocList { ... final int[] docs; // a slice of documents (docs 0-100 of the query) final float[] scores; // optional score list + final float[] original_scores; // optional original_score list final long matches; final TotalHits.Relation matchesRelation; ... ``` ---- Now then though, how will that new original score list get "transported" from shard to federator? Using an unusual(?) trick here: instead of navigating the code via DocSlice.scores we could navigate via DocSlice.matchesRelation on the assumption that it leads to the same area but more easily. One of the areas is https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.8.2/solr/core/src/java/org/apache/solr/response/TextResponseWriter.java#L189-L197 and there the `res.getReturnFields()` leads to https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.8.2/solr/core/src/java/org/apache/solr/search/SolrReturnFields.java#L51-L53 ``` public class SolrReturnFields extends ReturnFields { // Special Field Keys public static final String SCORE = "score"; ``` and the speculation that perhaps a SolrReturnFields.ORIGINAL_SCORE might be needed. ---- That's where the trail of thoughts led me. Of course, it is too early to reach any conclusions here but going intuitively and based on the code trails above it seems to me that getting access to the original_score is feasible but would be quite an undertaking and so before investing effort in that direction it could be helpful to first further consider and firmly rule out any overall reranking logic that requires only the already available information e.g. for shards to use `reRankDocs` and for the federator whilst combining results to _not_ use `reRankDocs` but to combine all shards' reranked documents separately in a `reRankQueue` that has the same sizing as `queue` (and then documents dropping out of `reRankQueue` need not be added to `queue` since them dropping out means there's already sufficient documents in `reRankQueue` to serve the request). I hope that is not too opinionated a way of sharing one perspective and that some bits of it are useful? This distributed reranking stuff is an intriguing problem! :-) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org