cpoerschke commented on a change in pull request #151: URL: https://github.com/apache/solr/pull/151#discussion_r642675160
########## File path: solr/core/src/java/org/apache/solr/handler/component/QueryComponent.java ########## @@ -999,20 +1013,34 @@ protected void mergeIds(ResponseBuilder rb, ShardRequest sreq) { shardDoc.sortFieldValues = unmarshalledSortFieldValues; - queue.insertWithOverflow(shardDoc); + if(reRankQueue != null && docCounter++ <= reRankDocsSize) { + ShardDoc droppedShardDoc = reRankQueue.insertWithOverflow(shardDoc); + // FIXME: Only works if the original request does not sort by score Review comment: > ... but that would entail a PR to Lucene and we would have to wait for a newer version ... Maybe, maybe not. In terms of thinking about the code it is perhaps easiest to imagine `ScoreDoc` having an additional `original_score` element but in terms of writing the code it might not necessarily work out that way e.g. instead imagine within Solr ``` public class ScoreDocWithOriginalScore extends org.apache.lucene.search.ScoreDoc { public float original_score; ... } ``` or something similar and for that class to be used if and only if needed thus reducing impact of the changes on code paths that don't need the change. But still even then there'd be quite a bit of Solr code to cover to build confidence that the original score makes it to all the right places and isn't somehow lost somewhere. > ... to collect the returned documents before reranking in one queue (queue A) and the reranked documents in another queue (queue B). ... > ... But the reranking happens on the shards and we are on the coordinator (in mergeIds), so we only have the already "mixed" (reranked and nor reranked) results from the shards. ... Yes, the idea was to collect reranked and not-reranked documents in separate non-overlapping queues and it _assumed_ that within our "mixed" results we _can already tell apart_ the reranked and not-reranked results i.e. within each shard's response the first `reRankDocs` documents will be reranked and documents (if any) after that are not-reranked i.e. the `i < reRankDocsSize` condition. I have not verified if that assumption is accurate but here's a short example to explore the idea, noting that in the dual sharded collection the client application chooses a halfed reRankDocs on the understanding that the collection has two shards. ``` single sharded collection shard1: a b c d e f g h i j x y z # without reranking shard1: J I H G F E D C B A x y z # with reRankDocs=10 (assuming reranking reverses order of reranked docs and uppercases them) overall results: J I H G F E D C B A x y z ``` ``` dual sharded collection shard1: a c e g i y # without reranking shard2: b d f h j x z # without reranking shard1: I G E C A y # with reRankDocs=5 (assuming reranking reverses order of reranked docs and uppercases them) shard2: J H F D B x z # with reRankDocs=5 (assuming reranking reverses order of reranked docs and uppercases them) reRankedQueue: J I H G F E D C B A queue: x y z overall results: J I H G F E D C B A x y z ``` > ... What do you think about excluding ... to another PR? ... people with custom sorting that could already benefit from the first part ... Yes, that possibility had occurred to me too :-) In principle I think a fix for half of the issue could be better than no fix: * if it can be clearly documented/communicated what is fixed and what is not fixed * if existing behaviour is maintained for the 'not (yet) fixed' scenario * if we anticipate that the fix for the second half is not going to break/change things for anyone who benefitted from the first fix In terms of the specifics here, so the "fixed" condition would be that "sort" did not use "score" in any way i.e. `sort=popularity desc` is "fixed" but both `sort=score desc` and `sort=popularity desc, score desc` is "not fixed" as yet. I wonder if `QueryComponent` and `SortedHitQueueManager` could accommodate that logic so that we would maintain existing behaviour for the not-yet-fixed use cases? > ... If I did not understand your last idea correctly, please let me know. ... Thanks for identifying which bits were and weren't clear! I've tried to clarify with the example and making the key _assumption_ explicit. Please let me know if that was useful or not. I think that approach would solve both the without-score and the with-score scenario. > ... And please be aware that I do not want to create pressure by suggesting the exclusion of the fix from this PR. :) I did not perceive any pressure here at all but appreciate your openness and clarity w.r.t. no pressure being intended, thank you. :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org