[jira] [Updated] (SOLR-17976) Solr 9.5 distributed search tie breaking logic is non-deterministic

Yue Yu (Jira) Wed, 22 Oct 2025 11:08:53 -0700


     [ https://issues.apache.org/jira/browse/SOLR-17976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Yue Yu updated SOLR-17976:
--------------------------
    Description: 
In the mergeIds function of QueryComponent, the ShardFieldSortedHitQueue heap 
is used to order ShardDoc instances. The ordering is decided by its *lessThan* 
function:

 
{code:java}
protected boolean lessThan(ShardDoc docA, ShardDoc docB) {
  // If these docs are from the same shard, then the relative order
  // is how they appeared in the response from that shard.
  if (Objects.equals(docA.shard, docB.shard)) {
    // if docA has a smaller position, it should be "larger" so it
    // comes before docB.
    // This will handle sorting by docid within the same shard

    // comment this out to test comparators.
    return !(docA.orderInShard < docB.orderInShard);
  }

  // run comparators
  final int n = comparators.length;
  int c = 0;
  for (int i = 0; i < n && c == 0; i++) {
    c =
        (fields[i].getReverse())
            ? comparators[i].compare(docB, docA)
            : comparators[i].compare(docA, docB);
  }

  // solve tiebreaks by comparing shards (similar to using docid)
  // smaller docid's beat larger ids, so reverse the natural ordering
  if (c == 0) {
    c = -docA.shard.compareTo(docB.shard);
  }

  return c < 0;
}
{code}
The final tie-break compares ShardDoc.shard:
{code:java}
// solve tiebreaks by comparing shards (similar to using docid)
// smaller docid's beat larger ids, so reverse the natural ordering
if (c == 0) {
  c = -docA.shard.compareTo(docB.shard);
}
{code}

Here ShardDoc.shard contains the node IP as well as the shard name, for example: 
[http://127.0.0.1:8983/solr/my_collection_shard1_replica_n1]
Consider this setup: one collection with 2 shards, 2 replicas each, running on 
a 2-node cluster. For the same query, documents may come from either of the 
following core combinations:
 # [http://node1_ip:8983/solr/my_collection_shard1_replica_n1] + [http://node2_ip:8983/solr/my_collection_shard2_replica_n2]
 # [http://node2_ip:8983/solr/my_collection_shard1_replica_n2] + [http://node1_ip:8983/solr/my_collection_shard2_replica_n1]
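
The effect of the two core combinations listed above can be reproduced with a small stand-alone model. This is only a sketch, not Solr code: the Doc class is a simplified stand-in for ShardDoc, the single score comparator (lower score is "less") is an assumption, and the rank helper assumes that a doc which is not "less" than another is ranked ahead of it, as the comments in lessThan suggest.

```java
import java.util.ArrayList;
import java.util.List;

final class ReplicaTieBreakDemo {

  // Simplified stand-in for Solr's ShardDoc; only the fields this demo needs.
  static final class Doc {
    final String id;
    final float score;
    final String shard; // full core URL, as in the example above

    Doc(String id, float score, String shard) {
      this.id = id;
      this.score = score;
      this.shard = shard;
    }
  }

  // Cross-shard part of the quoted lessThan, reduced to one score comparator
  // (assumption: a lower score is "less").
  static boolean lessThan(Doc a, Doc b) {
    int c = Float.compare(a.score, b.score);
    if (c == 0) {
      c = -a.shard.compareTo(b.shard); // tie-break on the full URL string
    }
    return c < 0;
  }

  // Ranks one equal-score doc per shard; docs that are not "less" come first.
  static List<String> rank(String shard1Url, String shard2Url) {
    List<Doc> docs = new ArrayList<>();
    docs.add(new Doc("A", 1.0f, shard1Url)); // doc stored in shard1
    docs.add(new Doc("B", 1.0f, shard2Url)); // doc stored in shard2
    docs.sort((x, y) -> lessThan(x, y) ? 1 : (lessThan(y, x) ? -1 : 0));
    List<String> ids = new ArrayList<>();
    for (Doc d : docs) {
      ids.add(d.id);
    }
    return ids;
  }

  public static void main(String[] args) {
    // Combination 1: shard1 answered by node1, shard2 answered by node2.
    System.out.println(rank(
        "http://node1_ip:8983/solr/my_collection_shard1_replica_n1",
        "http://node2_ip:8983/solr/my_collection_shard2_replica_n2")); // prints [A, B]
    // Combination 2: shard1 answered by node2, shard2 answered by node1.
    System.out.println(rank(
        "http://node2_ip:8983/solr/my_collection_shard1_replica_n2",
        "http://node1_ip:8983/solr/my_collection_shard2_replica_n1")); // prints [B, A]
  }
}
```

The two documents and their scores are identical in both calls; only the replicas that served each shard differ, and that alone swaps the order.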

Hence the same request may produce different document rankings when documents 
from both shards have the same score. This can get worse with more nodes, 
shards, and replicas. 
I'm wondering whether we should use only the shard name for tie-breaking 
instead (without the node IP), if that's possible.
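
One possible shape for such a tie-break, as a rough sketch only: derive a replica-independent key from the core URL and compare that instead of the full string. The logicalShardKey helper is hypothetical, and it assumes core names follow a <collection>_<shard>_replica_<suffix> pattern like the URLs above; none of this is taken from Solr's code, and a real fix might instead obtain the shard name from the shard response or cluster state.

```java
final class ShardNameTieBreakSketch {

  // Hypothetical helper: reduce a core URL to "<collection>_<shard>".
  // Assumes the core name ends with "_replica_<suffix>"; otherwise the
  // core name is returned unchanged.
  static String logicalShardKey(String coreUrl) {
    String trimmed = coreUrl.endsWith("/")
        ? coreUrl.substring(0, coreUrl.length() - 1)
        : coreUrl;
    String core = trimmed.substring(trimmed.lastIndexOf('/') + 1);
    int idx = core.lastIndexOf("_replica_");
    return idx >= 0 ? core.substring(0, idx) : core;
  }

  // Same reversed comparison as the quoted tie-break, but on the logical key,
  // so the result no longer depends on which node or replica answered.
  static int tieBreak(String shardA, String shardB) {
    return -logicalShardKey(shardA).compareTo(logicalShardKey(shardB));
  }

  public static void main(String[] args) {
    int combo1 = tieBreak(
        "http://node1_ip:8983/solr/my_collection_shard1_replica_n1",
        "http://node2_ip:8983/solr/my_collection_shard2_replica_n2");
    int combo2 = tieBreak(
        "http://node2_ip:8983/solr/my_collection_shard1_replica_n2",
        "http://node1_ip:8983/solr/my_collection_shard2_replica_n1");
    // Both combinations now yield the same sign.
    System.out.println(Integer.signum(combo1) == Integer.signum(combo2)); // prints true
  }
}
```

Parsing the URL is used here only to keep the example self-contained; it would break for custom core names that do not follow the assumed pattern.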


> Solr 9.5 distributed search tie breaking logic is non-deterministic
> -------------------------------------------------------------------
>
>                 Key: SOLR-17976
>                 URL: https://issues.apache.org/jira/browse/SOLR-17976
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>            Reporter: Yue Yu
>            Priority: Major



