When calculating the IDF (inverse document frequency) component of a BM25
score, Apache Lucene's BM25Similarity uses:

org.apache.lucene.search.similarities.BM25Similarity#idfExplain(org.apache.lucene.search.CollectionStatistics,
org.apache.lucene.search.TermStatistics)
*Note that CollectionStatistics.docCount() is used instead of
IndexReader#numDocs() because also TermStatistics.docFreq() is used, and
when the latter is inaccurate, so is CollectionStatistics.docCount(), and
in the same direction. In addition, CollectionStatistics.docCount() does
not skew when fields are sparse.*

*CollectionStatistics.docCount()*
The total number of documents that have at least one term for this field.
This value is always a positive number, and never exceeds maxDoc().
Returns:
total number of documents containing this field, in the range [1 ..
maxDoc()]
See Also:
Terms.getDocCount(): Returns the number of documents that have at least one
term for this field.
Note that, just like other term measures, this measure does not take
deleted documents into account.

So docCount should be fine with deletes now (it was problematic for a
while: https://issues.apache.org/jira/browse/LUCENE-6711).
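
To make the effect of deletes concrete, here is a minimal sketch of the idf
term in the form BM25Similarity uses, ln(1 + (docCount - docFreq + 0.5) /
(docFreq + 0.5)). It is plain Python rather than Lucene code, and the replica
statistics are made-up numbers, not values from your collection:

```python
import math


def bm25_idf(doc_freq: int, doc_count: int) -> float:
    """idf in the form used by Lucene's BM25Similarity (natural log)."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical statistics for the same field and term on two replicas.
# Replica A has already merged its deleted documents away.
# Replica B still counts 200 deleted documents, 150 of which contain the term.
idf_a = bm25_idf(doc_freq=100, doc_count=1000)
idf_b = bm25_idf(doc_freq=250, doc_count=1200)

print(f"replica A idf: {idf_a:.4f}")
print(f"replica B idf: {idf_b:.4f}")
```

Both replicas hold the same live documents, yet the idf differs until the
deleted documents are merged out, which would be consistent with the scores
matching again after expungeDeletes.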

*Distributed DF*
Your problem may be related to distributed document frequency. When
calculating the score, you may end up using the DF of the shard containing
the document. That DF can be skewed across your collection and cause almost
identical-looking documents to score differently:
https://solr.apache.org/guide/8_8/distributed-requests.html#distributedidf
https://sease.io/2017/11/distributed-search-tips-for-apache-solr.html
https://solr.pl/en/2019/05/20/distributed-idf/
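
As a rough illustration of the skew, the sketch below compares shard-local
idf with idf computed from statistics summed over all shards. The shard names
and numbers are hypothetical, and the formula is the BM25Similarity idf form
written in plain Python, not Solr code:

```python
import math


def bm25_idf(doc_freq: int, doc_count: int) -> float:
    """idf in the form used by Lucene's BM25Similarity (natural log)."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical (docFreq, docCount) for one term on each shard.
shard_stats = {"shard1": (5, 1000), "shard2": (400, 1000)}

# idf as each shard would compute it from its own local statistics.
local_idf = {name: bm25_idf(df, n) for name, (df, n) in shard_stats.items()}

# idf computed once from statistics aggregated across the collection.
global_df = sum(df for df, _ in shard_stats.values())
global_n = sum(n for _, n in shard_stats.values())
global_idf = bm25_idf(global_df, global_n)

for name, value in local_idf.items():
    print(f"{name} local idf: {value:.4f}")
print(f"global idf: {global_idf:.4f}")
```

With local statistics, the same term weighs far more on shard1 than on
shard2, so near-identical documents on different shards score differently.
The Solr guide linked above describes the statsCache implementations (for
example ExactStatsCache) that use collection-wide statistics instead.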

Cheers
--------------------------
Alessandro Benedetti
Apache Lucene/Solr Committer
Director, R&D Software Engineer, Search Consultant
www.sease.io


On Mon, 22 Mar 2021 at 07:54, Dominique Bejean <dominique.bej...@eolya.fr>
wrote:

> Hi,
>
> If your replicas are all NRT, they each index documents. Their commit and
> segment merge cycles are independent, so yes, seeing different MaxDoc and
> DeletedDoc values for each replica is normal.
>
> We would expect BM25 not to care about deleted docs, but I can't answer
> with certainty.
>
> Regards.
>
> Dominique
>
>
> On Sun, 21 Mar 2021 at 14:42, Jae Joo <jaejo...@gmail.com> wrote:
>
> > solr 8.6.2.
> >
> > I have a collection with 48 shards, a 30-second soft commit, and a
> > 2-minute hard commit (openSearcher=false).
> >
> > I found that two replicas have exactly the same Num Docs, but different
> > Max Doc and Deleted Docs.
> >
> > When I run the same query many times, I see that the max score
> > differs. The numFound of the query is exactly the same.
> >
> > Once I run *an expungeDeletes*, it returns the same max score.
> >
> > Questions I have:
> > 1. Are different MaxDoc and Deleted Docs values across replicas normal?
> > 2. Does BM25 care about maxDocs and/or deleted docs?
> >
> > Jae
> >
>
