When calculating the DF (document frequency) component of a BM25 score, Apache Lucene BM25 similarity uses:
org.apache.lucene.search.similarities.BM25Similarity#idfExplain(org.apache.lucene.search.CollectionStatistics, org.apache.lucene.search.TermStatistics)

*Note that CollectionStatistics.docCount() is used instead of IndexReader#numDocs() because also TermStatistics.docFreq() is used, and when the latter is inaccurate, so is CollectionStatistics.docCount(), and in the same direction. In addition, CollectionStatistics.docCount() does not skew when fields are sparse.*

*CollectionStatistics.docCount()*
The total number of documents that have at least one term for this field. This value is always a positive number, and never exceeds maxDoc().
Returns: total number of documents containing this field, in the range [1 .. maxDoc()]
See Also: Returns the number of documents that have at least one term for this field. Note that, just like other term measures, this measure does not take deleted documents into account.

So docCount should be fine with deletes now (it has been problematic for a while: https://issues.apache.org/jira/browse/LUCENE-6711).

*Distributed DF*
Your problem may be related to distributed document frequency: when calculating the score, you may end up using the DF of the shard containing the doc. This DF can be skewed across your collection and cause almost identical-looking documents to score differently:
https://solr.apache.org/guide/8_8/distributed-requests.html#distributedidf
https://sease.io/2017/11/distributed-search-tips-for-apache-solr.html
https://solr.pl/en/2019/05/20/distributed-idf/

Cheers
--------------------------
Alessandro Benedetti
Apache Lucene/Solr Committer
Director, R&D Software Engineer, Search Consultant
www.sease.io

On Mon, 22 Mar 2021 at 07:54, Dominique Bejean <dominique.bej...@eolya.fr> wrote:

> Hi,
>
> If your replicas are all NRT, they both index documents. Their commit and
> segment merge cycles are independent, so yes, seeing different MaxDoc and
> DeletedDoc values for each replica is normal.
>
> We can expect that BM25 doesn't care about deleted docs, but I can't answer
> with certainty.
>
> Regards.
>
> Dominique
>
>
> On Sun, 21 Mar 2021 at 14:42, Jae Joo <jaejo...@gmail.com> wrote:
>
> > Solr 8.6.2.
> >
> > I have a collection with 48 shards, a 30-second soft commit, and a
> > 2-minute hard commit (openSearcher=false).
> >
> > I found that two replicas have exactly the same Num Docs, but different
> > Max Doc and Deleted Docs.
> >
> > When I run the same query many times, I see that the max score
> > differs. The numFound of the query is exactly the same.
> >
> > Once I run *an expungeDeletes*, it returns the same max score.
> >
> > Questions I have:
> > 1. Are different MaxDoc and Deleted Doc values across replicas normal?
> > 2. Does BM25 care about maxDocs and/or deleted docs?
> >
> > Jae
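
As a rough illustration of why replicas (or shards) with different term statistics can score the same document differently, here is a minimal sketch of the IDF formula that BM25Similarity applies to docFreq and docCount. It is plain Python, not Lucene code, and the replica statistics are made-up numbers chosen only for the example: replica B is assumed to still count 200 deleted documents that have not been merged away, 20 of which contain the term.

```python
import math


def bm25_idf(doc_freq: int, doc_count: int) -> float:
    """IDF in the form used by Lucene's BM25Similarity:
    ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
    """
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical per-replica statistics for the same term and field.
# Both replicas hold the same live documents; replica B has unmerged deletes.
replica_a = {"doc_freq": 100, "doc_count": 10_000}
replica_b = {"doc_freq": 120, "doc_count": 10_200}

idf_a = bm25_idf(**replica_a)
idf_b = bm25_idf(**replica_b)

print(f"replica A idf = {idf_a:.4f}")
print(f"replica B idf = {idf_b:.4f}")
print("same idf on both replicas:", math.isclose(idf_a, idf_b))
```

With these assumed numbers the two IDF values differ, so the same query can return a different max score depending on which replica answers. Once the deletes are expunged and both replicas report the same docFreq and docCount, the values match again.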