Hi Michael,

I have 8 shards (on 8 different nodes) and no replicas, with about 500 million documents. Additionally, I have a collection with just 2 shards and no replicas (and significantly fewer documents) where I see the same behavior.

I observe this behavior even when I route the query through the same "entry node". To reproduce it, I can just hit refresh on the same query several times. Most of the time the scores reflect a distributed IDF, but sometimes I get scores that reflect the IDF of only one of the shards (even though documents from both shards are returned).
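
As an illustration of the effect described above, here is a minimal sketch (plain Python, not Solr code) of how an IDF computed from one shard's statistics can differ from one computed from the merged statistics. It assumes BM25 scoring with Lucene's IDF formula; the shard names, document counts, and document frequencies are made-up numbers, not values from my cluster.

```python
import math

def bm25_idf(doc_freq: int, doc_count: int) -> float:
    """BM25 IDF in the form Lucene uses: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# Hypothetical per-shard statistics for a single term: (docFreq, docCount).
SHARD_STATS = {
    "shard1": (1_000, 1_000_000),
    "shard2": (50_000, 1_000_000),
}

def local_idf(shard: str) -> float:
    """IDF as a shard would compute it from its own statistics only."""
    doc_freq, doc_count = SHARD_STATS[shard]
    return bm25_idf(doc_freq, doc_count)

def global_idf() -> float:
    """IDF computed from statistics summed over all shards (distributed IDF)."""
    total_doc_freq = sum(df for df, _ in SHARD_STATS.values())
    total_doc_count = sum(n for _, n in SHARD_STATS.values())
    return bm25_idf(total_doc_freq, total_doc_count)

if __name__ == "__main__":
    for shard in SHARD_STATS:
        print(f"{shard}: local IDF = {local_idf(shard):.3f}")
    print(f"merged: global IDF = {global_idf():.3f}")
```

With numbers like these, the same document gets a different score depending on whether the local or the merged statistics were used, which matches the symptom of scores changing between refreshes.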

Thanks!
Cameron VandenBerg

-----Original Message-----
From: Michael Gibney <mich...@michaelgibney.net>
Sent: Monday, March 22, 2021 10:20 PM
To: users@solr.apache.org
Subject: Re: Distributed IDF for Solr using ExactStatsCache issue

Cameron,

What is your cluster configuration? i.e., how many nodes, how many replicas per node, how many replicas in each collection, etc.?

Do you observe consistent behavior for the same query if you always route that query via the same "entry node" (i.e., not load balanced over the cluster)?

Michael

On Fri, Mar 19, 2021 at 11:16 AM Cameron M VandenBerg <c...@cs.cmu.edu> wrote:
> Hello,
>
> I am using Solr in a distributed environment where I have split my
> collection into parts, which I have running on different nodes. When
> I create each part of the collection, I set numShards and
> replicationFactor to 1. The query speed is most important to us, and
> we are not worried about load on the system.
>
> I want a Distributed IDF across all parts of the collection so I have
> added this line to my solrconfig.xml:
> <statsCache class="org.apache.solr.search.stats.ExactStatsCache" />
>
> This seems to work about 90% of the time, but if I run the same
> request over and over again, sometimes I get scores with a local IDF
> for just one part of the collection. Here is a request example:
>
> /solr/collection1,collection2/query?q=fulltext:shark&rows=500&fl=id,url,title,score&sort=score+desc
>
> I still get documents from both collection1 and collection2, but
> sometimes I get scores that are the same as when I would just query
> collection1. I believe that it is only using the document frequency
> of collection one for the term in that case.
>
> Should I use a different configuration? I would like to make sure the
> IDF is always distributed and the same every time I run the same
> query. Is there any technique I could use to ensure that this happens?
>
> Thank you,
> Cameron VandenBerg