Hi Michael,

I have 8 shards (on 8 different nodes) and no replicas, with about 500 million 
documents.  Additionally, I have a collection with just 2 shards and no 
replicas (and significantly fewer documents) where I see the same behavior.  I 
do observe this behavior even when I route the query through the same "entry 
node".  To reproduce it, I can just hit refresh on the same query several 
times.  Most of the time the scores reflect a distributed IDF, but 
sometimes I get scores that reflect the IDF of only one of the shards (even 
though documents from both shards are returned).
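
To illustrate the difference between a per-shard IDF and a distributed one, 
here is a minimal Python sketch.  It assumes the default BM25 similarity, 
whose IDF is ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)).  The shard 
names and the counts are made up for illustration; they are not numbers from 
my cluster.

```python
import math


def bm25_idf(doc_freq: int, doc_count: int) -> float:
    """IDF as computed by BM25: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical per-shard statistics for a single term (illustrative only).
shards = {
    "shard1": {"doc_freq": 1_000, "doc_count": 1_000_000},
    "shard2": {"doc_freq": 50_000, "doc_count": 1_000_000},
}

# Local IDF: each shard only knows its own term statistics.
local_idf = {
    name: bm25_idf(stats["doc_freq"], stats["doc_count"])
    for name, stats in shards.items()
}

# Distributed IDF: term statistics are summed over all shards before scoring.
global_idf = bm25_idf(
    sum(stats["doc_freq"] for stats in shards.values()),
    sum(stats["doc_count"] for stats in shards.values()),
)

for name, idf in local_idf.items():
    print(f"{name} local IDF: {idf:.3f}")
print(f"distributed IDF: {global_idf:.3f}")
```

When the term is distributed unevenly like this, the same document gets a 
noticeably different score depending on which set of statistics is used, 
which matches the pattern I see between refreshes.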

Thanks!
Cameron VandenBerg

-----Original Message-----
From: Michael Gibney <mich...@michaelgibney.net> 
Sent: Monday, March 22, 2021 10:20 PM
To: users@solr.apache.org
Subject: Re: Distributed IDF for Solr using ExactStatsCache issue

Cameron,
What is your cluster configuration? i.e., how many nodes, how many replicas per 
node, how many replicas in each collection, etc.? Do you observe consistent 
behavior for the same query if you always route that query via the same "entry 
node" (i.e., not load balanced over the cluster)?
Michael

On Fri, Mar 19, 2021 at 11:16 AM Cameron M VandenBerg <c...@cs.cmu.edu>
wrote:

> Hello,
>
> I am using Solr in a distributed environment where I have split my 
> collection into parts, which I have running on different nodes.  When 
> I create each part of the collection, I set numShards and 
> replicationFactor to 1.  The query speed is most important to us, and 
> we are not worried about load on the system.
>
> I want a Distributed IDF across all parts of the collection so I have 
> added this line to my solrconfig.xml:
> <statsCache class="org.apache.solr.search.stats.ExactStatsCache" />
>
> This seems to work about 90% of the time, but if I run the same 
> request over and over again, sometimes I get scores with a local IDF 
> for just one part of the collection.  Here is a request example:
>
> /solr/collection1,collection2/query?q=fulltext:shark&rows=500&fl=id,url,title,score&sort=score+desc
>
> I still get documents from both collection1 and collection2, but 
> sometimes I get scores that are the same as when I would just query 
> collection1.  I believe that it is only using the document frequency 
> of collection1 for the term in that case.
>
> Should I use a different configuration?  I would like to make sure the 
> IDF is always distributed and the same every time I run the same 
> query.  Is there any technique I could use to ensure that this happens?
>
> Thank you,
> Cameron VandenBerg
>
>
