Hello,

I have a SolrCloud with 5 shards and 2 replicas per shard.
I went back and forth between LocalStatsCache, ExactStatsCache and
ExactSharedStatsCache.
I saw a minor improvement with the Exact... variants over LocalStatsCache.
However, with 10 search results per page, as soon as I switched to the
second page (hits 11 to 20) and forced a page reload a couple of times, the
order of results within the page changed. A result shown as hit number 14
was listed as hit number 16 the next time, and so on. Nothing reliable.
Only the first page looked good.
While inspecting the scores I saw minor changes between reloads,
even with ExactStatsCache and ExactSharedStatsCache.
Further checks on the replicas showed that they are never completely in
sync.
That is, the number of docs and the segment count are in sync, but nothing
else is.

coll1_shard1_replica1:
Num Docs: 53576786
Max Doc:  57506559
Deleted Docs: 3929773
Version:  135351
Master (Searching)  1616078264682  22756
Master (Replicable) 1616402397518  22844

coll1_shard1_replica2:
Num Docs: 53576786
Max Doc:  57494890
Deleted Docs: 3918104
Version:  135326
Master (Searching)  1616078264683  22755
Master (Replicable) 1616402397521  22843
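
For a quick side-by-side of the two overviews above, here is a minimal
sketch. The dictionary keys and the helper name diff_stats are made up for
illustration; the numbers are copied from the listings.

```python
# Figures copied from the two replica overviews above.
replica1 = {"num_docs": 53576786, "max_doc": 57506559,
            "deleted_docs": 3929773, "version": 135351}
replica2 = {"num_docs": 53576786, "max_doc": 57494890,
            "deleted_docs": 3918104, "version": 135326}


def diff_stats(a, b):
    """Return {field: b - a} for every field whose value differs."""
    return {key: b[key] - a[key] for key in a if a[key] != b[key]}


for replica in (replica1, replica2):
    # Deleted Docs is Max Doc minus Num Docs on each replica.
    assert replica["max_doc"] - replica["num_docs"] == replica["deleted_docs"]

drift = diff_stats(replica1, replica2)
print(drift)
```

Num Docs does not appear in the result because it is identical; Max Doc,
Deleted Docs and Version all differ between the two replicas.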

Only Num Docs is the same (which is why we always get the same number of hits
and also the same hits); everything else differs.
I think this is why we never get the same order of results, even when using
ExactStatsCache or ExactSharedStatsCache. We are using CloudSolrj for loading.
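
To illustrate how such small differences can move a score, here is a rough
sketch, not taken from the Solr code. It assumes the BM25 idf formula of
Lucene's default similarity, and it assumes that term statistics still count
deleted documents until their segments are merged, so that two replicas of
one shard can report slightly different values. Max Doc from the listings
above is used as a stand-in for docCount, and both docFreq values are
invented for illustration.

```python
import math


def bm25_idf(doc_freq, doc_count):
    """BM25 idf: ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# doc_count: Max Doc of each replica (stand-in, see above).
# doc_freq: hypothetical values that differ slightly because each replica
# carries a different set of not-yet-merged deleted documents.
idf_r1 = bm25_idf(doc_freq=120_500, doc_count=57_506_559)
idf_r2 = bm25_idf(doc_freq=120_350, doc_count=57_494_890)

print(idf_r1, idf_r2, idf_r1 - idf_r2)
```

The two values differ only in the later decimal places, but that may be
enough to swap neighbouring hits whose scores are almost equal, depending on
which replica answers the request.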

I once ran a test and forced an optimize of the index:
first a commit with expungeDeletes=true, then an optimize with maxSegments=1.
After that everything worked fine and the results stayed in order.
But some weeks later the segment numbers had drifted apart and the problem was
back.
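
The two steps of that test can be expressed as requests to the update
handler. The sketch below only builds the URLs; the host and port are
placeholders, and the collection name coll1 is assumed from the core names
above.

```python
from urllib.parse import urlencode

# Hypothetical base URL; adjust host, port and collection name.
BASE = "http://localhost:8983/solr/coll1"


def update_url(base, **params):
    """Build an update-handler URL with the given request parameters."""
    return f"{base}/update?{urlencode(params)}"


# Step 1: commit and expunge deleted documents.
expunge = update_url(BASE, commit="true", expungeDeletes="true")
# Step 2: optimize (force merge) down to a single segment.
optimize = update_url(BASE, optimize="true", maxSegments=1)

print(expunge)
print(optimize)
```

The URLs can then be requested with curl or urllib.request.urlopen. On an
index of this size the optimize rewrites the whole index, so expect it to
take a while.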

I think this will never work correctly.
It might work only if the replicas are completely in sync with each other.
These are just my findings, without debugging into the code.

Regards
Bernd


On 19.03.21 at 16:15, Cameron M VandenBerg wrote:
Hello,

I am using Solr in a distributed environment where I have split my collection 
into parts, which I have running on different nodes.  When I create each part 
of the collection, I set numShards and replicationFactor to 1.  The query speed 
is most important to us, and we are not worried about load on the system.

I want a Distributed IDF across all parts of the collection so I have added 
this line to my solrconfig.xml:
<statsCache class="org.apache.solr.search.stats.ExactStatsCache" />

This seems to work about 90% of the time, but if I run the same request over 
and over again, sometimes I get scores with a local IDF for just one part of 
the collection.  Here is a request example:
/solr/collection1,collection2/query?q=fulltext:shark&rows=500&fl=id,url,title,score&sort=score+desc

I still get documents from both collection1 and collection2, but sometimes I 
get the same scores as when I query only collection1.  I believe that in that
case it is using only the document frequency of collection1 for the term.

Should I use a different configuration?  I would like to make sure the IDF is 
always distributed and the same every time I run the same query.  Is there any 
technique I could use to ensure that this happens?

Thank you,
Cameron VandenBerg

