Re: Weird issue -- pulling results with cursorMark gets fewer documents than numFound

Shawn Heisey Mon, 28 Aug 2023 11:50:40 -0700

On 8/28/23 11:42, Chris Hostetter wrote:

I assume you mean one of the batches always indexes 5 fewer documents than
'rows=N' param (ie: the query batch size) ... correct?


You're talking about the total numFound being higher than the index count?


The query uses rows=10000, which is configurable via a commandline option.

The source collection's numFound is 5 higher than the number of documents indexed to the target. I was assured that all updates to the source collection were paused during the most recent migration test.

Also possible is that some shards are out of sync with their leader -- ie:
for some shardX, replica1 has a doc that replica2 doesn't, and replica1 is
used for the initial phase of the request to get the "top N sorted doc
uniqueKey at cursorMark=ZZZ" but replica2 is used in the second phase to
fetch all of the field values.  (but if that were the case, you'd expect
that at least some of the time you'd get "lucky" and the two phases would
both hit replicas that agreed with each other -- even if they didn't agree
with the leader -- and the problem wouldn't reliably reproduce every time)

We did make sure that the numDocs was the same on all replicas for each shard. A comprehensive check of ID values across replicas has not been done. I should be able to write a program to do that.
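
A per-replica ID comparison could look roughly like the sketch below. It assumes the full ID list of each replica has already been pulled into memory, for example by querying each core directly with distrib=false. The class name ReplicaIdDiff and the sample IDs are made up for illustration and do not come from any existing tool.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Solr or SolrJ.
class ReplicaIdDiff {
    /** Returns the IDs present in a but absent from b, in sorted order. */
    static Set<String> onlyIn(Collection<String> a, Collection<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.removeAll(new HashSet<>(b));
        return result;
    }

    public static void main(String[] args) {
        // Made-up IDs standing in for what two replicas of one shard return.
        List<String> replica1 = List.of("doc1", "doc2", "doc3");
        List<String> replica2 = List.of("doc1", "doc3");
        System.out.println("only on replica1: " + onlyIn(replica1, replica2)); // [doc2]
        System.out.println("only on replica2: " + onlyIn(replica2, replica1)); // []
    }
}
```

Matching numDocs on two replicas does not rule out a difference like this: each replica could hold one document the other lacks, so the diff is run in both directions.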

: should keep that from happening.  Is there a way to detect this situation?  I

I would log every cursorMark request URL and the number of docs in the
response.

It has been verified that each cursorMark batch is 10000 docs except the last batch, by checking the size of the SolrDocumentList object retrieved from the response. Added some debug-level logging to show that along with the cursorMark value.
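
The batch-size check described above can be sketched without a live Solr instance. The code below is only a simulation: CursorWalk, Page, and fakeSource are hypothetical names, the "cursorMark" here is simply the last ID returned (real Solr marks are opaque strings), and the loop follows the documented pattern of starting at "*" and stopping once the returned nextCursorMark equals the one that was sent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical simulation of cursorMark paging; not SolrJ code.
class CursorWalk {
    /** One response: the IDs in the batch plus the mark to send next. */
    static final class Page {
        final List<String> ids;
        final String nextCursorMark;
        Page(List<String> ids, String nextCursorMark) {
            this.ids = ids;
            this.nextCursorMark = nextCursorMark;
        }
    }

    /** Walks from "*" until the mark stops changing; returns every batch size. */
    static List<Integer> batchSizes(Function<String, Page> fetch) {
        List<Integer> sizes = new ArrayList<>();
        String mark = "*";
        while (true) {
            Page page = fetch.apply(mark);
            sizes.add(page.ids.size()); // the value one would log per request
            if (mark.equals(page.nextCursorMark)) break;
            mark = page.nextCursorMark;
        }
        return sizes;
    }

    /** In-memory stand-in for a query sorted by ID ascending with rows=N. */
    static Function<String, Page> fakeSource(List<String> sortedIds, int rows) {
        return mark -> {
            List<String> batch = new ArrayList<>();
            for (String id : sortedIds) {
                if (batch.size() == rows) break;
                if (mark.equals("*") || id.compareTo(mark) > 0) batch.add(id);
            }
            String next = batch.isEmpty() ? mark : batch.get(batch.size() - 1);
            return new Page(batch, next);
        };
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < 25; i++) ids.add(String.format("doc%03d", i));
        List<Integer> sizes = batchSizes(fakeSource(ids, 10));
        int total = sizes.stream().mapToInt(Integer::intValue).sum();
        System.out.println("batch sizes: " + sizes); // [10, 10, 5, 0]
        System.out.println("total=" + total + " numFound=" + ids.size());
    }
}
```

With this stopping rule the final request returns an empty batch. Comparing the summed batch sizes against numFound is what exposes a shortfall like the 5 missing documents.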

I have finished my SolrJ program using Http2SolrClient that will look for IDs that exist in more than one shard. I had hoped to have it get the list of core URLs from ZK, but couldn't figure that out, so now the commandline options accept multiple core-specific URLs, with the idea that one replica core from each shard will be presented. I have tested it against my little Solr install, with the first URL pointing at the collection alias and the second pointing at the real core. It's a single-shard collection on a single node. As expected, it reported that every ID was duplicated. We'll try it for real in the wee hours of the morning.
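
The core of a cross-shard duplicate check can be sketched as follows. This is not taken from the program itself; it is a stdlib-only illustration that assumes the IDs from each core URL have already been fetched. ShardDuplicates and the URL labels are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; the fetching of IDs from each core is left out.
class ShardDuplicates {
    /** Maps every ID found in two or more shards to the shards containing it. */
    static Map<String, List<String>> find(Map<String, List<String>> idsByShard) {
        Map<String, List<String>> seen = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : idsByShard.entrySet()) {
            // De-duplicate within one shard so a shard is listed once per ID.
            for (String id : new LinkedHashSet<>(e.getValue())) {
                seen.computeIfAbsent(id, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        seen.values().removeIf(shards -> shards.size() < 2);
        return seen;
    }

    public static void main(String[] args) {
        // Same data under two labels, like pointing one URL at the alias
        // and another at the real core of a single-shard collection.
        Map<String, List<String>> input = new LinkedHashMap<>();
        input.put("alias-url", List.of("a", "b", "c"));
        input.put("core-url", List.of("a", "b", "c"));
        System.out.println(find(input)); // every ID is reported as duplicated
    }
}
```

Holding every ID in memory is the simplest approach; for very large collections a streaming or sorted-merge comparison might be needed instead.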


I put the program on GitHub if anyone is interested in taking a look.

https://github.com/elyograg/shard_duplicate_finder

Thanks,
Shawn
