On 8/28/23 11:42, Chris Hostetter wrote:
I assume you mean one of the batches always indexes 5 fewer documents than
the 'rows=N' param (ie: the query batch size) ... correct?
You're talking about the total numFound being higher than the index count?
The query uses rows=10000, which is configurable via a command-line option.
The source collection's numFound is 5 higher than the number of
documents indexed to the target. I was assured that all updates to the
source collection were paused during the most recent migration test.
Also possible is that some shards are out of sync with their leader -- ie:
for some shardX, replica1 has a doc that replica2 doesn't, and replica1 is
used for the initial phase of the request to get the "top N sorted doc
uniqueKey at cursorMark=ZZZ" but replica2 is used in the second phase to
fetch all of the field values. (but if that were the case, you'd expect
that at least some of the time you'd get "lucky" and the two phases would
both hit replicas that agreed with each other -- even if they didn't agree
with the leader -- and the problem wouldn't reliably reproduce every time)
We did make sure that the numDocs was the same on all replicas for each
shard. A comprehensive check of ID values across replicas has not been
done. I should be able to write a program to do that.
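
Matching numDocs does not prove the replicas hold the same IDs, so the
check I have in mind would look roughly like the sketch below. It assumes
the ID sets have already been pulled from each replica core (presumably
with distrib=false so each core answers only for itself). The class and
method names are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class ReplicaIdDiff {
    /**
     * For each replica, lists the IDs that exist on at least one other
     * replica of the same shard but are missing from this one.
     * Replicas that are missing nothing are left out of the result.
     */
    static Map<String, Set<String>> missingPerReplica(Map<String, Set<String>> idsByReplica) {
        // Union of every ID seen on any replica of the shard.
        Set<String> union = new TreeSet<>();
        for (Set<String> ids : idsByReplica.values()) {
            union.addAll(ids);
        }
        Map<String, Set<String>> missing = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : idsByReplica.entrySet()) {
            Set<String> diff = new TreeSet<>(union);
            diff.removeAll(e.getValue());
            if (!diff.isEmpty()) {
                missing.put(e.getKey(), diff);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Hypothetical data: both replicas could report matching counts
        // in other cases and still differ in which IDs they hold.
        Map<String, Set<String>> idsByReplica = new HashMap<>();
        idsByReplica.put("replica1", Set.of("a", "b", "c"));
        idsByReplica.put("replica2", Set.of("a", "b"));
        System.out.println(missingPerReplica(idsByReplica));
    }
}
```

An empty result would mean the replicas agree on IDs, which is a stronger
statement than the numDocs comparison we already did.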
: should keep that from happening. Is there a way to detect this situation? I
I would log every cursorMark request URL and the number of docs in the
response.
It has been verified that each cursorMark batch is 10000 docs except the
last batch, by checking the size of the SolrDocumentList object
retrieved from the response. I added some debug-level logging to show
that size along with the cursorMark value.
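
For anyone following along, the per-batch logging amounts to something
like this sketch. It is not the migration code itself. The Page class and
the fetchPage function are hypothetical stand-ins so the loop can run
without a Solr server; in a real run fetchPage would wrap a SolrJ query
that sets the cursorMark parameter and reads the next cursor mark from
the response.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class CursorBatchLog {
    /** Hypothetical holder for one response: returned IDs plus nextCursorMark. */
    static final class Page {
        final List<String> ids;
        final String nextCursorMark;

        Page(List<String> ids, String nextCursorMark) {
            this.ids = ids;
            this.nextCursorMark = nextCursorMark;
        }
    }

    /**
     * Walks the cursor starting at "*", logging the size of every batch,
     * and returns the total number of documents seen. Stops when the
     * returned mark equals the mark that was sent.
     */
    static long walk(Function<String, Page> fetchPage, int rows) {
        String cursorMark = "*";
        long total = 0;
        while (true) {
            Page page = fetchPage.apply(cursorMark);
            int size = page.ids.size();
            total += size;
            System.out.printf("cursorMark=%s docs=%d%s%n",
                    cursorMark, size, size < rows ? " (short batch)" : "");
            if (cursorMark.equals(page.nextCursorMark)) {
                return total; // mark did not advance: nothing more to fetch
            }
            cursorMark = page.nextCursorMark;
        }
    }

    public static void main(String[] args) {
        // Fake three-response sequence standing in for a real collection.
        Map<String, Page> fake = Map.of(
                "*", new Page(List.of("a", "b"), "m1"),
                "m1", new Page(List.of("c"), "m2"),
                "m2", new Page(List.of(), "m2"));
        System.out.println("total=" + walk(fake::get, 2));
    }
}
```

The returned total is the number to compare against the source
collection's numFound. Any short batch other than the final ones would be
worth a closer look.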
I have finished my SolrJ program using Http2SolrClient that will look
for IDs that exist in more than one shard. I had hoped to have it get
the list of core URLs from ZK, but couldn't figure that out, so now the
command-line options accept multiple core-specific URLs, with the idea
that one replica core from each shard will be supplied. I have tested
it against my little Solr install, with the first URL pointing at the
collection alias and the second pointing at the real core. It's a
single-shard collection on a single node. As expected, it reported that
every ID was duplicated. We'll try it for real in the wee hours of the
morning.
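
The core idea, stripped of all the SolrJ plumbing, is roughly the sketch
below. This is a simplified illustration and not the code in the
repository: it assumes the IDs from each core URL are already in memory,
and the class and method names are invented for the example.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class CrossShardDuplicates {
    /**
     * Maps each ID that was seen on more than one core to the set of
     * cores it was seen on. IDs found on only one core are dropped.
     */
    static Map<String, Set<String>> findDuplicates(
            Map<String, ? extends Collection<String>> idsByCore) {
        Map<String, Set<String>> coresById = new HashMap<>();
        for (Map.Entry<String, ? extends Collection<String>> e : idsByCore.entrySet()) {
            for (String id : e.getValue()) {
                coresById.computeIfAbsent(id, k -> new TreeSet<>()).add(e.getKey());
            }
        }
        // Keep only IDs that appear on two or more cores.
        coresById.values().removeIf(cores -> cores.size() < 2);
        return coresById;
    }

    public static void main(String[] args) {
        // Mirrors my local test: two URLs that resolve to the same data,
        // so every ID should come back as a duplicate. Names are made up.
        Map<String, List<String>> idsByCore = new HashMap<>();
        idsByCore.put("alias_url", List.of("1", "2", "3"));
        idsByCore.put("core_url", List.of("1", "2", "3"));
        System.out.println(findDuplicates(idsByCore).keySet());
    }
}
```

With one replica core per shard as input, a healthy collection should
produce an empty map.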
I put the program on GitHub if anyone is interested in taking a look.
https://github.com/elyograg/shard_duplicate_finder
Thanks,
Shawn