Source: Solr 4.7 SolrCloud, 3 shards, 7 replicas in the collection. Target: Solr 9.1.1 SolrCloud, 3 shards and 3 replicas.
The source version is a custom 4.7.0 build that says it includes SOLR-5875, which is a very small patch. The target version is unmodified Solr 9.1.1. The client is unwilling to change versions.
The schema meets the requirements for Atomic Update, so we are migrating by querying the old cluster and writing to the new cluster. We do it in batches by filtering on one of the fields, and use cursorMark to page through the results efficiently.
The query thread gets batches of 10000 documents and dumps them on a queue, which is then processed by indexing threads. The query side uses Http2SolrClient with a URL; the indexing side uses CloudHttp2SolrClient with zk info, with the option set to send only to shard leaders. The source collection is NRT because that's all that 4.7 supports; the target is TLOG. Both SolrClient objects are set to use HTTP 1.1.
One of the batches always indexes 5 fewer documents than numFound. It's consistent -- always 5 documents. Updates are paused during the migration. On the last run, numFound for this batch was 3824942 and the indexed count was 3824937.
The query batches are always 10000 except for the last one, which is 4937. The index batches are always 1000 except for the last one, which is 937.
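A quick arithmetic check on those counts, as a minimal Java sketch (the class and method names are made up for illustration). Both partial batch sizes are what you would expect for 3824937 documents, which suggests that everything the cursor returned made it through the queue to the indexer, and that the 5 missing documents were never returned on the query side.

```java
// Sanity check of the batch arithmetic, using only the counts quoted above.
// BatchMath and its methods are hypothetical helpers, not part of the migration code.
class BatchMath {
    /** Size of the final batch for total > 0; a full batch if total divides evenly. */
    static long lastBatchSize(long total, long batchSize) {
        long rem = total % batchSize;
        return rem == 0 ? batchSize : rem;
    }

    /** Number of batches needed to cover total documents (ceiling division). */
    static long batchCount(long total, long batchSize) {
        return (total + batchSize - 1) / batchSize;
    }

    public static void main(String[] args) {
        long numFound = 3_824_942L; // reported by the source cluster
        long indexed = 3_824_937L;  // counted on the indexing side

        System.out.println(lastBatchSize(indexed, 10_000));  // 4937, the observed last query batch
        System.out.println(lastBatchSize(indexed, 1_000));   // 937, the observed last index batch
        System.out.println(lastBatchSize(numFound, 10_000)); // 4942, what numFound would imply
        System.out.println(numFound - indexed);              // 5
    }
}
```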
It probably doesn't matter, but the queue size is 500000. There are two index threads.
I don't think there is a problem with the migration code. The other batches (created with a filter query) are all working properly ... the number of documents indexed matches the numFound. Total number of documents is a little over 30 million, so this batch is a little over 10 percent of the total.
Has anyone seen a problem on 4.7.0 where numFound doesn't match the total document count retrieved with cursorMark? The only thing I can imagine that would cause this is having a different numDocs count in each replica, but we have verified that these counts are all the same in every replica of each shard.
The other idea I have is that a uniqueKey value might appear in more than one shard. This doesn't seem likely, as the compositeId router should keep that from happening. Is there a way to detect this situation? I have an idea for a SolrJ program that would detect it, but I am hoping that Solr 4.7 has something built in.
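For reference, the detection logic I have in mind looks roughly like the sketch below. It assumes the uniqueKey values have already been read from one core of each shard with distrib=false (paging with cursorMark), which is not shown here; the class and method names are hypothetical. If I understand distributed search correctly, an id present in two shards would be counted twice in numFound but returned only once after the merge, which would fit the symptom.

```java
import java.util.*;

// Hypothetical helper: finds uniqueKey values that occur in more than one shard.
// Fetching the ids from Solr is assumed to happen elsewhere.
class DuplicateIdFinder {
    /**
     * @param idsByShard shard name -> uniqueKey values read from one core of that shard
     * @return ids found in two or more shards, each mapped to the shards containing it
     */
    static Map<String, Set<String>> findCrossShardDuplicates(Map<String, List<String>> idsByShard) {
        // First pass: record every shard in which each id was seen.
        Map<String, Set<String>> shardsById = new HashMap<>();
        for (Map.Entry<String, List<String>> entry : idsByShard.entrySet()) {
            for (String id : entry.getValue()) {
                shardsById.computeIfAbsent(id, k -> new TreeSet<>()).add(entry.getKey());
            }
        }
        // Second pass: keep only the ids seen in more than one shard.
        Map<String, Set<String>> duplicates = new TreeMap<>();
        for (Map.Entry<String, Set<String>> entry : shardsById.entrySet()) {
            if (entry.getValue().size() > 1) {
                duplicates.put(entry.getKey(), entry.getValue());
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        Map<String, List<String>> idsByShard = new HashMap<>();
        idsByShard.put("shard1", Arrays.asList("a", "b"));
        idsByShard.put("shard2", Arrays.asList("b", "c"));
        idsByShard.put("shard3", Arrays.asList("d"));
        System.out.println(findCrossShardDuplicates(idsByShard)); // {b=[shard1, shard2]}
    }
}
```

This holds every id in memory, which should be manageable if it is run with the same filter query as the problem batch (about 3.8 million ids) rather than against the whole collection.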
Thanks, Shawn