Hello,

On Solr 8.7.0 we consistently end up with shards whose replicas are all in the DOWN state after a node restart and a shard REBALANCELEADERS operation.
During the rebalance we stop all /update requests to all shards (the rebalance call we issue is sketched at the end of this message). Once this issue happens, we see “No Servers Hosting Shard” errors in the Solr log, and the only remedy is to manually verify that the shard’s replicas have the same index segments and files, and then delete the tlog files. We also found that if we periodically invoke a core RELOAD on the non-leader replicas during indexing (also sketched at the end), the tlog files on those replicas don’t grow huge (100GB+) and fewer shards end up DOWN after a node restart. We’re not sure whether this is a bug or a configuration problem and would appreciate your help. Thank you.

==> solr config
All replicas are TLOG replicas, for every shard in the collection.
Hard commit is 10 sec and soft commit is 5 sec.
updateLog numVersionBuckets is 65536; no other updateLog value is set.
(A sketch of the corresponding solrconfig.xml section is at the end of this message.)

==> solr log messages
2021-04-27 09:24:03.556 INFO (qtp967677821-339) [c:mycollection s:1_c_e r:core_node1284 x:mycollection_1_c_e_replica_t1281] o.a.s.c.ZkController mycollection_1_c_e_replica_t1281 starting background replication from leader mydata-solr-1
2021-04-27 09:24:03.556 INFO (qtp967677821-339) [c:mycollection s:1_c_e r:core_node1284 x:mycollection_1_c_e_replica_t1281] o.a.s.c.ZkController mycollection_1_c_e_replica_t1281 stopping background replication from leader mydata-solr-1
2021-04-27 09:24:03.553 INFO (qtp967677821-339) [c:mycollection s:1_c_e r:core_node1284 x:mycollection_1_c_e_replica_t1281] o.a.s.c.ShardLeaderElectionContextBase No version found for ephemeral leader parent node, won't remove previous leader registration. mydata-solr-1
2021-04-27 09:24:03.779 INFO (zkCallback-14-thread-7) [c:mycollection s:1_c_e r:core_node1284 x:mycollection_1_c_e_replica_t1281] o.a.s.c.ShardLeaderElectionContext I may be the new leader - try and sync mydata-solr-1
2021-04-27 09:24:10.027 ERROR (zkCallback-14-thread-7) [c:mycollection s:1_c_e r:core_node1284 x:mycollection_1_c_e_replica_t1281] o.a.s.u.PeerSync PeerSync: core=mycollection_1_c_e_replica_t1281 url=http://mydata-solr-1.mydata-solr:8983/solr Requested 37 updates from http://mydata-solr-3.mydata-solr:8983/solr/mycollection_1_c_e_replica_t1282/ but retrieved 28 mydata-solr-1
2021-04-27 09:24:10.027 INFO (zkCallback-14-thread-7) [c:mycollection s:1_c_e r:core_node1284 x:mycollection_1_c_e_replica_t1281] o.a.s.c.SyncStrategy Leader's attempt to sync with shard failed, moving to the next candidate mydata-solr-1
2021-04-27 09:24:10.027 INFO (zkCallback-14-thread-7) [c:mycollection s:1_c_e r:core_node1284 x:mycollection_1_c_e_replica_t1281] o.a.s.u.PeerSync PeerSync: core=mycollection_1_c_e_replica_t1281 url=http://mydata-solr-1.mydata-solr:8983/solr DONE. sync failed mydata-solr-1

==> solr log messages that we see continuously repeating
2021-04-27 10:00:57.583 ERROR (zkCallback-14-thread-17) [c:mycollection s:1_c_e r:core_node1284 x:mycollection_1_c_e_replica_t1281] o.a.s.u.PeerSync PeerSync: core=mycollection_1_c_e_replica_t1281 url=http://mydata-solr-1.mydata-solr:8983/solr Requested 37 updates from http://mydata-solr-3.mydata-solr:8983/solr/mycollection_1_c_e_replica_t1282/ but retrieved 28 mydata-solr-1
2021-04-27 10:01:01.244 ERROR (zkCallback-14-thread-43) [c:mycollection s:1_c_e r:core_node1285 x:mycollection_1_c_e_replica_t1282] o.a.s.u.PeerSync PeerSync: core=mycollection_1_c_e_replica_t1282 url=http://mydata-solr-3.mydata-solr:8983/solr Requested 30 updates from http://mydata-solr-1.mydata-solr:8983/solr/mycollection_1_c_e_replica_t1281/ but retrieved 24 mydata-solr-3
2021-04-27 10:01:04.352 ERROR (zkCallback-14-thread-15) [c:mycollection s:1_c_e r:core_node1286 x:mycollection_1_c_e_replica_t1283] o.a.s.u.PeerSync PeerSync: core=mycollection_1_c_e_replica_t1283 url=http://mydata-solr-0.mydata-solr:8983/solr Requested 41 updates from http://mydata-solr-3.mydata-solr:8983/solr/mycollection_1_c_e_replica_t1282/ but retrieved 32 mydata-solr-0
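
==> rebalance call (sketch)
For reference, this is roughly the Collections API call we use to rebalance the leaders. The host name is just one of the nodes from the logs above, and any parameters other than the collection name are omitted here:

curl "http://mydata-solr-1.mydata-solr:8983/solr/admin/collections?action=REBALANCELEADERS&collection=mycollection"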
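
==> solrconfig.xml (sketch)
A minimal sketch of the relevant updateHandler section of our solrconfig.xml. Only the three values described under “solr config” above are definitive; the dir and openSearcher lines are just the stock template values, included to show the structure. (The TLOG replica type itself is set at collection creation via the tlogReplicas parameter, not in solrconfig.xml.)

<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
    <int name="numVersionBuckets">65536</int>
  </updateLog>
  <autoCommit>
    <maxTime>10000</maxTime>            <!-- hard commit every 10 sec -->
    <openSearcher>false</openSearcher>  <!-- stock template value, shown for structure -->
  </autoCommit>
  <autoSoftCommit>
    <maxTime>5000</maxTime>             <!-- soft commit every 5 sec -->
  </autoSoftCommit>
</updateHandler>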
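
==> periodic core reload (sketch)
And this is roughly the Core Admin RELOAD we run on a timer against each non-leader TLOG replica while indexing is in progress; the host and core name here are simply taken from the log lines above:

curl "http://mydata-solr-3.mydata-solr:8983/solr/admin/cores?action=RELOAD&core=mycollection_1_c_e_replica_t1282"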