Hello,

On Solr 8.7.0 we consistently end up with shards whose replicas are all in the
DOWN state after a node restart and a REBALANCELEADERS call.

During the rebalance we stop all /update requests to all shards.
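
To be specific, by "rebalance" we mean the Collections API REBALANCELEADERS
action; the call looks roughly like this (the host name here is only an
example):

  http://mydata-solr-0.mydata-solr:8983/solr/admin/collections?action=REBALANCELEADERS&collection=mycollection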

Once this issue happens, we see a “No Servers Hosting Shard” error in the Solr
log, and the only remedy is to manually verify that the shard's replicas have
the same index segments and files, and then delete the tlog files.

We also found that if we periodically invoke a core reload on the non-leader
replicas during indexing, the tlog files on those replicas do not grow huge
(100 GB+) and fewer shards end up DOWN after a node restart.
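
The reload itself is just a Core Admin API call per non-leader core, for
example (the host and core name below are taken from the log excerpts further
down and are only illustrative):

  http://mydata-solr-1.mydata-solr:8983/solr/admin/cores?action=RELOAD&core=mycollection_1_c_e_replica_t1281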

We’re not sure whether this is a bug or a configuration problem and would
appreciate your help. Thank you.


==> Solr config

All replicas are TLOG replicas, for all shards in the collection.
Hard commit (autoCommit) is 10 seconds and soft commit (autoSoftCommit) is 5 seconds.
updateLog numVersionBuckets is 65536; no other updateLog value is set.
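
For completeness, the relevant updateHandler section of solrconfig.xml looks
roughly like the sketch below; only the values listed above are set
explicitly, everything else is left at its default:

  <updateHandler class="solr.DirectUpdateHandler2">
    <updateLog>
      <!-- only numVersionBuckets is set explicitly -->
      <int name="numVersionBuckets">65536</int>
    </updateLog>
    <autoCommit>
      <maxTime>10000</maxTime>   <!-- hard commit every 10 seconds -->
    </autoCommit>
    <autoSoftCommit>
      <maxTime>5000</maxTime>    <!-- soft commit every 5 seconds -->
    </autoSoftCommit>
  </updateHandler>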


==> Solr log messages

2021-04-27 09:24:03.556 INFO  (qtp967677821-339) [c:mycollection s:1_c_e 
r:core_node1284 x:mycollection_1_c_e_replica_t1281] o.a.s.c.ZkController 
mycollection_1_c_e_replica_t1281 starting background replication from leader
mydata-solr-1

2021-04-27 09:24:03.556 INFO  (qtp967677821-339) [c:mycollection s:1_c_e 
r:core_node1284 x:mycollection_1_c_e_replica_t1281] o.a.s.c.ZkController 
mycollection_1_c_e_replica_t1281 stopping background replication from leader
mydata-solr-1

2021-04-27 09:24:03.553 INFO  (qtp967677821-339) [c:mycollection s:1_c_e 
r:core_node1284 x:mycollection_1_c_e_replica_t1281] 
o.a.s.c.ShardLeaderElectionContextBase No version found for ephemeral leader 
parent node, won't remove previous leader registration.
mydata-solr-1

2021-04-27 09:24:03.779 INFO  (zkCallback-14-thread-7) [c:mycollection s:1_c_e 
r:core_node1284 x:mycollection_1_c_e_replica_t1281] 
o.a.s.c.ShardLeaderElectionContext I may be the new leader - try and sync
mydata-solr-1

2021-04-27 09:24:10.027 ERROR (zkCallback-14-thread-7) [c:mycollection s:1_c_e 
r:core_node1284 x:mycollection_1_c_e_replica_t1281] o.a.s.u.PeerSync PeerSync: 
core=mycollection_1_c_e_replica_t1281 
url=http://mydata-solr-1.mydata-solr:8983/solr  Requested 37 updates from 
http://mydata-solr-3.mydata-solr:8983/solr/mycollection_1_c_e_replica_t1282/ 
but retrieved 28
mydata-solr-1

2021-04-27 09:24:10.027 INFO  (zkCallback-14-thread-7) [c:mycollection s:1_c_e 
r:core_node1284 x:mycollection_1_c_e_replica_t1281] o.a.s.c.SyncStrategy 
Leader's attempt to sync with shard failed, moving to the next candidate
mydata-solr-1

2021-04-27 09:24:10.027 INFO  (zkCallback-14-thread-7) [c:mycollection s:1_c_e 
r:core_node1284 x:mycollection_1_c_e_replica_t1281] o.a.s.u.PeerSync PeerSync: 
core=mycollection_1_c_e_replica_t1281 
url=http://mydata-solr-1.mydata-solr:8983/solr  DONE. sync failed
mydata-solr-1


==> Solr log messages that we see continuously repeating

2021-04-27 10:00:57.583 ERROR (zkCallback-14-thread-17) [c:mycollection s:1_c_e 
r:core_node1284 x:mycollection_1_c_e_replica_t1281] o.a.s.u.PeerSync PeerSync: 
core=mycollection_1_c_e_replica_t1281 
url=http://mydata-solr-1.mydata-solr:8983/solr  Requested 37 updates from 
http://mydata-solr-3.mydata-solr:8983/solr/mycollection_1_c_e_replica_t1282/ 
but retrieved 28
mydata-solr-1

2021-04-27 10:01:01.244 ERROR (zkCallback-14-thread-43) [c:mycollection s:1_c_e 
r:core_node1285 x:mycollection_1_c_e_replica_t1282] o.a.s.u.PeerSync PeerSync: 
core=mycollection_1_c_e_replica_t1282 
url=http://mydata-solr-3.mydata-solr:8983/solr  Requested 30 updates from 
http://mydata-solr-1.mydata-solr:8983/solr/mycollection_1_c_e_replica_t1281/ 
but retrieved 24
mydata-solr-3

2021-04-27 10:01:04.352 ERROR (zkCallback-14-thread-15) [c:mycollection s:1_c_e 
r:core_node1286 x:mycollection_1_c_e_replica_t1283] o.a.s.u.PeerSync PeerSync: 
core=mycollection_1_c_e_replica_t1283 
url=http://mydata-solr-0.mydata-solr:8983/solr  Requested 41 updates from 
http://mydata-solr-3.mydata-solr:8983/solr/mycollection_1_c_e_replica_t1282/ 
but retrieved 32
mydata-solr-0
