If it's a single leader node, it could be due to the leaderVoteWait configuration.
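For reference, a sketch of where that setting lives: leaderVoteWait is configured in the `<solrcloud>` section of solr.xml, in milliseconds (the default is 180000, i.e. 3 minutes). The 10000 value below is just an illustrative lower setting, not a recommendation:

```xml
<!-- solr.xml fragment (sketch): leaderVoteWait controls how long a
     candidate waits for other replicas to join the election before
     becoming leader. Value is in milliseconds; default is 180000.
     Other <solrcloud> settings omitted for brevity. -->
<solrcloud>
  <int name="leaderVoteWait">${leaderVoteWait:10000}</int>
</solrcloud>
```

It can also be overridden per-node with `-DleaderVoteWait=10000` without editing solr.xml, since the stock config reads it from that system property.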
On Mon, 14 Apr, 2025, 6:18 pm Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD A), <lkotzanie...@bloomberg.net> wrote:

> Hey All,
>
> I'd like to solicit some feedback from the community regarding leader
> election slowness. This is happening in both Solr 8 and 9 clouds my team
> manages, of various sizes. Trying to reproduce this has taken us down the
> Overseer rabbit hole, but I wanted to grab some oxygen before going too
> deep :-)
>
> For context, by "slow" I mean taking 30 seconds or more (sometimes over a
> minute). Typically this results in a big gap after the live nodes are
> updated (visible on all hosts), e.g.:
>
> 2025-04-04 22:23:29.150 INFO ZkStateReader [ ] ?
> [zkCallback-13-thread-26] - Updated live nodes from ZooKeeper... (16) ->
> (15)
>
> Then for ~40 seconds the only consistent activity appears to be
> IndexFetcher checking, seemingly on a loop, whether the index is in sync.
>
> After 40 seconds we finally see:
>
> 2025-04-04 22:24:01.041 INFO ZkStateReader [ ] ?
> [zkCallback-13-thread-26] - A cluster state change: [WatchedEvent
> state:SyncConnected type:NodeDataChanged
> path:/collections/some-cloud/state.json zxid: 1133871372341] for
> collection [some_collection] has occurred - updating... (live nodes size:
> [15])
>
> 2025-04-04 22:24:03.665 INFO ShardLeaderElectionContext
> [some_collection shard1 core_node20 some_collection_shard1_replica_t19] ?
> [zkCallback-13-thread-120] - I am the new leader:
> http://new-leader-url.com/solr/some_collection_shard1_replica_t19/ shard1
>
> 2025-04-04 22:24:03.672 INFO IndexFetcher [ ] ?
> [indexFetcher-48-thread-1] - Updated leaderUrl to
> http://new-leader-url.com/solr/some_collection_shard1_replica_t19/
>
> ..
>
> So I am most puzzled by the initial 40-second gap. For context, this
> particular example occurred on a cloud with 2 collections and 31 total
> shards (so not too crazy). We also didn't see anything suspicious in the
> ZooKeeper metrics or logs.
> Has anyone experienced something similar and, if so, would they mind
> sharing what they found?
>
> Finally, the research we've done so far:
>
> Trying to reproduce this with debug logs got us down the Overseer rabbit
> hole. Matt Biscocho pointed me at the initiative to remove the Overseer
> (https://issues.apache.org/jira/browse/SOLR-14927) in favor of
> distributedClusterStateUpdates with optimistic locking. It would be
> interesting to try out, but given our difficulty reproducing the issue
> consistently, we are not yet confident in our ability to measure the
> impact. Also, given that we see this on clouds with 1-2 collections,
> typically with up to tens of shards, I am not sure how much we would
> benefit from the extra concurrency of cluster state updating. I know folks
> here run at a much bigger scale than us in terms of collections per cloud.
>
> Thanks,
> Luke
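One way to narrow down where that 40-second gap is spent is to look directly at the election and Overseer znodes in ZooKeeper while a shard is leaderless. A rough sketch using the `bin/solr zk` CLI; the ZK host, collection, and shard names below are placeholders taken from the thread, not real values:

```
# Assumes a ZK ensemble reachable at zk1:2181 (placeholder).
ZK=zk1:2181

# Election candidates for the collection; the lowest sequence number
# in each shard's election queue is the next leader candidate.
bin/solr zk ls -r /collections/some_collection/leader_elect -z $ZK

# Live nodes, to correlate with the "Updated live nodes" log line.
bin/solr zk ls /live_nodes -z $ZK

# Overseer work queue; a backed-up queue here would delay the
# state.json update that precedes "I am the new leader".
bin/solr zk ls /overseer/queue -z $ZK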