Hey All, I'd like to solicit some feedback from the community regarding leader election slowness. This is happening in both Solr 8 and Solr 9 clouds of various sizes that my team manages. Trying to reproduce this has taken us down the Overseer rabbit hole, but I wanted to grab some oxygen before going too deep :-)
For context, by "slow" I mean taking 30 seconds or more (sometimes over a minute). Typically this shows up as a big gap after the live nodes are updated (visible on all hosts), e.g.:

> 2025-04-04 22:23:29.150 INFO ZkStateReader [ ] ? [zkCallback-13-thread-26] - Updated live nodes from ZooKeeper... (16) -> (15)

Then for ~40 seconds the only consistent activity appears to be IndexFetcher checking whether the index is in sync, seemingly on a loop. After those 40 seconds we finally see:

> 2025-04-04 22:24:01.041 INFO ZkStateReader [ ] ? [zkCallback-13-thread-26] - A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/some-cloud/state.json zxid: 1133871372341] for collection [some_collection] has occurred - updating... (live nodes size: [15])
> 2025-04-04 22:24:03.665 INFO ShardLeaderElectionContext [some_collection shard1 core_node20 some_collection_shard1_replica_t19] ? [zkCallback-13-thread-120] - I am the new leader: http://new-leader-url.com/solr/some_collection_shard1_replica_t19/ shard1
> 2025-04-04 22:24:03.672 INFO IndexFetcher [ ] ? [indexFetcher-48-thread-1] - Updated leaderUrl to http://new-leader-url.com/solr/some_collection_shard1_replica_t19/ ..

So I am most puzzled by that initial 40-second gap. For context, this particular example occurred on a cloud with 2 collections and 31 total shards (so not too crazy). We also didn't see anything suspicious in the ZooKeeper metrics or logs.

Has anyone experienced something similar and, if so, would they mind sharing what they found?

Finally, the research we've done so far: trying to reproduce this with debug logs took us down the Overseer rabbit hole. Matt Biscocho pointed me at the initiative to remove the Overseer (https://issues.apache.org/jira/browse/SOLR-14927) in favor of distributedClusterStateUpdates with optimistic locking. It would be interesting to try out, but given our difficulty reproducing the issue consistently we are not yet confident in our ability to measure the impact. Also, given that we see this on clouds with 1-2 collections, typically with up to tens of shards, I am not sure how much we would benefit from the extra concurrency of cluster state updating. I know folks here operate at a much bigger scale in terms of collections per cloud than we do.

Thanks,
Luke
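P.S. For anyone curious what we would actually be flipping on: as far as I can tell from SOLR-14927 and the 9.x ref guide, the distributed mode is toggled in the <solrcloud> section of solr.xml, roughly as below. We have not tried this yet, so treat the exact option names as my best reading of the docs rather than something we have verified.

    <solrcloud>
      <!-- Let nodes write cluster state updates to ZooKeeper directly
           (with optimistic locking) instead of going through the Overseer. -->
      <bool name="distributedClusterStateUpdates">true</bool>
      <!-- Optionally also execute collection/config-set admin commands on the
           receiving node rather than on the Overseer. -->
      <bool name="distributedCollectionConfigSetExecution">true</bool>
    </solrcloud>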