Hi Angad,

Thanks for suggesting this; I can see how it could be a common culprit. In our particular case it was not the issue, but thank you nonetheless!
For posterity: Matthew Biscocho and I did a deep-dive on this issue, and it turns out there were a few layers to untangle. Here is a synopsis of a "slow" leader election as it manifested in our environment:

1. The current leader gets a SIGTERM and begins graceful shutdown.
2. The leader skips the leader-election optimization of https://issues.apache.org/jira/browse/SOLR-14942 because of this bug: https://issues.apache.org/jira/browse/SOLR-17745
3. The leader begins closing its cores. This can take a long time if there is a lot of ingestion at the time of shutdown, due to ongoing segment merges, file flushes, etc.
4. After 5 seconds our container orchestration system sends a SIGKILL while the leader is still closing cores. It therefore never reaches zkSys/ZkContainer::close, which would otherwise (presumably) delete the ephemeral leader-election node and signal the successor to begin taking over leadership.
5. We had zkClientTimeout set to the default 30 seconds, so only after 30 *more* seconds without a heartbeat does ZooKeeper expire the session and treat the ephemeral leader-election node as deleted. This finally gets propagated to the successor, but only after ~35 seconds of downtime in our case. (A small sketch of this session-timeout effect, outside of Solr, is at the bottom of this message.)

Luke

From: users@solr.apache.org At: 04/14/25 09:20:18 UTC-4:00 To: users@solr.apache.org Subject: Re: Leader Election Slowness

If it's a single leader node, it could be due to the leaderVoteWait configuration.

On Mon, 14 Apr, 2025, 6:18 pm Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD A), <lkotzanie...@bloomberg.net> wrote:

> Hey All,
>
> I'd like to solicit some feedback from the community regarding leader
> election slowness. This is happening in both Solr 8 and 9 clouds my team
> manages, of various sizes. Trying to reproduce this has taken us down the
> Overseer rabbit hole, but I wanted to grab some oxygen before going too
> deep :-)
>
> For context, by "slow" I mean taking 30 seconds or more (sometimes over a
> minute). Typically this results in a big gap, starting from the live nodes
> being updated (visible on all hosts), e.g.:
>
> 2025-04-04 22:23:29.150 INFO ZkStateReader [ ] ? [zkCallback-13-thread-26] - Updated live nodes from ZooKeeper... (16) -> (15)
>
> Then for ~40 seconds the only consistent thing appears to be IndexFetcher
> checking whether the index is in sync, seemingly on a loop.
>
> After 40 seconds we finally see:
>
> 2025-04-04 22:24:01.041 INFO ZkStateReader [ ] ? [zkCallback-13-thread-26] - A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/some-cloud/state.json zxid: 1133871372341] for collection [some_collection] has occurred - updating... (live nodes size: [15])
>
> 2025-04-04 22:24:03.665 INFO ShardLeaderElectionContext [some_collection shard1 core_node20 some_collection_shard1_replica_t19] ? [zkCallback-13-thread-120] - I am the new leader: http://new-leader-url.com/solr/some_collection_shard1_replica_t19/ shard1
>
> 2025-04-04 22:24:03.672 INFO IndexFetcher [ ] ? [indexFetcher-48-thread-1] - Updated leaderUrl to http://new-leader-url.com/solr/some_collection_shard1_replica_t19/
>
> ..
>
> So I am most puzzled by the initial 40-second gap. For context, this
> particular example occurred on a cloud with 2 collections and 31 total
> shards (so not too crazy). We also didn't see anything suspicious in the
> ZooKeeper metrics or logs. Has anyone experienced something similar and,
> if so, would they mind sharing what they found?
>
> Finally, research we've done so far:
>
> Trying to reproduce this with debug logs got us down the Overseer rabbit
> hole. Matt Biscocho pointed me at the initiative to remove the Overseer,
> https://issues.apache.org/jira/browse/SOLR-14927, in favor of
> distributedClusterStateUpdates with optimistic locking. It would be
> interesting to try out, but given our difficulty reproducing the issue
> consistently we are not yet confident in our ability to measure the impact.
> Also, given that we see this on clouds with 1-2 collections, typically with
> up to tens of shards, I am not sure how much we would benefit from the
> extra concurrency of cluster state updating. I know folks here do this on a
> much bigger scale than us in terms of collections per cloud.
>
> Thanks,
> Luke
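P.S. For anyone who wants to see the step-5 effect in isolation, below is a rough, self-contained sketch using the plain ZooKeeper Java client (this is not Solr code; the node path, class name, and localhost:2181 address are just placeholders). Start the watcher in one terminal and the "leader" in another: kill -15 lets the shutdown hook close the session, so the ephemeral node disappears and the watcher is notified almost immediately, whereas kill -9 leaves the node in place until ZooKeeper expires the session, roughly the 30-second session timeout later. The watcher side is essentially what the successor is doing with the election node, which is why the SIGKILL path puts a floor of roughly zkClientTimeout on the failover time.

// Rough sketch only: shows how an ephemeral node is removed immediately on a
// clean session close, but only after the session timeout when the owning
// process is SIGKILLed. Assumes a ZooKeeper server on localhost:2181 and the
// ZooKeeper Java client on the classpath; all names here are placeholders.
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class EphemeralTimeoutDemo {

    private static final String ELECTION_NODE = "/demo-election-node"; // placeholder path
    private static final int SESSION_TIMEOUT_MS = 30_000;              // same as the zkClientTimeout default

    public static void main(String[] args) throws Exception {
        if (args.length > 0 && args[0].equals("leader")) {
            leader();
        } else {
            successor();
        }
    }

    // "Leader": creates an ephemeral node and then just sits there.
    // kill -15 runs the shutdown hook, the session is closed cleanly and the
    // node disappears right away; kill -9 skips it, so ZooKeeper only removes
    // the node after SESSION_TIMEOUT_MS of missed heartbeats.
    private static void leader() throws Exception {
        ZooKeeper zk = connect();
        zk.create(ELECTION_NODE, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            try {
                zk.close();
            } catch (InterruptedException ignored) {
            }
        }));
        System.out.println("Created " + ELECTION_NODE + "; now kill this process.");
        Thread.sleep(Long.MAX_VALUE);
    }

    // "Successor": watches the node and reports how long it took to learn
    // that the node was gone.
    private static void successor() throws Exception {
        ZooKeeper zk = connect();
        CountDownLatch gone = new CountDownLatch(1);
        Stat stat = zk.exists(ELECTION_NODE, event -> {
            if (event.getType() == Watcher.Event.EventType.NodeDeleted) {
                gone.countDown();
            }
        });
        if (stat == null) {
            System.out.println(ELECTION_NODE + " does not exist yet; start the leader first.");
            zk.close();
            return;
        }
        long start = System.nanoTime();
        gone.await();
        System.out.printf("Saw NodeDeleted after %d ms%n", (System.nanoTime() - start) / 1_000_000);
        zk.close();
    }

    // Connect and block until the session is actually established.
    private static ZooKeeper connect() throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", SESSION_TIMEOUT_MS, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();
        return zk;
    }
}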