Hi Angad,

Thanks for suggesting this; I can see how it could be a common culprit. In our particular case it was not the issue, but thank you nonetheless!
For posterity: Matthew Biscocho and I did a deep-dive on this issue, and it turns out there were a few layers to untangle. Here is a synopsis of a "slow" leader election as it manifested in our environment:

1. The current leader gets a SIGTERM and begins graceful shutdown.
2. The leader skips the leader-election optimization of https://issues.apache.org/jira/browse/SOLR-14942 because of this bug: https://issues.apache.org/jira/browse/SOLR-17745
3. The leader begins closing its cores. This can take a long time if there is a lot of ingestion at the time of shutdown, due to ongoing segment merges, file flushes, etc.
4. After 5 seconds our container orchestration system sends a SIGKILL while the leader is still closing cores. It therefore never reaches zkSys/ZkContainer::close, which would otherwise (presumably) delete the ephemeral leader-election node and signal the successor to begin taking over leadership.
5. We had zkClientTimeout set to the default 30 seconds, so only after 30 *more* seconds without a heartbeat does ZooKeeper expire the session and treat the ephemeral leader-election node as deleted. This finally gets propagated to the successor, but only after ~35 seconds of downtime in our case. (A small sketch of this session-timeout effect, outside of Solr, is at the bottom of this message.)

Luke

From: users@solr.apache.org At: 04/14/25 09:20:18 UTC-4:00 To: users@solr.apache.org Subject: Re: Leader Election Slowness

If it's a single leader node, it could be due to the leaderVoteWait configuration.

On Mon, 14 Apr, 2025, 6:18 pm Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD A), <lkotzanie...@bloomberg.net> wrote:

> Hey All,
>
> I'd like to solicit some feedback from the community regarding leader
> election slowness. This is happening in both Solr 8 and 9 clouds my team
> manages, of various sizes. Trying to reproduce this has taken us down the
> Overseer rabbit hole, but I wanted to grab some oxygen before going too
> deep :-)
>
> For context, by "slow" I mean taking 30 seconds or more (sometimes over a
> minute). Typically this results in a big gap, starting from the live nodes
> being updated (visible on all hosts), e.g.:
>
> 2025-04-04 22:23:29.150 INFO ZkStateReader [ ] ? [zkCallback-13-thread-26] - Updated live nodes from ZooKeeper... (16) -> (15)
>
> Then for ~40 seconds the only consistent thing appears to be IndexFetcher
> checking whether the index is in sync, seemingly on a loop.
>
> After 40 seconds we finally see:
>
> 2025-04-04 22:24:01.041 INFO ZkStateReader [ ] ? [zkCallback-13-thread-26] - A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/some-cloud/state.json zxid: 1133871372341] for collection [some_collection] has occurred - updating... (live nodes size: [15])
>
> 2025-04-04 22:24:03.665 INFO ShardLeaderElectionContext [some_collection shard1 core_node20 some_collection_shard1_replica_t19] ? [zkCallback-13-thread-120] - I am the new leader: http://new-leader-url.com/solr/some_collection_shard1_replica_t19/ shard1
>
> 2025-04-04 22:24:03.672 INFO IndexFetcher [ ] ? [indexFetcher-48-thread-1] - Updated leaderUrl to http://new-leader-url.com/solr/some_collection_shard1_replica_t19/
>
> ..
>
> So I am most puzzled by the initial 40-second gap. For context, this
> particular example occurred on a cloud with 2 collections and 31 total
> shards (so not too crazy). We also didn't see anything suspicious in the
> ZooKeeper metrics or logs. Has anyone experienced something similar and,
> if so, would they mind sharing what they found?
>
> Finally, research we've done so far:
>
> Trying to reproduce this with debug logs got us down the Overseer rabbit
> hole. Matt Biscocho pointed me at the initiative to remove the Overseer,
> https://issues.apache.org/jira/browse/SOLR-14927, in favor of
> distributedClusterStateUpdates with optimistic locking. It would be
> interesting to try out, but given our difficulty reproducing the issue
> consistently we are not yet confident in our ability to measure the impact.
> Also, given that we see this on clouds with 1-2 collections, typically with
> up to tens of shards, I am not sure how much we would benefit from the
> extra concurrency of cluster state updating. I know folks here do this on a
> much bigger scale than us in terms of collections per cloud.
>
> Thanks,
> Luke
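P.S. For anyone who wants to see the step-5 effect in isolation, below is a rough, self-contained sketch using the plain ZooKeeper Java client (this is not Solr code; the node path, class name, and localhost:2181 address are just placeholders). Start the watcher in one terminal and the "leader" in another: kill -15 lets the shutdown hook close the session, so the ephemeral node disappears and the watcher is notified almost immediately, whereas kill -9 leaves the node in place until ZooKeeper expires the session, roughly the 30-second session timeout later. The watcher side is essentially what the successor is doing with the election node, which is why the SIGKILL path puts a floor of roughly zkClientTimeout on the failover time.

// Rough sketch only: shows how an ephemeral node is removed immediately on a
// clean session close, but only after the session timeout when the owning
// process is SIGKILLed. Assumes a ZooKeeper server on localhost:2181 and the
// ZooKeeper Java client on the classpath; all names here are placeholders.
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class EphemeralTimeoutDemo {

    private static final String ELECTION_NODE = "/demo-election-node"; // placeholder path
    private static final int SESSION_TIMEOUT_MS = 30_000;              // same as the zkClientTimeout default

    public static void main(String[] args) throws Exception {
        if (args.length > 0 && args[0].equals("leader")) {
            leader();
        } else {
            successor();
        }
    }

    // "Leader": creates an ephemeral node and then just sits there.
    // kill -15 runs the shutdown hook, the session is closed cleanly and the
    // node disappears right away; kill -9 skips it, so ZooKeeper only removes
    // the node after SESSION_TIMEOUT_MS of missed heartbeats.
    private static void leader() throws Exception {
        ZooKeeper zk = connect();
        zk.create(ELECTION_NODE, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            try {
                zk.close();
            } catch (InterruptedException ignored) {
            }
        }));
        System.out.println("Created " + ELECTION_NODE + "; now kill this process.");
        Thread.sleep(Long.MAX_VALUE);
    }

    // "Successor": watches the node and reports how long it took to learn
    // that the node was gone.
    private static void successor() throws Exception {
        ZooKeeper zk = connect();
        CountDownLatch gone = new CountDownLatch(1);
        Stat stat = zk.exists(ELECTION_NODE, event -> {
            if (event.getType() == Watcher.Event.EventType.NodeDeleted) {
                gone.countDown();
            }
        });
        if (stat == null) {
            System.out.println(ELECTION_NODE + " does not exist yet; start the leader first.");
            zk.close();
            return;
        }
        long start = System.nanoTime();
        gone.await();
        System.out.printf("Saw NodeDeleted after %d ms%n", (System.nanoTime() - start) / 1_000_000);
        zk.close();
    }

    // Connect and block until the session is actually established.
    private static ZooKeeper connect() throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", SESSION_TIMEOUT_MS, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();
        return zk;
    }
}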