If it's a single leader node, it could be due to the leaderVoteWait
configuration.
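
For reference, leaderVoteWait is configurable in the <solrcloud> section of
solr.xml (or via a system property of the same name); the value below is the
stock default, shown for illustration only, not a recommendation:

```xml
<solrcloud>
  <!-- How long (in ms) a node waits for other replicas of a shard to join a
       leader election before proceeding; the stock default is 180000 (3 min).
       Lowering it (e.g. -DleaderVoteWait=10000 at startup) can speed up
       elections at the cost of potentially electing a stale replica. -->
  <int name="leaderVoteWait">${leaderVoteWait:180000}</int>
</solrcloud>
```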

On Mon, 14 Apr, 2025, 6:18 pm Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD A), <
lkotzanie...@bloomberg.net> wrote:

> Hey All,
>
> I'd like to solicit some feedback from the community regarding leader
> election slowness. This is happening in Solr 8 and Solr 9 clouds of
> various sizes that my team manages. Trying to reproduce this has taken us
> down the Overseer rabbit hole, but I wanted to grab some oxygen before
> going too deep :-)
>
> For context, by "slow" I mean taking 30 seconds or more (sometimes over a
> minute). Typically this shows up as a big gap after the live-nodes update
> (visible on all hosts), i.e.:
>
> > 2025-04-04 22:23:29.150 INFO  ZkStateReader [   ] ?
> [zkCallback-13-thread-26] - Updated live nodes from ZooKeeper... (16) ->
> (15)
>
> Then for ~40 seconds the only consistent activity appears to be
> IndexFetcher checking whether the index is in sync, seemingly in a loop.
>
> After 40 seconds we finally see:
>
>
> > 2025-04-04 22:24:01.041 INFO  ZkStateReader [   ] ?
> [zkCallback-13-thread-26] - A cluster state change: [WatchedEvent
> state:SyncConnected type:NodeDataChanged
> path:/collections/some-cloud/state.json zxid: 1133871372341] for collection
> [some_collection] has occurred - updating... (live nodes size: [15])
>
> > 2025-04-04 22:24:03.665 INFO  ShardLeaderElectionContext
> [some_collection shard1 core_node20 some_collection_shard1_replica_t19] ?
> [zkCallback-13-thread-120] - I am the new leader:
> http://new-leader-url.com/solr/some_collection_shard1_replica_t19/ shard1
>
> > 2025-04-04 22:24:03.672 INFO  IndexFetcher [   ] ?
> [indexFetcher-48-thread-1] - Updated leaderUrl to
> http://new-leader-url.com/solr/some_collection_shard1_replica_t19/
>
> ..
>
>
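
As a quick sanity check on the quoted timestamps (the "~40 seconds" above is an
eyeball figure; the excerpts themselves span roughly 32-34 seconds), a
throwaway Python sketch:

```python
# Compute the gaps between the quoted log timestamps (illustrative only).
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"
live_nodes_update = datetime.strptime("2025-04-04 22:23:29.150", FMT)
state_change = datetime.strptime("2025-04-04 22:24:01.041", FMT)
new_leader = datetime.strptime("2025-04-04 22:24:03.665", FMT)

print((state_change - live_nodes_update).total_seconds())  # ~31.9 s
print((new_leader - live_nodes_update).total_seconds())    # ~34.5 s
```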
> So I am most puzzled by the initial ~40-second gap. For context, this
> particular example occurred on a cloud with 2 collections and 31 total
> shards (so not too crazy). We also didn't see anything suspicious in
> ZooKeeper metrics or logs. Has anyone experienced something similar and,
> if so, would they mind sharing what they found?
>
>
> Finally, research we've done so far:
>
> Trying to reproduce this with debug logs got us down the Overseer rabbit
> hole. Matt Biscocho pointed me at the initiative to remove the Overseer (
> https://issues.apache.org/jira/browse/SOLR-14927) in favor of
> distributedClusterStateUpdates with optimistic locking. It would be
> interesting to try out, but given our difficulty reproducing the issue
> consistently, we are not yet confident in our ability to measure the
> impact. Also, given that we see this on clouds with 1-2 collections,
> typically with up to tens of shards, I am not sure how much we would
> benefit from the extra concurrency of cluster state updates. I know folks
> here run at a much bigger scale in terms of collections per cloud than we
> do.
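
For anyone curious, my understanding is that distributed cluster state updates
are opted into via the <solrcloud> section of solr.xml in Solr 9; the snippet
below is a sketch from memory, so please verify the option names against the
reference guide for your version before relying on it:

```xml
<solrcloud>
  <!-- Route cluster state updates through replicas directly (with optimistic
       locking on state.json) instead of the Overseer queue. Sketch only;
       confirm exact names/defaults in the Solr 9 reference guide. -->
  <bool name="distributedClusterStateUpdates">true</bool>
  <bool name="distributedCollectionConfigSetExecution">true</bool>
</solrcloud>
```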
>
> Thanks,
> Luke
