[jira] [Assigned] (KAFKA-19541) KRaft should handle snapshot fetches under slow networks

Jonah Hooper (Jira) Fri, 25 Jul 2025 08:53:06 -0700


     [ 
https://issues.apache.org/jira/browse/KAFKA-19541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jonah Hooper reassigned KAFKA-19541:
------------------------------------

    Assignee: Jonah Hooper

> KRaft should handle snapshot fetches under slow networks
> --------------------------------------------------------
>
>                 Key: KAFKA-19541
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19541
>             Project: Kafka
>          Issue Type: Improvement
>          Components: kraft
>            Reporter: Jonah Hooper
>            Assignee: Jonah Hooper
>            Priority: Major
>
> If a "new" controller has no metadata log stored and joins a quorum, it will 
> send FETCH_SNAPSHOT to the active controller to receive an up-to-date log. 
> It does this from FollowerState. 
> By default, KRaft allows 2s for each request to complete before it considers 
> the active controller (leader) unavailable. If a request (including 
> FETCH_SNAPSHOT) exceeds 2s, it will time out and the controller, if in 
> FollowerState, will transition to CandidateState. A controller that has not 
> fetched the log from the active controller can never become leader, since it 
> has no data. As such, it will eventually transition back to follower state. 
> If the snapshot on the active controller is too large (in size on disk) to 
> download within this timeout, given network conditions between the active 
> controller and the new controller, then it's possible that the "new" 
> controller will get stuck in a loop. 
> In this state it will transition as follows:
> {code:java}
> Unattached -> Follower -> (Fail to FETCH_SNAPSHOT) -> Candidate -> ... -> 
> Unattached -> Follower -> (Fail to FETCH_SNAPSHOT) -> Candidate ...{code}
> Consider a snapshot `xxxx.checkpoint` of 20 MB and a connection of 2 MB/s 
> between the active controller and the "new" controller. It would then take 
> 10s to complete FETCH_SNAPSHOT of `xxxx.checkpoint`. 
> In this case, unless network conditions improve, the "new" controller will 
> be stuck in this loop forever. 
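
The arithmetic and the loop in the description can be sketched as a small, simplified model. This is only an illustration of the scenario as reported, not KRaft's actual code: the class `SnapshotFetchSketch`, the helpers `transferMillis` and `simulate`, and the `SnapshotFetched` marker are hypothetical names. It also assumes decimal units (20 MB = 20,000,000 bytes), reads the link speed as 2 MB/s so that it matches the 10s figure above, and treats the whole snapshot transfer as bounded by the single 2s timeout.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the scenario in this issue; NOT KRaft's real implementation.
class SnapshotFetchSketch {

    // Hypothetical helper: milliseconds needed to move `bytes` at `bytesPerSecond`,
    // rounded up to the next whole millisecond.
    static long transferMillis(long bytes, long bytesPerSecond) {
        if (bytesPerSecond <= 0) {
            throw new IllegalArgumentException("bytesPerSecond must be positive");
        }
        return (bytes * 1000 + bytesPerSecond - 1) / bytesPerSecond;
    }

    // Hypothetical helper: replays up to `rounds` attempts and records the states visited.
    // "SnapshotFetched" is only a marker for success, not a KRaft state.
    static List<String> simulate(long fetchMillis, long timeoutMillis, int rounds) {
        List<String> trace = new ArrayList<>();
        for (int i = 0; i < rounds; i++) {
            trace.add("Unattached");
            trace.add("Follower");
            if (fetchMillis <= timeoutMillis) {
                // The fetch finishes within the timeout, so the loop ends here.
                trace.add("SnapshotFetched");
                return trace;
            }
            // The fetch exceeded the timeout: the follower gives up on the leader.
            trace.add("Candidate");
        }
        return trace;
    }

    public static void main(String[] args) {
        long snapshotBytes = 20_000_000L;    // 20 MB snapshot (decimal units, assumed)
        long linkBytesPerSec = 2_000_000L;   // 2 MB/s link (assumed reading of the description)
        long timeoutMillis = 2_000L;         // 2s timeout, as stated in the description

        long fetchMillis = transferMillis(snapshotBytes, linkBytesPerSec);
        System.out.println("fetch needs " + fetchMillis + " ms, timeout is " + timeoutMillis + " ms");
        System.out.println(String.join(" -> ", simulate(fetchMillis, timeoutMillis, 2)));
    }
}
```

With these numbers the fetch needs five times the timeout, so every round of the model ends in Candidate. Raising the timeout or shrinking the per-request transfer are the two knobs the model exposes. Which of them a fix should use is not stated in this issue.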



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
