[ https://issues.apache.org/jira/browse/KAFKA-19541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonah Hooper reassigned KAFKA-19541:
------------------------------------

    Assignee: Jonah Hooper

> KRaft should handle snapshot fetches under slow networks
> --------------------------------------------------------
>
>                 Key: KAFKA-19541
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19541
>             Project: Kafka
>          Issue Type: Improvement
>          Components: kraft
>            Reporter: Jonah Hooper
>            Assignee: Jonah Hooper
>            Priority: Major
>
> If a "new" controller has no metadata log stored and joins a quorum, it
> will send FETCH_SNAPSHOT to the active controller to receive an up-to-date
> log. It does this from FollowerState.
>
> By default, KRaft allows 2s for any request to complete before it considers
> the active controller (leader) unavailable. If a request (including
> FETCH_SNAPSHOT) exceeds 2s, it times out and the controller, if in
> FollowerState, transitions to CandidateState. A controller that has not
> fetched logs from the active controller can never become leader, since it
> has no data. As such, it will eventually transition back to FollowerState.
>
> If the snapshot on the active controller is too large (in size on disk) to
> download within the timeout, given network conditions between the active
> controller and the new controller, then it is possible that the "new"
> controller will get stuck in a loop.
>
> In this state it will transition from:
> {code:java}
> Unattached -> Follower -> (Fail to FETCH_SNAPSHOT) -> Candidate -> ... ->
> Unattached -> Follower -> (Fail to FETCH_SNAPSHOT) -> Candidate ...
> {code}
>
> Consider a snapshot `xxxx.checkpoint` of 20 MB and a connection between the
> active controller and the "new" controller of 2 MB/s. It would then take
> 10s to complete FETCH_SNAPSHOT of `xxxx.checkpoint`, well beyond the 2s
> timeout.
>
> In this case, unless network conditions improve, the "new" controller will
> be stuck in this loop forever.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
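
The arithmetic in the example above can be sketched as follows. This is a back-of-the-envelope model, not KRaft code: the class and method names are hypothetical, the link speed is read as 2 MB per second (the reading under which the 10s figure holds; at 2 megabits per second the same snapshot would need about 80s), and the snapshot is treated as a single transfer measured against a single 2s timeout, as the description does.

```java
// Illustrative arithmetic only. Nothing here is a Kafka API; all names are hypothetical.
final class SnapshotFetchEstimate {

    /** Seconds needed to move snapshotBytes over a link carrying bytesPerSecond. */
    static double transferSeconds(long snapshotBytes, long bytesPerSecond) {
        if (bytesPerSecond <= 0) {
            throw new IllegalArgumentException("bytesPerSecond must be positive");
        }
        return (double) snapshotBytes / bytesPerSecond;
    }

    /** True when the whole transfer completes within timeoutMs. */
    static boolean fitsWithinTimeout(long snapshotBytes, long bytesPerSecond, long timeoutMs) {
        return transferSeconds(snapshotBytes, bytesPerSecond) * 1000.0 <= timeoutMs;
    }

    public static void main(String[] args) {
        long snapshotBytes = 20_000_000L;  // 20 MB snapshot from the example
        long bytesPerSecond = 2_000_000L;  // assumption: the link carries 2 MB per second
        long timeoutMs = 2_000L;           // 2s default timeout from the description

        System.out.println("transfer seconds: " + transferSeconds(snapshotBytes, bytesPerSecond));
        System.out.println("fits in timeout:  " + fitsWithinTimeout(snapshotBytes, bytesPerSecond, timeoutMs));
    }
}
```

Under these assumptions the transfer needs 10 seconds against a 2 second budget, so every attempt would time out and the follower would re-enter the loop shown in the description.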