Jonah Hooper created KAFKA-19541:
------------------------------------
Summary: KRaft should handle snapshot fetches under slow networks
Key: KAFKA-19541
URL: https://issues.apache.org/jira/browse/KAFKA-19541
Project: Kafka
Issue Type: Improvement
Components: kraft
Reporter: Jonah Hooper
If a "new" controller has no metadata log stored and joins a quorum, it will
send FETCH_SNAPSHOT to the active controller to receive an up-to-date log. It
does this from FollowerState.
By default, KRaft allows 2s for a request to complete before it considers
the active controller (leader) unavailable. If a request (including
FETCH_SNAPSHOT) exceeds 2s, it times out and the controller, if in
FollowerState, transitions to CandidateState. A controller that has not
fetched any log from the active controller can never become leader, since it
has no data. As such, it eventually transitions back to FollowerState.
If the snapshot on the active controller is too large (in size on disk) to
download within the 2s timeout under the network conditions between the active
controller and the new controller, then it's possible that the "new"
controller will get stuck in a loop.
In this state it will transition as follows:
{code:java}
Unattached -> Follower -> (Fail to FETCH_SNAPSHOT) -> Candidate -> ... ->
Unattached -> Follower -> (Fail to FETCH_SNAPSHOT) -> Candidate ...{code}
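The cycle above can be sketched as a toy model. This is not KRaft code: the class SnapshotLoopSketch, the State enum and the trace method are hypothetical names, and the intermediate steps elided as "..." above are collapsed into a direct return to Unattached. It only assumes that a fetch succeeds when its transfer time fits within the timeout.
{code:java}
import java.util.ArrayList;
import java.util.List;

public class SnapshotLoopSketch {
    // Only the states named in the description; a simplification, not KRaft's full state set.
    enum State { UNATTACHED, FOLLOWER, CANDIDATE }

    // Returns the states visited over at most `rounds` attempts.
    // Stops early, as a follower, once a fetch fits within the timeout.
    static List<State> trace(long transferMs, long timeoutMs, int rounds) {
        List<State> visited = new ArrayList<>();
        for (int i = 0; i < rounds; i++) {
            visited.add(State.UNATTACHED);
            visited.add(State.FOLLOWER);
            if (transferMs <= timeoutMs) {
                return visited; // snapshot fetched, controller stays a follower
            }
            visited.add(State.CANDIDATE); // fetch timed out
        }
        return visited;
    }

    public static void main(String[] args) {
        // A 10s transfer against a 2s timeout never escapes the cycle.
        System.out.println(trace(10_000L, 2_000L, 2));
    }
}
{code}
With a transfer time above the timeout, every round ends in Candidate and the trace repeats for as many rounds as are simulated.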
Consider a snapshot `xxxx.checkpoint` of 20 MB and a connection between the
active controller and the "new" controller of 2 MB/s. It would then take 10s
to complete FETCH_SNAPSHOT of `xxxx.checkpoint`.
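The arithmetic can be checked with a short sketch. It assumes, as this description does, that the whole snapshot transfer has to finish within the timeout, and it reads the example as 20 MB over 2 MB/s (decimal megabytes). The class and method names are hypothetical and not part of Kafka.
{code:java}
public class SnapshotFetchEstimate {
    // Milliseconds needed to move snapshotBytes at bytesPerSecond, rounded up.
    static long transferMillis(long snapshotBytes, long bytesPerSecond) {
        if (bytesPerSecond <= 0) {
            throw new IllegalArgumentException("bytesPerSecond must be positive");
        }
        return (snapshotBytes * 1000 + bytesPerSecond - 1) / bytesPerSecond;
    }

    // True if the estimated transfer completes within timeoutMs.
    static boolean fitsInTimeout(long snapshotBytes, long bytesPerSecond, long timeoutMs) {
        return transferMillis(snapshotBytes, bytesPerSecond) <= timeoutMs;
    }

    public static void main(String[] args) {
        long snapshotBytes = 20_000_000L; // 20 MB snapshot from the example
        long bytesPerSecond = 2_000_000L; // 2 MB/s link from the example
        long timeoutMs = 2_000L;          // the 2s default described above

        System.out.println(transferMillis(snapshotBytes, bytesPerSecond) + " ms");
        System.out.println(fitsInTimeout(snapshotBytes, bytesPerSecond, timeoutMs));
    }
}
{code}
For these inputs the estimate is 10000 ms, five times the 2s timeout, so the check returns false.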
In this case, unless network conditions improve, the "new" controller will be
stuck in this loop forever.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)