rondagostino opened a new pull request, #12856: URL: https://github.com/apache/kafka/pull/12856
KRaft brokers maintain their liveness in the cluster by sending BROKER_HEARTBEAT requests to the active controller; the active controller fences a broker if it doesn't receive a heartbeat request from that broker within the period defined by `broker.session.timeout.ms`. The broker should use a request timeout for its BROKER_HEARTBEAT requests that is not larger than the session timeout being used by the controller. A larger request timeout creates the possibility that, upon controller failover, the broker fails to cancel an outstanding heartbeat request and heartbeat to the new controller in time to maintain an uninterrupted session in the cluster. In other words, a failure of the active controller could result in under-replicated (or under-min ISR) partitions simply due to a delay in brokers heartbeating to the new controller.

This patch adds documentation to that effect and sets the `controller.socket.timeout.ms` config accordingly in the quickstart files. It also changes `BrokerToControllerChannelManager.scala` to set the default request timeout to the value of `controller.socket.timeout.ms` rather than the generic `request.timeout.ms`. However, this default timeout value is not used by the BrokerToControllerChannelManager functionality, so this change is simply cosmetic at this time.

### Committer Checklist (excluded from commit message)
- [ ] Verify design and implementation
- [ ] Verify test coverage and CI build status
- [ ] Verify documentation (including upgrade notes)
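
For illustration only, the timeout relationship described in this PR could be expressed in a KRaft server properties file roughly as follows. This is a sketch under stated assumptions: the numeric values are illustrative and are not necessarily the ones this patch puts in the quickstart files. The only point is that the broker-side `controller.socket.timeout.ms` is kept no larger than the controller-side `broker.session.timeout.ms`.

```properties
# Controller side: fence a broker if no BROKER_HEARTBEAT arrives within this period.
# (Illustrative value, assumed for this example.)
broker.session.timeout.ms=9000

# Broker side: request timeout for broker-to-controller requests such as BROKER_HEARTBEAT.
# Kept no larger than broker.session.timeout.ms so that a heartbeat stuck on a failed
# controller times out before the session with the new controller would expire.
# (Illustrative value, assumed for this example.)
controller.socket.timeout.ms=9000
```

If the session timeout is changed on the controllers, the broker-side value would presumably need to be revisited so that it does not exceed it.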