Colin McCabe created KAFKA-17793:
------------------------------------

             Summary: Improve kcontroller robustness against long delays
                 Key: KAFKA-17793
                 URL: https://issues.apache.org/jira/browse/KAFKA-17793
             Project: Kafka
          Issue Type: Bug
            Reporter: Colin McCabe
            Assignee: Colin McCabe

As described in KIP-500, the Kafka controller monitors the liveness of each broker in the cluster. It gathers this information from heartbeats sent by the brokers themselves. In some rare cases, the main controller thread may get blocked for several seconds at a time. In the current code, the controller cannot update the last contact times for the brokers while it is blocked, so brokers that are still sending heartbeats can appear to have gone silent.

This PR changes the controller heartbeat handling to be partially lockless. Specifically, the last contact time for each broker will be updated locklessly, before the rest of the heartbeat handling runs. This ensures that heartbeats always get through, even when the main controller thread is delayed.

Additionally, this PR adds a PeriodicTaskControlManager to better manage periodic tasks. This should help with the very common pattern where we want to schedule a background task at some frequency, and where we want the task to be rescheduled immediately if there is too much work to be done in one event.
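
To make the two ideas above concrete, here is a minimal Java sketch. It is not the code from this PR: the class names BrokerContactTimes and PeriodicDelay, their methods, and the millisecond timestamps are all hypothetical, chosen only for illustration. The first class records a broker's last contact time without going through the main controller thread; the second shows the "reschedule immediately if more work remains" rule as a plain function.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch; these are not the actual Kafka controller classes.
final class BrokerContactTimes {
    // Broker ID -> last contact time in milliseconds (monotonic clock assumed).
    private final ConcurrentHashMap<Integer, AtomicLong> lastContactMs =
        new ConcurrentHashMap<>();

    // Intended to be called on the request thread as soon as a heartbeat
    // arrives, before anything is queued for the main controller thread.
    // Existing entries are updated with an atomic max, so a late-arriving
    // older timestamp never moves the contact time backwards.
    // (Inserting a brand-new broker may briefly synchronize inside the map.)
    void recordHeartbeat(int brokerId, long nowMs) {
        lastContactMs
            .computeIfAbsent(brokerId, id -> new AtomicLong(nowMs))
            .accumulateAndGet(nowMs, Math::max);
    }

    // Intended to be called by the controller thread when checking liveness.
    // A broker that has never been heard from is treated as stale.
    boolean isStale(int brokerId, long nowMs, long sessionTimeoutMs) {
        AtomicLong last = lastContactMs.get(brokerId);
        return last == null || nowMs - last.get() > sessionTimeoutMs;
    }
}

// Hypothetical helper showing the rescheduling rule for a periodic task.
final class PeriodicDelay {
    // If the task reports leftover work, run it again right away;
    // otherwise wait for the normal period.
    static long nextDelayMs(boolean moreWorkPending, long periodMs) {
        return moreWorkPending ? 0L : periodMs;
    }
}
```

With this shape, a controller thread that was blocked for a few seconds would still see fresh contact times once it resumes, because recordHeartbeat never waits on it.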