showuon commented on PR #14428: URL: https://github.com/apache/kafka/pull/14428#issuecomment-1759306603
> I'm wondering if we may start causing leaders to resign when followers are slow/backlogged and make the situation worse? E.g. if we have multiple followers that need to catch up via a large fetch snapshot, they are unable to fetch again prior to the timeout expiring, and cause the current leader to resign. I don't believe this would be very disruptive but wanted to check folks had considered this/similar situation.

Yes, with a timeout of 1.5x the fetch timeout, this issue should be resolved. Also, if a follower is slow for whatever reason and doesn't fetch again within the fetch timeout, it will also start a new election. That is already the current behavior.

> I think we can also modify QUORUM_FETCH_TIMEOUT_MS_DOC to be slightly more explicit too (i.e. Maximum time a leader can go without receiving valid fetch or fetchsnapshot request from a majority of the quorum before resigning or something slightly different if we choose to use 1.5x)

Doc updated. I don't think we need to mention the 1.5x factor, because that is an implementation detail.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
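For readers following the thread, the check under discussion can be sketched roughly as below. This is a simplified illustration, not the code in the PR: the class name `CheckQuorumSketch`, its methods, and the constant `CHECK_QUORUM_TIMEOUT_FACTOR` are hypothetical, and the sketch assumes the leader resigns once fewer than a majority of voters (counting itself) have fetched within 1.5x the fetch timeout.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the leader-side quorum check; not the actual Kafka implementation.
final class CheckQuorumSketch {
    // Assumed factor, taken from the "1.5x" mentioned in the discussion.
    private static final double CHECK_QUORUM_TIMEOUT_FACTOR = 1.5;

    private final long checkQuorumTimeoutMs;
    private final int majority;
    // Last time each follower voter sent a valid Fetch or FetchSnapshot request.
    private final Map<Integer, Long> lastFetchMs = new HashMap<>();

    // Assumes leaderId is a member of voterIds.
    CheckQuorumSketch(int leaderId, Set<Integer> voterIds, long fetchTimeoutMs, long nowMs) {
        this.checkQuorumTimeoutMs = (long) (fetchTimeoutMs * CHECK_QUORUM_TIMEOUT_FACTOR);
        this.majority = voterIds.size() / 2 + 1;
        for (int id : voterIds) {
            if (id != leaderId) {
                lastFetchMs.put(id, nowMs); // start every follower's clock at election time
            }
        }
    }

    long checkQuorumTimeoutMs() {
        return checkQuorumTimeoutMs;
    }

    // Record a fetch from a follower; requests from non-voters are ignored.
    void onFetch(int followerId, long nowMs) {
        if (lastFetchMs.containsKey(followerId)) {
            lastFetchMs.put(followerId, nowMs);
        }
    }

    // True when fewer than a majority of voters have been heard from recently.
    boolean shouldResign(long nowMs) {
        int reachable = 1; // the leader counts itself
        for (long last : lastFetchMs.values()) {
            if (nowMs - last < checkQuorumTimeoutMs) {
                reachable++;
            }
        }
        return reachable < majority;
    }

    public static void main(String[] args) {
        CheckQuorumSketch leader = new CheckQuorumSketch(1, Set.of(1, 2, 3), 2000L, 0L);
        leader.onFetch(2, 2000L);
        System.out.println("resign at 3000 ms? " + leader.shouldResign(3000L));
        System.out.println("resign at 5000 ms? " + leader.shouldResign(5000L));
    }
}
```

With a 2000 ms fetch timeout the sketch waits 3000 ms, so a follower that is busy with a large snapshot gets half a fetch timeout of extra slack before the leader gives up, which is the concern raised in the quoted question.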
