Hey Jonah,

Thanks for the KIP and PR! I agree with the motivation for the change and just wanted to clarify a bit of the wording:
> The new controller, by becoming a candidate and increasing the epoch will
> start to trigger frequent leadership elections which it can never win
> causing further chaos in the cluster.

It wasn't clear to me why we assume the new controller would become a
candidate if other members of the quorum would presumably not grant any
pre-vote requests from this node (it is unable to complete fetch requests
and therefore unable to catch up sufficiently on the log).

Best,
Alyssa

On Wed, Oct 15, 2025 at 8:42 AM Jonah Hooper <[email protected]> wrote:

> Hey Justine,
>
> Thanks for reading the KIP!
>
> > For *internal.max.size.bytes*, you mentioned removing this internal
> > config. Can you clarify this config and how it is used? I couldn't find
> > it in the code.
>
> It is used here -
> https://github.com/apache/kafka/blob/616b0234c88c4f265c3545c282f38709e8e4f7f7/core/src/main/scala/kafka/raft/KafkaMetadataLog.scala#L76
>
> And defined here -
> https://github.com/apache/kafka/blob/616b0234c88c4f265c3545c282f38709e8e4f7f7/raft/src/main/java/org/apache/kafka/raft/MetadataLogConfig.java#L120
>
> Best,
> Jonah
>
> On Tue, Oct 14, 2025 at 2:26 PM Justine Olshan
> <[email protected]> wrote:
>
> > Hi Jonah,
> >
> > Thanks for the KIP. For now I have one question.
> >
> > For *internal.max.size.bytes*, you mentioned removing this internal
> > config. Can you clarify this config and how it is used? I couldn't find
> > it in the code.
> >
> > Justine
> >
> > On Tue, Oct 7, 2025 at 11:07 AM Jun Rao <[email protected]> wrote:
> >
> > > Hi, Jonah,
> > >
> > > Thanks for the KIP. Just a couple of minor comments.
> > >
> > > controller.quorum.fetch.timeout.ms defaults to 2 secs. Is that a bit
> > > too low?
> > >
> > > Orthogonally, in the case when the network latency is long, one can
> > > typically tune socket.receive.buffer.bytes to improve the throughput.
> > >
> > > Jun
> > >
> > > On Fri, Sep 26, 2025 at 10:29 AM José Armando García Sancio
> > > <[email protected]> wrote:
> > >
> > > > Thanks for the changes Jonah. Looking forward to the implementation
> > > > of this feature.
> > > >
> > > > On Fri, Sep 26, 2025 at 12:33 PM Jonah Hooper
> > > > <[email protected]> wrote:
> > > >
> > > > > Thanks for the feedback José! Did some work to apply the KIP to a
> > > > > correct format.
> > > > >
> > > > > > The motivation section starts with the solution without
> > > > > > motivating the problem. I think you want to delete the first
> > > > > > paragraph in the motivation section.
> > > > >
> > > > > Done.
> > > > >
> > > > > > In the motivation section you have "non-terminating loop where
> > > > > > they cannot join the cluster." They are technically not joining
> > > > > > the cluster. The issue is a liveness issue. Because the
> > > > > > FETCH_SNAPSHOT RPC doesn't complete within the fetch timeout
> > > > > > configuration the fetching replica cannot make progress. In the
> > > > > > worst case, if it is a voter/controller, it further disrupts the
> > > > > > leader by becoming a candidate and increasing the epoch.
> > > > >
> > > > > Updated the motivation to reflect this.
> > > > >
> > > > > > In the public interface section, please make it clear that the
> > > > > > changes are the addition of two new configurations. Let's also
> > > > > > add descriptions for these configurations.
> > > > >
> > > > > Added descriptions.
> > > > >
> > > > > > Please fill out "Compatibility, Deprecation, and Migration
> > > > > > Plan". Why is this safe and backward compatible?
> > > > >
> > > > > Done.
> > > > >
> > > > > > In the test section you have "We can perform once off tests
> > > > > > which evaluate the effect of degraded networking on snapshot.
> > > > > > It's better to design system-tests which focus on correctness by
> > > > > > using pathologically low values for
> > > > > > controller.quorum.fetch.snapshot.max.bytes and
> > > > > > controller.quorum.fetch.max.bytes. An example scenario would be
> > > > > > a controller joining a quorum with a snapshot with a known size
> > > > > > and then attempting to fetch the snapshot in small (say 64kb) or
> > > > > > even pathologically small values like 1 byte." I don't fully
> > > > > > understand this recommendation but if my understanding is
> > > > > > correct this sounds very complicated and flaky to implement.
> > > > > > Can't we just add protocol tests to KafkaRaftClient*Test that
> > > > > > show these configurations being used when handling FETCH and
> > > > > > FETCH_SNAPSHOT requests?
> > > > >
> > > > > Good point; that section is too specific. I've left it a bit more
> > > > > vague: unit tests are sufficient provided we can "mock" a
> > > > > situation where we retrieve less data than is available.
> > > > >
> > > > > Best,
> > > > > Jonah
> > > > >
> > > > > On Mon, Sep 22, 2025 at 12:31 PM José Armando García Sancio
> > > > > <[email protected]> wrote:
> > > > >
> > > > > > Hi Jonah,
> > > > > >
> > > > > > Thanks for the changes. The proposed solution looks good to me
> > > > > > at a high-level. I just have some minor comments.
> > > > > >
> > > > > > The motivation section starts with the solution without
> > > > > > motivating the problem. I think you want to delete the first
> > > > > > paragraph in the motivation section.
> > > > > >
> > > > > > In the motivation section you have "non-terminating loop where
> > > > > > they cannot join the cluster." They are technically not joining
> > > > > > the cluster. The issue is a liveness issue. Because the
> > > > > > FETCH_SNAPSHOT RPC doesn't complete within the fetch timeout
> > > > > > configuration the fetching replica cannot make progress. In the
> > > > > > worst case, if it is a voter/controller, it further disrupts the
> > > > > > leader by becoming a candidate and increasing the epoch.
> > > > > >
> > > > > > In the motivation section you have "The proposed default
> > > > > > configurations tradeoff a reduction in the throughput of data
> > > > > > transmitted by Fetch and FetchSnapshot requests with a liveness
> > > > > > and correctness improvement." This is also the solution and not
> > > > > > part of the motivation.
> > > > > >
> > > > > > In the public interface section, please make it clear that the
> > > > > > changes are the addition of two new configurations. Let's also
> > > > > > add descriptions for these configurations.
> > > > > >
> > > > > > Please fill out "Compatibility, Deprecation, and Migration
> > > > > > Plan". Why is this safe and backward compatible?
> > > > > >
> > > > > > In the test section you have "We can perform once off tests
> > > > > > which evaluate the effect of degraded networking on snapshot.
> > > > > > It's better to design system-tests which focus on correctness
> > > > > > by using pathologically low values for
> > > > > > controller.quorum.fetch.snapshot.max.bytes and
> > > > > > controller.quorum.fetch.max.bytes. An example scenario would be
> > > > > > a controller joining a quorum with a snapshot with a known size
> > > > > > and then attempting to fetch the snapshot in small (say 64kb) or
> > > > > > even pathologically small values like 1 byte." I don't fully
> > > > > > understand this recommendation but if my understanding is
> > > > > > correct this sounds very complicated and flaky to implement.
> > > > > > Can't we just add protocol tests to KafkaRaftClient*Test that
> > > > > > show these configurations being used when handling FETCH and
> > > > > > FETCH_SNAPSHOT requests?
> > > > > >
> > > > > > I didn't comment on the proposed changes section. I'll comment
> > > > > > after the comments above are addressed.
> > > > > >
> > > > > > Thanks
> > > > > > --
> > > > > > -José
> > > >
> > > > --
> > > > -José
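
For reference, here is a minimal sketch of how the configurations discussed
above could look in a controller's properties file. The two
controller.quorum.fetch.*.max.bytes entries use the names proposed in the
KIP, and all values below are purely illustrative; the deliberately small
fetch caps are only meant for the kind of correctness test José describes
(forcing a snapshot of known size to be fetched across many FETCH_SNAPSHOT
round trips), not for production use:

    # Existing fetch timeout Jun mentions; 2000 ms is the current default.
    controller.quorum.fetch.timeout.ms=2000

    # Proposed caps on the bytes returned per FETCH / FETCH_SNAPSHOT response.
    # 65536 (64 KB) is an illustrative, pathologically small test value.
    controller.quorum.fetch.max.bytes=65536
    controller.quorum.fetch.snapshot.max.bytes=65536

    # Orthogonal knob for high-latency links, per Jun's suggestion
    # (illustrative value).
    socket.receive.buffer.bytes=1048576

In a test along the lines José suggests, the two fetch caps could be lowered
even further (down to 1 byte) to verify that snapshot loading still makes
progress and completes within controller.quorum.fetch.timeout.ms.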
