Hi Jose,

Re: JS1

I agree that losing a majority of controllers can result in metadata loss -- stale ISR/ELR or missing leader epochs could lead to unclean leader elections or write truncation, and that's not something Kafka can cleanly recover from. That said, in practice, operators facing permanent loss of 2 of 3 controllers still need some path to resume operations, even accepting data loss.

After overriding the voter set, the controller's stale ISR view would gradually get corrected -- AlterPartition from active leaders updates ISR membership, and broker fencing removes dead brokers after heartbeat timeout. However, leader elections made before that convergence would be based on stale ISR, potentially electing a replica that has fallen far behind or should have been removed from the ISR -- leading to truncation of writes that the old leader had accepted. And ISR convergence is only part of the recovery picture -- other lost metadata (topic creation/deletion records, partition reassignments, config changes) that was committed but not replicated to the surviving controller cannot self-heal via AlterPartition.

Would you see KIP-1347's voter set override as a reasonable mechanism to resume cluster operations in this scenario, with the understanding that some metadata loss is unrecoverable? Or should this KIP explicitly scope itself to the "all controllers alive but unreachable" case, and leave majority-loss recovery to a separate effort?

Thanks,
Abhijeet

On Tue, Jun 30, 2026 at 5:46 PM Luke Chen <[email protected]> wrote:

> Hi all,
>
> After more investigation, I think my previous "fallback to bootstrap server
> for VOTE, BEGIN_QUORUM_EPOCH..." solution will not work.
> Please ignore it. Sorry for the noise.
>
> Thanks,
> Luke
>
>
> On Mon, Jun 29, 2026 at 9:26 PM Luke Chen <[email protected]> wrote:
>
> > Hi Jose,
> >
> > Thanks for the response.
> > One question from me:
> >
> > JS5
> > In the "Broker considerations" section you have:
> > "It uses these endpoints to connect to the KRaft controller quorum.
> > The controller.quorum.bootstrap.servers configuration is not used to
> > reach out the controllers."
> >
> > This is not entirely accurate. Kafka nodes discover the active
> > controller using controller.quorum.bootstrap.servers if defined. If it
> > is not defined they fall back to using "controller.quorum.voters". In
> > general, the endpoints in the voter set are used by the controllers
> > (voters) to send KRaft election RPCs like VOTE, BEGIN_QUORUM_EPOCH,
> > etc.
> >
> > => Yes, you're right about this. The FETCH request can rely on
> > `controller.quorum.bootstrap.servers` to connect to the leader controller
> > to fetch logs. I'm wondering why we can't do something similar to the FETCH
> > request and fall back to `controller.quorum.bootstrap.servers` to send VOTE,
> > BEGIN_QUORUM_EPOCH, ... requests to get the updated controller endpoints
> > and send them out? Or we can use the response from the FETCH request to
> > update the endpoints in the voter set cache so that VOTE, BEGIN_QUORUM_EPOCH
> > can be used next. If we can do so, we don't have to do this voter set
> > overriding in this KIP. WDYT?
> >
> >
> > Thank you,
> > Luke
> >
> > On Thu, Jun 25, 2026 at 7:38 PM Paolo Patierno <[email protected]>
> > wrote:
> >
> >> > Re: JS2
> >>
> >> I am not sure why you are saying that Strimzi has a limitation and doesn't
> >> provide a stable network identity.
> >> Strimzi uses a headless service and all brokers and controllers get a
> >> network identity and they are directly reachable with a usual DNS name
> >> like, for example, my-cluster-controller-0.my-namespace.svc.dns-domain
> >> (which by default is usually something like cluster.local but could be
> >> defined by the user depending on the infrastructure).
> >>
> >> > Re: LC2
> >>
> >> About "... some k8s operators format ... " can you provide me more
> >> information about which Kafka operators you are referring to?
> >> I think that an operator having such behavior as you describe really lacks
> >> idempotency because, in case of a controller rolling, it needs to make a
> >> distinction if the new pod is starting up as a new controller cluster (the
> >> first one or an additional one) or it's just a rolling because of other
> >> reasons (i.e. config change, manual restart, ...). Strimzi starts up all
> >> the controller nodes together using the bootstrap with multiple
> >> controllers, which works fine.
> >>
> >> > JS4 What happens if a snapshot already exists at that log end offset?
> >>
> >> What do you mean by that? The code looks at the end offset of the log; if
> >> it's the end offset, how could a snapshot already exist? Maybe there is a
> >> lack of knowledge on my side here.
> >>
> >> > JS5
> >> > This is not entirely accurate. Kafka nodes discover the active
> >> > controller using controller.quorum.bootstrap.servers if defined. If it
> >> > is not defined they fall back to using "controller.quorum.voters". In
> >> > general, the endpoints in the voter set are used by the controllers
> >> > (voters) to send KRaft election RPCs like VOTE, BEGIN_QUORUM_EPOCH,
> >> > etc.
> >>
> >> To be honest, it's not what I experienced. If it were this way, my proposal
> >> would be totally useless because the Strimzi operator already updates and
> >> rolls all the controller nodes with the new
> >> controller.quorum.bootstrap.servers configuration with the new DNS names.
> >> But on restarting, each controller is still looking at the VotersRecord
> >> with the old DNS names and it's not taking care of the new
> >> controller.quorum.bootstrap.servers configuration with the new DNS names.
> >> Maybe Luke can confirm (or not?) what I just mentioned.
> >>
> >> > JS6
> >> > I still don't understand why the Kafka k8s operators can't take
> >> > advantage of k8s' Headless Service to have multiple DNS names for the
> >> > same Kafka controller pods. Based on my research this is exactly how
> >> > the etcd-operator manages etcd clusters hosted by k8s. At a high
> >> > level, KRaft and etcd have very similar designs and configurations
> >> > because they are both inspired by Raft.
> >>
> >> As already mentioned for JS2, Strimzi uses a headless service for the
> >> brokers and controllers and they all get a DNS name like
> >> my-cluster-controller-0.my-namespace.svc.dns-domain.
> >> I am not sure what you mean by having a headless service with "multiple
> >> DNS names"; it's not possible. Or I am misunderstanding what you mean.
> >> Can you please provide any reference about what you found around
> >> etcd-operator? Maybe it will help me understand what you mean. Thanks!
> >>
> >> Thanks,
> >> Paolo
> >>
> >>
> >> On Thu, 18 Jun 2026 at 21:18, José Armando García Sancio via dev <
> >> [email protected]> wrote:
> >>
> >> > Hi Paolo,
> >> >
> >> > Re: JS2
> >> > The solution you propose to address Strimzi's limitation (not providing
> >> > a stable network layer to the Kafka StatefulSet) is incompatible with
> >> > KRaft's replication and dynamic reconfiguration. In short, if the KIP
> >> > overrides the voters per node, it will cause diverging states across
> >> > the nodes when dynamic reconfiguration is present.
> >> >
> >> > It is important to distinguish between required and standard
> >> > operations like formatting the bootstrapping controller(s), and
> >> > dangerous recovery operations like overriding the voter set endpoints
> >> > without KRaft's validations and invariants.
> >> >
> >> > Re: LC2
> >> > As Luke mentioned, we are making a concerted effort to remove the need
> >> > to format the Kafka nodes. With KIP-1262, users and k8s operators are
> >> > only required to run format on the initial/bootstrapping controllers.
> >> > For example, some k8s operators format the Kafka cluster by formatting
> >> > only one controller with --standalone and then increasing the
> >> > controller cluster by adding the other controllers using the
> >> > mechanisms provided by KIP-853.
> >> >
> >> > JS4
> >> > > Create a new snapshot at the current log end offset containing:
> >> > What happens if a snapshot already exists at that log end offset?
> >> >
> >> > JS5
> >> > In the "Broker considerations" section you have:
> >> > "It uses these endpoints to connect to the KRaft controller quorum.
> >> > The controller.quorum.bootstrap.servers configuration is not used to
> >> > reach out the controllers."
> >> >
> >> > This is not entirely accurate. Kafka nodes discover the active
> >> > controller using controller.quorum.bootstrap.servers if defined. If it
> >> > is not defined they fall back to using "controller.quorum.voters". In
> >> > general, the endpoints in the voter set are used by the controllers
> >> > (voters) to send KRaft election RPCs like VOTE, BEGIN_QUORUM_EPOCH,
> >> > etc.
> >> >
> >> > JS6
> >> > I still don't understand why the Kafka k8s operators can't take
> >> > advantage of k8s' Headless Service to have multiple DNS names for the
> >> > same Kafka controller pods. Based on my research this is exactly how
> >> > the etcd-operator manages etcd clusters hosted by k8s. At a high
> >> > level, KRaft and etcd have very similar designs and configurations
> >> > because they are both inspired by Raft.
> >> >
> >> > Thanks,
> >> > -Jose
> >> >
> >> >
> >> >
> >> > On Mon, May 18, 2026 at 9:56 AM Paolo Patierno <[email protected]>
> >> > wrote:
> >> > >
> >> > > Hi all,
> >> > > I would like to start a discussion on KIP-1347
> >> > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-1347%3A+Overriding+voter+set+on+storage+formatting>
> >> > > which is about allowing the override of the voter set through the
> >> > > storage formatting tool to recover from a disaster scenario where the
> >> > > KRaft quorum can't be formed anymore. This KIP aims to fix KAFKA-20427
> >> > > <https://issues.apache.org/jira/browse/KAFKA-20427>.
> >> > > Any feedback is very welcome.
> >> > >
> >> > > Thanks,
> >> > > Paolo.
> >> > >
> >> > > --
> >> > > Paolo Patierno
> >> > >
> >> > > *Senior Principal Software Engineer @ IBM*
> >> > > *CNCF Ambassador*
> >> > >
> >> > > Twitter : @ppatierno <http://twitter.com/ppatierno>
> >> > > Linkedin : paolopatierno <http://it.linkedin.com/in/paolopatierno>
> >> > > GitHub : ppatierno <https://github.com/ppatierno>
> >> >
> >> >
> >> >
> >> > --
> >> > -José
> >> >
> >>
> >>
> >> --
> >> Paolo Patierno
> >>
> >> *Senior Principal Software Engineer @ IBM*
> >> *CNCF Ambassador*
> >>
> >> Twitter : @ppatierno <http://twitter.com/ppatierno>
> >> Linkedin : paolopatierno <http://it.linkedin.com/in/paolopatierno>
> >> GitHub : ppatierno <https://github.com/ppatierno>
> >>
> >
>
