Hi Fede,

I reviewed the KIP-1279 proposal yesterday and corrected the KIP number. I now have time to share my detailed observations. While I fully support the goal of removing Kafka's operational complexity, the design appears to trade that operational complexity for a new risk to broker stability.
By moving WAN replication into the broker's core runtime, we are effectively removing the failure domain isolation that MirrorMaker 2 provides. We risk coupling the stability of our production clusters to the instability of cross-datacenter networks.

Before this KIP moves to a vote, I strongly recommend that you and the other authors address the following stability gaps. Without concrete answers here, the risk profile is likely too high for mission-critical deployments.

1. The Thundering Herd and Memory Isolation Risk

In the current architecture, MirrorMaker 2 (MM2) Connect workers provide a physical failure domain through a separate JVM heap. This isolates the broker from the memory pressure and garbage collection (GC) impact caused by replication surges. In this proposal, that pressure hits the broker's core runtime directly.

The Gap: We need simulation data for a sustained link outage (e.g., 6 hours on a 10 Gbps link). When 5,000 partitions resume fetching, does the resulting backfill I/O and heap pressure cause GC pauses that push P99 produce latency on the target cluster over 10 ms? We must ensure that a massive catch-up phase does not starve the broker's request handler threads or destabilize the JVM.

2. Blast Radius (Poison Pill Problem)

The Gap: If a source broker sends a malformed batch (e.g., bit rot), does it crash the entire broker process? In MM2, this kills a single task. We need confirmation that exceptions are isolated to the replication thread pool and will not trigger a node-wide panic (a minimal sketch of the isolation I have in mind follows item 7 below).

3. Control Plane Saturation

The Gap: How does the system handle a "link flap" event where 50,000 partitions transition states rapidly? We need to verify that the resulting flood of metadata updates will not block the controller from processing critical ISR changes for local topics.

4. Transactional Integrity

"Byte-for-byte" replication copies transaction markers but not the coordinator's state (PIDs).

The Gap: How does the destination broker validate an aborted transaction without the source PID? We should avoid creating "zombie" transactions that look valid but cannot be authoritatively managed.

5. Infinite Loop Prevention

Since byte-for-byte replication precludes injecting lineage headers (e.g., dc-source), we lose the standard mechanism for detecting loops in mesh topologies (A→B→A).

The Gap: Relying solely on topic naming conventions is operationally fragile. What is the deterministic mechanism to prevent infinite recursion?

6. Data Divergence and Epoch Reconciliation

The current proposal explicitly excludes support for unclean leader election because there is no mechanism for a "shared leader epoch" between clusters.

The Gap: Without epoch reconciliation, if the source cluster experiences an unclean election, the source and destination logs will diverge. If an operator later attempts a failback (reverse mirroring), the clusters will contain inconsistent data for the same offsets, leading to potential silent data corruption or permanent replication failure.

7. Tiered Storage Operational Gaps

The design states that Tiered Storage is not initially supported and that a mirror follower encountering an OffsetMovedToTieredStorageException will simply mark the partition as FAILED.

The Gap: For mission-critical clusters using Tiered Storage for long-term retention, this creates an operational cliff. Mirroring will fail as soon as the source cluster offloads data to remote storage. We need a roadmap for how native mirroring will eventually interact with tiered segments without failing the partition.
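To make the item 7 concern concrete, here is a minimal sketch of the kind of graceful degradation I would hope the roadmap covers. This is purely illustrative and assumes only the OffsetMovedToTieredStorageException class already named in the KIP (org.apache.kafka.common.errors); the class, method, and state names below are invented for this example and are not APIs from KIP-1279 or the Kafka codebase.

    // Hypothetical sketch only: back off on a tiered-storage miss instead of
    // permanently failing the partition.
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.errors.OffsetMovedToTieredStorageException;

    public class MirrorTieredFetchFallback {

        enum MirrorPartitionState { MIRRORING, WAITING_FOR_TIERED_DATA, FAILED }

        // Retry a bounded number of times, giving a future tiered-aware fetch path
        // (or a catch-up back to local segments) a chance to serve the range.
        MirrorPartitionState onFetchError(TopicPartition tp, Throwable error,
                                          int attempts, int maxAttempts) {
            if (error instanceof OffsetMovedToTieredStorageException && attempts < maxAttempts) {
                return MirrorPartitionState.WAITING_FOR_TIERED_DATA;
            }
            // Current proposal: any tiered-storage miss marks the partition FAILED.
            return MirrorPartitionState.FAILED;
        }
    }

Even a bounded retry like this would turn the "operational cliff" into a recoverable condition that operators can alert on.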
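And here is the isolation property I am asking about in item 2, again as a hypothetical sketch under my own assumptions: MirrorFetcherTask, replicateNextBatch, and assignedPartitions are invented names, not anything from the KIP or the broker code. The point is simply that a failure while replicating one partition should quarantine that partition, not the broker.

    // Hypothetical illustration of per-partition failure isolation.
    import org.apache.kafka.common.TopicPartition;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class MirrorFetcherTask implements Runnable {

        private final Map<TopicPartition, Exception> quarantined = new ConcurrentHashMap<>();

        @Override
        public void run() {
            for (TopicPartition tp : assignedPartitions()) {
                if (quarantined.containsKey(tp)) continue;
                try {
                    replicateNextBatch(tp); // may throw on a malformed or corrupt batch
                } catch (Exception e) {
                    // Desired behaviour: record the failure and keep mirroring the
                    // other partitions; the broker process itself is untouched.
                    quarantined.put(tp, e);
                }
                // Note: an uncaught Error (e.g., OutOfMemoryError) would still escape
                // this handler, which is exactly the blast-radius question above.
            }
        }

        private List<TopicPartition> assignedPartitions() { return List.of(); } // stub
        private void replicateNextBatch(TopicPartition tp) throws Exception { } // stub
    }

If the KIP can state that the mirror fetch path gives this guarantee (and say what happens for Errors rather than Exceptions), that would address most of my blast-radius concern.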
8. Transactional State and PID Mapping

While the KIP proposes a deterministic formula for rewriting producer IDs, calculated as destinationProducerId = sourceProducerId + 2, it does not replicate the transaction_state metadata.

The Gap: How does the destination broker authoritatively validate or expire hanging transactions if the source PID is rewritten but the transaction coordinator state is missing? We risk a scenario where consumers encounter zombie transactions that can never be decided on the destination cluster.

This is a significant change to the architecture of our systems, and we should make sure it does not introduce a new single point of failure that could bring a whole cluster down.

Regards,
Viquar Khan

*LinkedIn* - https://www.linkedin.com/in/vaquar-khan-b695577/
*Book* - https://us.amazon.com/stores/Vaquar-Khan/author/B0DMJCG9W6?ref=ap_rdr&shoppingPortalEnabled=true
*GitBook* - https://vaquarkhan.github.io/microservices-recipes-a-free-gitbook/
*Stack* - https://stackoverflow.com/users/4812170/vaquar-khan
*GitHub* - https://github.com/vaquarkhan

On Sat, 14 Feb 2026 at 01:18, Federico Valeri <[email protected]> wrote:

> Hi, we would like to start a discussion thread about KIP-1279: Cluster
> Mirroring.
>
> Cluster Mirroring is a new Kafka feature that enables native,
> broker-level topic replication across clusters. Unlike MirrorMaker 2
> (which runs as an external Connect-based tool), Cluster Mirroring is
> built into the broker itself, allowing tighter integration with the
> controller, coordinator, and partition lifecycle.
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1279%3A+Cluster+Mirroring
>
> There are a few missing bits, but most of the design is there, so we
> think it is the right time to involve the community and get feedback.
> Please help validating our approach.
>
> Thanks
> Fede
