Hi Viquar,

Thank you for taking the time to provide such detailed feedback on KIP-1279. I really appreciate your thorough review and the opportunity to clarify several aspects of the design. Let me address each of your points:
VK1 - Thundering Herd and Memory Isolation

I want to clarify the throttling mechanisms in the design, as I believe there may be some misunderstanding about how this compares to MM2 in production:

- Throttling is built in: the KIP explicitly includes the mirror.replication.throttled.rate (broker-level) configuration. It operates identically to intra-cluster replication throttling, which has proven stable at scale for years. We are also planning a follow-up KIP that extends the same protocol with source-side throttling, similar to leader.replication.throttled.rate.

- MM2 doesn't isolate impact: in production, MM2 acts as a noisy neighbor. During catch-up phases it saturates network bandwidth, competes for broker disk I/O on the source cluster, and creates backpressure on the destination cluster through produce requests. The "separate JVM" doesn't eliminate this; it just moves the memory pressure to a different process that still shares the same physical resources (network, disk).

I'd be interested to hear if you have specific production experience with MM2 that differs from this characterization.

VK2 - Blast Radius (Poison Pill)

Great question about error isolation. The KIP uses Kafka's standard Fetch protocol, which already handles malformed batches safely:

- Thread pool isolation: as described in the MirrorFetcherManager section, mirror fetchers run in a dedicated thread pool (mirror.num.replica.fetchers), completely separate from the intra-cluster replication threads.

- Existing precedent: when a ReplicaFetcherThread encounters a corrupt batch today, it marks that partition as failed and continues processing the other partitions. The same exception handling applies to MirrorFetcherThread: the partition transitions to the FAILED state while other partitions continue mirroring.

- No node-wide panic: Kafka brokers don't crash when individual fetch threads encounter errors. This is well-tested behavior across billions of partitions in production.

Interestingly, the blast radius is actually smaller than MM2's, where a malformed batch causes the entire Connect task to restart, potentially disrupting hundreds of partitions simultaneously.

VK3 - Control Plane Saturation

I think there may be some confusion about how state transitions work in the design. Let me clarify:

- State transitions during link flaps: partition state changes (MIRRORING → FAILED → PREPARING → MIRRORING) are written to __mirror_state, not the controller's metadata log. This is a dedicated coordinator topic, analogous to __consumer_offsets. State flaps don't touch the controller.

- Metadata synchronization: the MirrorMetadataManager queries source cluster metadata on a configurable interval (mirror.metadata.refresh.interval.ms, default 30s). This is a background operation that doesn't generate heavy controller write traffic.

- Controller interaction: the only controller interaction occurs during addTopicsToMirror/removeTopicsFromMirror API calls, which are operator-driven administrative actions, not automatic responses to link flaps.

Could you help me understand what specific 'metadata updates blocking ISR changes' scenario you're concerned about? I want to make sure the design explicitly addresses your use case.

VK4 - Transactional Integrity

This is an important point, and I think it highlights a common misunderstanding of Kafka's transactional protocol. Let me break down how it works. Consumer-side transaction visibility requires only log-level markers, not coordinator state.
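To make the consumer side concrete, here is a minimal sketch of what a destination-side consumer looks like. There is nothing mirroring-specific in it: this is just the standard Java client with isolation.level=read_committed (the bootstrap address, group id, and topic name are placeholders), and the LSO it reads up to is derived by the broker from the mirrored control markers:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReadCommittedConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "dest-cluster:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "dr-readiness-check");          // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // The only transaction-related knob a consumer has: read up to the LSO.
        // The broker derives the LSO from the COMMIT/ABORT markers in the log
        // (which mirroring copies byte-for-byte); the consumer never talks to a
        // transaction coordinator and never sees producer IDs.
        props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("mirrored-topic")); // placeholder topic name
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                // Only records below the Last Stable Offset are ever returned here.
                System.out.printf("%s-%d@%d%n", record.topic(), record.partition(), record.offset());
            }
        }
    }
}

The same code works unchanged against the source cluster; mirroring does not change the consumer contract in any way.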
- What consumers need: the Last Stable Offset (LSO) is computed from the control markers (COMMIT/ABORT) in the log. Consumers reading with isolation.level=read_committed only see records up to the LSO. They never interact with the transaction coordinator or validate PIDs.

- What the KIP does: it mirrors all data records and control markers byte-for-byte; during failover (the STOPPING → STOPPED transition) it truncates to the last mirrored LSO to ensure no incomplete transactions remain; and topics are read-only until failover, so no new transactions can be started on the destination.

- Why coordinator state is irrelevant: no producers write to mirrored topics (they're read-only); control markers are replicated as data, not generated by the destination's coordinator; and after failover, producers reconnect with new PIDs and transactional IDs.

There are no 'zombie transactions' because incomplete transactions are truncated before the topic becomes writable. I've added additional clarification on this in the "Transactional Consumer Guarantees" section of the KIP.

VK5 - Infinite Loop Prevention

I believe this concern relates to MM2's active-active use case, which is outside the scope of this KIP. Let me explain why loops aren't possible in our design:

- Read-only enforcement: mirrored topics cannot accept produce requests (they throw ReadOnlyTopicException). This physically prevents A→B→A loops, because cluster B cannot write data back to the topic while it is being mirrored.

- Failover is explicit: the removeTopicsFromMirror API is a deliberate operator action that makes a topic writable. At that point the topic is no longer mirrored from A, so there is no loop.

- Bidirectional mirroring ≠ active-active: the KIP supports bidirectional mirroring of different topics (e.g., A mirrors topic-x from B, B mirrors topic-y from A). Same-topic loops are impossible due to the read-only semantics: you cannot set up a mirror on a topic that is itself actively mirrored.

As stated in the 'Active-Active Writes' section, this KIP doesn't attempt to replace MM2 for multi-master scenarios. It is focused on DR, failover, and migration, where operational simplicity is paramount. MM2's header-based loop detection is necessary precisely because it allows active-active writes. That complexity is a feature of MM2, not a requirement for all replication systems.

VK6 - Data Divergence and Epoch Reconciliation

The KIP explicitly documents this limitation in the 'Non-Goals: Unclean Leader Election' section. This is a conscious design decision, not an oversight:

- Clear documentation: the KIP states that when unclean.leader.election.enable=true, brokers log a warning at every sync cycle. Operators are explicitly informed that this configuration is unsupported.

- Shared epoch requirement: resolving unclean elections across clusters requires synchronous cross-cluster communication to establish a shared leader epoch. This introduces cross-datacenter latency into the critical path of leader elections, which contradicts the asynchronous design principle.

- Alternatives for zero data loss: operators who require protection against unclean elections should either disable unclean leader election (best practice for mission-critical data; see the small Admin-client sketch at the end of this section) or wait for the follow-up KIP on this. Stretched clusters are another option, though as noted they provide no DR protection.

- Fallback behavior: during failback, if the source cluster experienced an unclean election, the LastMirroredOffset API will detect the divergence. The operator must then choose either to truncate to the last known good offset (accepting the loss of post-divergence records) or to manually reconcile the logs. This is no worse than MM2, which has no mechanism to detect or resolve log divergence after unclean elections.
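Coming back to the zero-data-loss alternatives above: disabling unclean leader election is just the standard topic (or broker-default) configuration. A minimal Admin-client sketch, with a placeholder topic name and bootstrap address:

import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class DisableUncleanElectionSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "source-cluster:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            // Topic-level override for a single example topic; the same key can be
            // applied as a broker default for the whole cluster.
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders"); // placeholder
            AlterConfigOp disableUnclean = new AlterConfigOp(
                    new ConfigEntry("unclean.leader.election.enable", "false"),
                    AlterConfigOp.OpType.SET);
            Map<ConfigResource, Collection<AlterConfigOp>> updates =
                    Map.of(topic, List.of(disableUnclean));
            admin.incrementalAlterConfigs(updates).all().get();
        }
    }
}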
VK7 - Tiered Storage Operational Gaps

You're right that a roadmap would be helpful here. The KIP explicitly states: 'Tiered Storage is not initially supported, but a detailed design will be provided in a follow-up KIP.' This is intentional phasing, not an oversight:

- Complexity separation: tiered storage support requires changes to fetch semantics (fetching from remote storage instead of local logs) and to offset management. Designing this correctly requires dedicated focus.

- Current workaround: for clusters using tiered storage today, operators can mirror only recent data by configuring source cluster retention to keep data in local storage for the required retention period, and can use the FAILED state as a signal to manually intervene (e.g., temporarily increase source retention).

- Roadmap commitment: the 'Future Work' section explicitly lists tiered storage as a follow-up. This is standard KIP practice: core functionality first, extensions later.

- Comparison: MM2 doesn't have native tiered storage integration either. It fetches from brokers, which serve data from local storage (reading from remote storage when needed). However, MM2 doesn't guarantee the destination topic is also tiered; the limitation is the same here.

VK8 - Transactional State and PID Mapping

This relates back to VK4. Let me restate the transactional protocol more explicitly. There are three independent components:

- Transaction coordinator (source cluster): tracks active transactions via __transaction_state and times out hanging transactions. It only matters for producers writing to the source cluster.

- Log-level markers (mirrored): COMMIT/ABORT markers are appended to topic partitions and determine the LSO. This is what gets replicated byte-for-byte.

- Consumer read isolation (destination cluster): consumers read up to the LSO based on the markers in the log. They never query the transaction coordinator.

Why the destination doesn't need coordinator state: mirrored topics are read-only, so no producers write to them and the destination's transaction coordinator never gets involved. When failover occurs, the KIP truncates to the LSO, removing any records without markers. After failover, producers reconnect to the destination cluster with new producer IDs and transactional IDs, and the destination's transaction coordinator manages these new transactions normally.

The PID rewriting (-(sourceProducerId + 2)) is purely to avoid ID conflicts. It doesn't affect transactional semantics because the LSO calculation uses markers, not PIDs, and consumers validate transaction state via markers, not PIDs. There is no zombie transaction risk because the destination coordinator is never responsible for transactions that originated on the source cluster.
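To show why the rewrite can never collide with IDs handed out by the destination cluster, here is a tiny sketch of the formula above (the class and method names are mine, and the remark about the -1 sentinel is my reading rather than KIP wording):

public final class MirroredPidMapping {

    private MirroredPidMapping() { }

    // The deterministic rewrite quoted above: destinationPid = -(sourcePid + 2).
    // Broker-assigned producer IDs are always >= 0, so every rewritten ID is
    // strictly negative and can never clash with an ID handed out by the
    // destination cluster. (My reading: the +2 also keeps source PID 0 away
    // from the -1 sentinel that means "no producer ID".)
    public static long toDestination(long sourceProducerId) {
        return -(sourceProducerId + 2);
    }

    // The mapping is trivially reversible, which keeps failback bookkeeping simple.
    public static long toSource(long destinationProducerId) {
        return -destinationProducerId - 2;
    }

    public static void main(String[] args) {
        long sourcePid = 1000L;                  // example source PID
        long destPid = toDestination(sourcePid); // -> -1002
        System.out.printf("source=%d destination=%d roundTrip=%d%n",
                sourcePid, destPid, toSource(destPid));
    }
}

So, for example, source PID 1000 lands at -1002 on the destination, a value the destination's coordinator can never assign.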
Updates to the KIP

To help address these questions, I've updated the KIP with:

- A detailed comparison table between MM2 and KIP-1279 (in the Rejected Alternatives section)
- Additional clarification in the "Transactional Consumer Guarantees" section

I hope this addresses your concerns! I'm happy to discuss any of these points further, or hop on a call if that would be helpful. Your feedback is helping make this KIP stronger.

Best,
Omnia

> On 14 Feb 2026, at 20:37, vaquar khan <[email protected]> wrote:
>
> Hi Fede,
>
> I reviewed the KIP-1279 proposal yesterday and corrected the KIP number. I
> now have time to share my very detailed observations. While I fully support
> the goal of removing the operational complexity of Kafka , the design
> appears to trade that complexity for broker stability.
>
> By moving WAN replication into the broker’s core runtime, we are
> effectively removing the failure domain isolation that MirrorMaker 2
> provides. We risk coupling the stability of our production clusters to the
> instability of cross-datacenter networks.Before this KIP moves to a vote, I
> strongly recommend you and other authors to address the following stability
> gaps. Without concrete answers here, the risk profile is likely too high
> for mission-critical deployments.
>
> 1. The Thundering Herd and Memory Isolation Risk
> In the current architecture, MirrorMaker 2 (MM2) Connect workers provide a
> physical failure domain through a separate JVM heap. This isolates the
> broker from the memory pressure and Garbage Collection (GC) impact caused
> by replication surges. In this proposal, that pressure hits the broker’s
> core runtime directly.
>
> The Gap: We need simulation data for a sustained link outage (e.g., 6 hours
> on 10Gbps). When 5,000 partitions resume fetching, does the resulting
> backfill I/O and heap pressure cause GC pauses that push P99 Produce
> latency on the target cluster over 10ms? We must ensure that a massive
> catch-up phase does not starve the broker's Request Handler threads or
> destabilize the JVM.
>
>
> 2. Blast Radius (Poison Pill Problem)
> The Gap: If a source broker sends a malformed batch (e.g., bit rot), does
> it crash the entire broker process? In MM2, this kills a single task. We
> need confirmation that exceptions are isolated to the replication thread
> pool and will not trigger a node-wide panic.
>
> 3. Control Plane Saturation
> The Gap: How does the system handle a "link flap" event where 50,000
> partitions transition states rapidly? We need to verify that the resulting
> flood of metadata updates will not block the Controller from processing
> critical ISR changes for local topics.
>
> 4. Transactional Integrity
> "Byte-for-byte" replication copies transaction markers but not the
> Coordinator’s state (PIDs).
> The Gap: How does the destination broker validate an aborted transaction
> without the source PID? We should avoid creating "zombie" transactions that
> look valid but cannot be authoritatively managed.
>
> 5. Infinite Loop Prevention
> Since byte-for-byte precludes injecting lineage headers e.g., dc-source, we
> lose the standard mechanism for detecting loops in mesh topologies (A→B→A).
> The Gap: Relying solely on topic naming conventions is operationally
> fragile. What is the deterministic mechanism to prevent infinite recursion?
>
> 6. Data Divergence and Epoch Reconciliation
> The current proposal explicitly excludes support for unclean leader
> election because there is no mechanism for a "shared leader epoch" between
> clusters.
> The Gap: Without epoch reconciliation, if the source cluster experiences an
> unclean election, the source and destination logs will diverge. If an
> operator later attempts a failback (reverse mirroring), the clusters will
> contain inconsistent data for the same offset, leading to potential silent
> data corruption or permanent replication failure.
>
> 7. Tiered Storage Operational Gaps
> The design states that Tiered Storage is not initially supported and that a
> mirror follower encountering an OffsetMovedToTieredStorageException will
> simply mark the partition as FAILED.
> The Gap: For mission-critical clusters using Tiered Storage for long-term
> retention, this creates an operational cliff. Mirroring will fail as soon
> as the source cluster offloads data to remote storage. We need a roadmap
> for how native mirroring will eventually interact with tiered segments
> without failing the partition.
>
> 8. Transactional State and PID Mapping
> While the KIP proposes a deterministic formula for rewriting Producer IDs
> ,calculated as destinationProducerId= (sourceProducerId+2) it does not
> replicate the transaction_state metadata.
> The Gap: How does the destination broker authoritatively validate or expire
> hanging transactions if the source PID state is rewritten but the
> transaction coordinator state is missing?
> We risk a scenario where consumers encounter zombie transactions that can
> never be decided on the destination cluster.
>
> This is a big change to how our system is built. We need to make sure it
> doesn't create a weak link that could bring the whole system down,We should
> ensure it does not introduce a new single point of failure.
>
> Regards,
> Viquar Khan
> *Linkedin *-https://www.linkedin.com/in/vaquar-khan-b695577/
> *Book *-
> https://us.amazon.com/stores/Vaquar-Khan/author/B0DMJCG9W6?ref=ap_rdr&shoppingPortalEnabled=true
> *GitBook*-https://vaquarkhan.github.io/microservices-recipes-a-free-gitbook/
> *Stack *-https://stackoverflow.com/users/4812170/vaquar-khan
> *github*-https://github.com/vaquarkhan
>
> On Sat, 14 Feb 2026 at 01:18, Federico Valeri <[email protected]> wrote:
>
>> Hi, we would like to start a discussion thread about KIP-1279: Cluster
>> Mirroring.
>>
>> Cluster Mirroring is a new Kafka feature that enables native,
>> broker-level topic replication across clusters. Unlike MirrorMaker 2
>> (which runs as an external Connect-based tool), Cluster Mirroring is
>> built into the broker itself, allowing tighter integration with the
>> controller, coordinator, and partition lifecycle.
>>
>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1279%3A+Cluster+Mirroring
>>
>> There are a few missing bits, but most of the design is there, so we
>> think it is the right time to involve the community and get feedback.
>> Please help validating our approach.
>>
>> Thanks
>> Fede
>>
