Hi Omnia,

Thank you for the detailed response and the updated KIP. I went through your
answers and cross-referenced them with the codebase and our architectural
analysis. You make fair points on several of these, so I want to cover
everything to ensure we are on the same page. We are aligned on about half
the list, but your clarifications on the architecture actually highlight a
few deeper code-level blockers that we absolutely have to fix before this
goes to a vote.

Here is my complete feedback on all the issues discussed, plus a few
protocol gaps I found during the code review.


*The Aligned Items (Resolved or Accepted Limitations)*
*VK2 - Blast Radius (Poison Pill):* I agree with you here. The thread pool
isolation in the fetcher manager handles the blast radius well. Since
exceptions only transition the specific partition to FAILED without killing
the fetcher thread, this is resolved.

*VK5 - Infinite Loop Prevention:* Your explanation makes sense. Since
KIP-1279 strictly enforces read-only semantics on the destination topics,
active-active loops are architecturally prevented. Resolved.

*VK6 - Data Divergence and Epoch Reconciliation:* I accept this as a
documented limitation. As long as operators are aware that enabling unclean
leader election risks permanent log divergence that this tool cannot
reconcile, we can move forward.

*VK7 - Tiered Storage Operational Gaps:* I also accept this as a documented
limitation for the initial release, pending the follow-up KIP. Operators
will just need to configure local log retention carefully to avoid the
operational cliff.


*The Critical Blockers & Operational Risks*
*VK4 & VK8 - Transactional Integrity & PID Formula Bug (Critical):* You are
perfectly correct that we do not need to replicate the full
_transaction_state from the source Coordinator, and that consumers rely on
log markers. However, the destination broker still needs to track these
transactions in memory to calculate the Last Stable Offset (LSO). This is
exactly where the design breaks.

I looked at AbstractRecordBatch.java: hasProducerId() only returns true
when the ID is greater than -1, and your formula -(sourceProducerId + 2)
always produces a negative value. Because that check fails, the broker
skips ProducerStateManager.update(), the ongoingTxns map stays empty, and
the LSO falls back to the High Watermark, so read_committed consumers will
immediately read uncommitted data. We need to shift the formula into the
positive long space (e.g., Long.MAX_VALUE - sourceProducerId - 1) so it
passes these internal checks.

*VK11 - TransactionIndex Rebuild Ambiguity (Critical):* This follows
directly from the PID rewrite issue in VK8. The KIP does not specify whether
the TransactionIndex is rebuilt locally from the mirrored log or copied
byte-for-byte from the source. If it is copied, the index will contain the
original source PIDs, while the destination log contains our rewritten
PIDs. Because they will not match, consumers will fail the abort check and
end up reading aborted messages. We need to explicitly mandate in the KIP
that the TransactionIndex must be rebuilt locally.
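To illustrate the mismatch (purely an illustration, using a simplified
AbortedTxn shape rather than Kafka's actual index classes): if the index is
copied from the source, the abort lookup keys on the original PID and never
matches the rewritten PID in the mirrored log.

import java.util.List;

// Illustration only: simplified stand-ins for transaction index entries
// and read_committed abort filtering, not Kafka's real classes.
public class TxnIndexMismatchSketch {
    record AbortedTxn(long producerId, long firstOffset, long lastOffset) {}

    // A byte-for-byte copied index still carries the *source* producer ID.
    static final List<AbortedTxn> copiedIndex =
            List.of(new AbortedTxn(12345L, 100L, 150L));

    // The mirrored log batch carries the *rewritten* producer ID.
    static boolean isBatchAborted(long batchProducerId, long batchBaseOffset) {
        return copiedIndex.stream().anyMatch(txn ->
                txn.producerId() == batchProducerId
                        && batchBaseOffset >= txn.firstOffset()
                        && batchBaseOffset <= txn.lastOffset());
    }

    public static void main(String[] args) {
        long rewrittenPid = Long.MAX_VALUE - 12345L - 1;
        // false: the abort entry never matches, so the consumer reads aborted data.
        System.out.println(isBatchAborted(rewrittenPid, 120L));
    }
}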

*VK1 - Thundering Herd and Memory Isolation:* I hear you on bandwidth
throttling, but we are dealing with two different bottlenecks. My worry is
not the network rate; it is heap allocation. If 50k partitions wake up and
fetcher threads eagerly grab the default 1MB buffer for each, that is an
instant 50GB RAM demand. Rate limits slow down the transfer, but they do
not stop threads from claiming buffer space upfront. Unlike MM2 where an
OOM just kills a worker, an OOM here takes down the whole broker. We need a
way to cap the number of concurrent active fetch buffers rather than just
relying on bytes per second.
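Here is the back-of-the-envelope math behind the concern, plus a sketch of
the kind of cap I have in mind. The permit count and the idea of acquiring
a permit before allocating a fetch buffer are mine; nothing here is in the
KIP today.

import java.util.concurrent.Semaphore;

// Rough illustration of the heap math and of capping concurrent fetch
// buffers rather than bytes/sec. The cap of 2,000 buffers is arbitrary.
public class FetchBufferCapSketch {
    static final int PARTITIONS_CATCHING_UP = 50_000;
    static final long FETCH_MAX_BYTES_PER_PARTITION = 1L << 20; // default 1MB

    // Hypothetical cap: at most N partition buffers in flight per broker.
    static final Semaphore bufferPermits = new Semaphore(2_000);

    static void fetchPartition(Runnable doFetch) throws InterruptedException {
        bufferPermits.acquire();      // block instead of allocating eagerly
        try {
            doFetch.run();            // allocate and fill the buffer only here
        } finally {
            bufferPermits.release();
        }
    }

    public static void main(String[] args) {
        long worstCaseBytes = PARTITIONS_CATCHING_UP * FETCH_MAX_BYTES_PER_PARTITION;
        System.out.printf("Uncapped worst case: %.1f GB of fetch buffers%n",
                worstCaseBytes / (1024.0 * 1024 * 1024));    // ~48.8 GB
        long cappedBytes = bufferPermits.availablePermits() * FETCH_MAX_BYTES_PER_PARTITION;
        System.out.printf("Capped worst case: %.1f GB%n",
                cappedBytes / (1024.0 * 1024 * 1024));        // ~2.0 GB
    }
}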

*VK3 - Control Plane Hotspots:* While the global Controller is safe (which
addresses my initial concern), we are essentially just shifting the load to
the __mirror_state topic. If a link flaps and 50k partitions transition
states, we will hammer whichever broker hosts the leader for that
coordinator topic. If __mirror_state has only 50 partitions and relies on
static hashing, a single broker gets crushed. Do we have sizing guidelines
or a round-robin partition assignment strategy to avoid this during an
outage?
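To show the skew I am worried about (an illustration of static hashing in
general, not the KIP's actual partitioner; the key formats are invented for
the example): if state records end up keyed per link rather than per
partition, all 50k transitions from one flap land on a single coordinator
partition, i.e., a single broker.

import java.util.HashMap;
import java.util.Map;

// Illustration of the hotspot concern: static hashing of state-record keys
// over a small coordinator topic. Not the KIP's actual assignment logic.
public class MirrorStateSkewSketch {
    static final int MIRROR_STATE_PARTITIONS = 50;

    static int partitionFor(String key) {
        // __consumer_offsets-style static hashing.
        return (key.hashCode() & 0x7fffffff) % MIRROR_STATE_PARTITIONS;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> load = new HashMap<>();
        // One link flap: 50k mirrored partitions flip state at once.
        for (int p = 0; p < 50_000; p++) {
            String key = "link-1/orders/" + p;   // hypothetical per-partition key
            load.merge(partitionFor(key), 1, Integer::sum);
        }
        System.out.println("Per-partition keys spread the burst across "
                + load.size() + " __mirror_state partitions");
        // A per-link key ("link-1") would send all 50k writes to ONE partition,
        // i.e., one broker's leader replica.
    }
}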

*VK12 - Offset Sync Race Condition and Data Loss:* The KIP notes that
synced offsets can exceed the destination log end offset temporarily. If a
failover happens during that window, consumers hit an OffsetOutOfRange
exception. The KIP's suggested mitigation is to set
auto.offset.reset=latest. However, doing this causes actual data loss,
because the consumer will skip from the log end offset to the newly
committed offset. We should instead cap synced offsets to
min(sourceCommitted, destinationLEO) during the sync process to prevent
data loss.
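A minimal sketch of what I mean; the method and variable names are
illustrative, not KIP APIs. The trade-off is a small amount of re-delivery
instead of an OffsetOutOfRange reset.

// Illustration of capping synced consumer offsets so a failover never lands
// a group beyond what has actually been mirrored. Names are illustrative.
public class OffsetSyncCapSketch {

    static long cappedSyncOffset(long sourceCommittedOffset, long destinationLogEndOffset) {
        // Never advertise an offset the destination log cannot serve yet.
        return Math.min(sourceCommittedOffset, destinationLogEndOffset);
    }

    public static void main(String[] args) {
        long sourceCommitted = 1_000_000L;   // committed on the source
        long destinationLeo = 999_500L;      // mirroring is 500 records behind
        // Without the cap: consumer seeks to 1,000,000 and hits OffsetOutOfRange.
        // With the cap: consumer resumes at 999,500 and re-reads at most 500 records.
        System.out.println(cappedSyncOffset(sourceCommitted, destinationLeo)); // 999500
    }
}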


*New Findings (Performance & Documentation)*
*VK9 - CRC Recalculation and Zero-Copy Loss:* I found a new performance hit
while tracing the PID rewrite for VK8. The CRC32C checksum covers the
Producer ID field, so rewriting the PID invalidates the checksum. This
forces the mirror fetcher to pull the full batch into user space, modify
the ID, and recalculate the CRC. It completely kills zero-copy sendfile
transfers and turns mirroring into a heavily CPU-bound workload. Does the
KIP throughput estimate factor in this extra CPU overhead?
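For context on where the cost comes from, here is a rough sketch of the
per-batch work. The field offsets are my reading of the v2 record batch
layout (DefaultRecordBatch), and the helper is illustrative rather than how
the mirror fetcher would actually be written.

import java.nio.ByteBuffer;
import java.util.zip.CRC32C;

// Rough sketch of why PID rewriting forces a copy plus a CRC pass per batch.
// Offsets are my reading of the v2 record batch layout, not a spec quote.
public class BatchCrcRewriteSketch {
    static final int CRC_OFFSET = 17;         // 4-byte CRC32C
    static final int ATTRIBUTES_OFFSET = 21;  // CRC covers from here to the end
    static final int PRODUCER_ID_OFFSET = 43; // 8-byte producer id

    static void rewritePidAndCrc(ByteBuffer batch, long newProducerId) {
        // 1. Mutate the producer id in place. The batch is already copied to
        //    user space, so sendfile/zero-copy is off the table for this data.
        batch.putLong(PRODUCER_ID_OFFSET, newProducerId);

        // 2. Recompute CRC32C over everything the checksum covers.
        CRC32C crc = new CRC32C();
        crc.update(batch.array(), batch.arrayOffset() + ATTRIBUTES_OFFSET,
                batch.limit() - ATTRIBUTES_OFFSET);

        // 3. Write the new checksum back into the batch header.
        batch.putInt(CRC_OFFSET, (int) crc.getValue());
    }
}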

*VK10 - Share and Consumer Group ID Collision:* Just a heads up on the
operational side here. The KIP notes that if a consumer group and share
group share a name, the offset commit fails with GroupIdNotFoundException.
Finding this out during a high-stress production failover is a terrible
operator experience. Can we add a pre-migration validation command to the
CLI to catch naming conflicts before mirroring even starts?
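Something as simple as the following check, run by the CLI before the mirror
link is created, would surface the conflict early. This is a sketch over
already-fetched group ID lists; I am deliberately not assuming any
particular Admin API here.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of a pre-migration validation step: fail fast if a consumer group
// and a share group share an ID. The group lists are assumed to be fetched
// elsewhere (e.g. via the Admin client) rather than hard-coded like this.
public class GroupIdCollisionCheck {

    static Set<String> findCollisions(List<String> consumerGroupIds, List<String> shareGroupIds) {
        Set<String> collisions = new HashSet<>(consumerGroupIds);
        collisions.retainAll(new HashSet<>(shareGroupIds));
        return collisions;
    }

    public static void main(String[] args) {
        List<String> consumerGroups = List.of("payments", "orders");
        List<String> shareGroups = List.of("payments", "audit");
        Set<String> collisions = findCollisions(consumerGroups, shareGroups);
        if (!collisions.isEmpty()) {
            // Surfaced at validation time instead of as GroupIdNotFoundException
            // during a production failover.
            System.err.println("Group ID collisions found: " + collisions);
        }
    }
}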

*VK13 - Epoch Handling Documentation:* Minor documentation note: the epoch
handling for the Partition Leader Epoch vs. the Producer Epoch is specified
in two separate sections. It might be worth consolidating them into one
section to avoid any implementation confusion down the line.

I know this is a massive list to digest, but I want to make sure we get
this architecture solid. Getting the PID logic, the TransactionIndex
rebuild, and the offset sync specified correctly feels like the biggest
priority right now to ensure data integrity.

Regards,
Viquar Khan
*Linkedin *-https://www.linkedin.com/in/vaquar-khan-b695577/
*Book *-
https://us.amazon.com/stores/Vaquar-Khan/author/B0DMJCG9W6?ref=ap_rdr&shoppingPortalEnabled=true
*GitBook*-https://vaquarkhan.github.io/microservices-recipes-a-free-gitbook/
*Stack *-https://stackoverflow.com/users/4812170/vaquar-khan
*github*-https://github.com/vaquarkhan

On Wed, 18 Feb 2026 at 11:34, Omnia Ibrahim <[email protected]> wrote:

> Hi Viquar,
> Thank you for taking the time to provide such detailed feedback on
> KIP-1279. I really appreciate your thorough review and the opportunity to
> clarify several aspects of the design. Let me address each of your points:
>
> VK1 - Thundering Herd and Memory Isolation
> I want to clarify the throttling mechanisms in the design, as I believe
> there may be some misunderstanding about how this compares to MM2 in
> production:
> Throttling is built-in: The KIP explicitly includes
> mirror.replication.throttled.rate (broker-level) configuration. This
> operate identically to intra-cluster replication throttling, which has
> proven stable at scale for years. We also planning to extend on same
> protocol in followup KIP for source throttling similar to
> leader.replication.throttled.rate.
> MM2 doesn't isolate impact: In production, MM2 acts as a noisy neighbor.
> During catch-up phases, it saturates network bandwidth, competes for broker
> disk I/O (source cluster), and creates backpressure on the destination
> cluster through produce requests. The "separate JVM" doesn't eliminate
> this—it just moves the memory pressure to a different process that still
> shares physical resources (network, disk).
> I'd be interested to hear if you have specific production experience with
> MM2 that differs from this characterization.
>
> VK2 - Blast Radius (Poison Pill)
> Great question about error isolation. The KIP uses Kafka's standard Fetch
> protocol, which already handles malformed batches safely:
> Thread pool isolation: As described in the MirrorFetcherManager section,
> mirror fetchers run in a dedicated thread pool
> (mirror.num.replica.fetchers), completely separate from intra-cluster
> replication threads.
> Existing precedent: When a ReplicaFetcherThread encounters a corrupt batch
> today, it marks that partition as failed and continues processing other
> partitions. The same exception handling applies to MirrorFetcherThread—the
> partition transitions to FAILED state while other partitions continue
> mirroring.
> No node-wide panic: Kafka brokers don't crash when individual fetch
> threads encounter errors. This is well-tested behavior across billions of
> partitions in production.
>
> Interestingly, the blast radius is actually smaller than MM2, where a
> malformed batch causes the entire Connect task to restart, potentially
> disrupting hundreds of partitions simultaneously.
>
> VK3 - Control Plane Saturation
> I think there may be some confusion about how state transitions work in
> the design. Let me clarify:
> State transitions during link flaps: Partition state changes (MIRRORING →
> FAILED → PREPARING → MIRRORING) are written to __mirror_state, not the
> controller's metadata log. This is a dedicated coordinator topic, analogous
> to __consumer_offsets. State flaps don't touch the controller.
> Metadata synchronization: The MirrorMetadataManager queries source cluster
> metadata on a configurable interval (mirror.metadata.refresh.interval.ms <
> http://mirror.metadata.refresh.interval.ms/>, default 30s). This is a
> background operation that doesn't generate huge controller write traffic.
> Controller interaction: The only controller interaction occurs during
> addTopicsToMirror/removeTopicsFromMirrorAPI calls, which are
> operator-driven administrative actions—not automatic responses to link
> flaps.
> Could you help me understand what specific 'metadata updates blocking ISR
> changes' scenario you're concerned about? I want to make sure the design
> explicitly addresses your use case.
>
> VK4 - Transactional Integrity
> This is an important point, and I think it highlights a common
> misunderstanding of Kafka's transactional protocol. Let me break down how
> it works:
> Consumer-side transaction visibility requires only log-level markers, not
> coordinator state.
> What consumers need: The Last Stable Offset (LSO) is computed from control
> markers (COMMIT/ABORT) in the log. Consumers reading with
> isolation.level=read_committed only see records up to the LSO. They never
> interact with the transaction coordinator or validate PIDs.
>
>
> What the KIP does:
> Mirrors all data records and control markers byte-for-byte
> During failover (STOPPING → STOPPED transition), truncates to the last
> mirrored LSO to ensure no incomplete transactions remain
> Topics are read-only until failover, so no new transactions can be started
> on the destination
> Why coordinator state is irrelevant:
> No producers write to mirrored topics (they're read-only)
> Control markers are replicated as data, not generated by the destination's
> coordinator
> After failover, producers reconnect with new PIDs and transaction IDs
> There are no 'zombie transactions' because incomplete transactions are
> truncated before the topic becomes writable. I've added additional
> clarification on this in the "Transactional Consumer Guarantees" section of
> the KIP.
>
> VK5 - Infinite Loop Prevention
> I believe this concern relates to MM2's active-active use case, which is
> outside the scope of this KIP. Let me explain why loops aren't possible in
> our design:
> Read-only enforcement: Mirrored topics cannot accept produce requests
> (they throw ReadOnlyTopicException). This physically prevents A→B→A loops
> because cluster B cannot write data back to topic A while it's being
> mirrored.
>  Failover is explicit: The removeTopicsFromMirror API is a deliberate
> operator action that makes a topic writable. At that point, the topic is no
> longer mirrored from A, so there's no loop.
> Bidirectional mirroring ≠ active-active: The KIP supports bidirectional
> mirroring of different topics (e.g., A mirrors topic-x from B, B mirrors
> topic-y from A). Same-topic loops are impossible due to read-only
> semantics, so you can't setup a mirror on an actively mirrored topic.
> As stated in the 'Active-Active Writes' section, this KIP doesn't attempt
> to replace MM2 for multi-master scenarios. It's focused on
> DR/failover/migration where operational simplicity is paramount.
> MM2's header-based loop detection is necessary precisely because it allows
> active-active writes. That complexity is a feature of MM2, not a
> requirement for all replication systems.
>
> VK6 - Data Divergence and Epoch Reconciliation
> The KIP explicitly documents this limitation in the 'Non-Goals: Unclean
> Leader Election' section. This is a conscious design decision, not an
> oversight:
> Clear documentation: The KIP states that when
> unclean.leader.election.enable=true, brokers log a warning at every sync
> cycle. Operators are explicitly informed this configuration is unsupported.
> Shared epoch requirement: Resolving unclean elections across clusters
> requires synchronous cross-cluster communication to establish a shared
> leader epoch. This introduces cross-datacenter latency into the critical
> path of leader elections, which contradicts the asynchronous design
> principle.
> Alternative for zero data loss: Operators who require protection against
> unclean elections should either:
> Disable unclean leader election (best practice for mission-critical data).
> Wait for the follow-up KIP on this.
> Use stretched clusters (though as noted, these provide no DR protection).
>
>
> Fallback behavior: During failback, if the source cluster experienced an
> unclean election, the LastMirroredOffset API will detect the divergence.
> The operator must choose to either truncate to the last known good offset
> (accepting data loss of post-divergence records) or manually reconcile the
> logs.
>
> This is no worse than MM2, which has no mechanism to detect or resolve log
> divergence from unclean elections.
>
> VK7 - Tiered Storage Operational Gaps
> You're right that a roadmap would be helpful here. The KIP explicitly
> states: 'Tiered Storage is not initially supported, but a detailed design
> will be provided in a follow-up KIP.'
> This is intentional phasing, not an oversight:
> Complexity separation: Tiered storage support requires changes to fetch
> semantics (fetching from remote storage instead of local logs) and offset
> management. Designing this correctly requires dedicated focus.
> Current workaround: For clusters using tiered storage today, operators can:
> Mirror only recent data by configuring source cluster retention to keep
> data in local storage for the required retention period.
> Use the FAILED state as a signal to manually intervene (e.g., temporarily
> increase source retention).
> Roadmap commitment: The 'Future Work' section explicitly lists tiered
> storage as a follow-up. This is standard KIP practice—core functionality
> first, extensions later.
>
>
> Comparison: MM2 also doesn't have native tiered storage integration. It
> fetches from brokers, which serve data from local storage (reading from
> remote storage when needed). However, MM2 doesn't guarantee the destination
> topic is also tiered—it's the same limitation here.
>
> VK8 - Transactional State and PID Mapping
> This relates back to VK4. Let me restate the transactional protocol more
> explicitly to clarify:
> Three independent components:
> Transaction Coordinator (source cluster):
> Tracks active transactions via __transaction_state.
> Times out hanging transactions.
> Only matters for producers writing to the source cluster.
> Log-level markers (mirrored):
> COMMIT/ABORT markers are appended to topic partitions.
> These determine the LSO.
> This is what gets replicated byte-for-byte.
> Consumer read isolation (destination cluster):
> Consumers read up to the LSO based on markers in the log.
> Never queries the transaction coordinator.
>
>
> Why destination doesn't need coordinator state:
> Mirrored topics are read-only. No producers write to them, so the
> destination's transaction coordinator never gets involved.
> When failover occurs, the KIP truncates to the LSO, removing any records
> without markers.
> After failover, producers reconnect to the destination cluster with new
> producer IDs and transaction IDs. The destination's transaction coordinator
> manages these new transactions normally.
> PID rewriting (-(sourceProducerId + 2)) is purely to avoid ID conflicts.
> It doesn't affect transactional semantics because:
> The LSO calculation uses markers, not PIDs.
> Consumers validate transaction state via markers, not PIDs.
> There is no zombie transaction risk because the destination coordinator is
> never responsible for transactions that originated on the source cluster.
>
>
>
> Updates to the KIP
> To help address these questions, I've updated the KIP with:
> A detailed comparison table between MM2 and KIP-1279 (in the Rejected
> Alternatives section)
> Additional clarification in the "Transactional Consumer Guarantees" section
> I hope this addresses your concerns! I'm happy to discuss any of these
> points further or hop on a call if that would be helpful. Your feedback is
> helping make this KIP stronger.
>
> Best,
> Omnia
>
> > On 14 Feb 2026, at 20:37, vaquar khan <[email protected]> wrote:
> >
> > Hi Fede,
> >
> > I reviewed the KIP-1279 proposal yesterday and corrected the KIP number.
> I
> > now have time to share my very detailed observations. While I fully
> support
> > the goal of removing the operational complexity of Kafka , the design
> > appears to trade that complexity for broker stability.
> >
> > By moving WAN replication into the broker’s core runtime, we are
> > effectively removing the failure domain isolation that MirrorMaker 2
> > provides. We risk coupling the stability of our production clusters to
> the
> > instability of cross-datacenter networks.Before this KIP moves to a
> vote, I
> > strongly recommend you and other authors to address the following
> stability
> > gaps. Without concrete answers here, the risk profile is likely too high
> > for mission-critical deployments.
> >
> > 1. The Thundering Herd and Memory Isolation Risk
> > In the current architecture, MirrorMaker 2 (MM2) Connect workers provide
> a
> > physical failure domain through a separate JVM heap. This isolates the
> > broker from the memory pressure and Garbage Collection (GC) impact caused
> > by replication surges. In this proposal, that pressure hits the broker’s
> > core runtime directly.
> >
> > The Gap: We need simulation data for a sustained link outage (e.g., 6
> hours
> > on 10Gbps). When 5,000 partitions resume fetching, does the resulting
> > backfill I/O and heap pressure cause GC pauses that push P99 Produce
> > latency on the target cluster over 10ms? We must ensure that a massive
> > catch-up phase does not starve the broker's Request Handler threads or
> > destabilize the JVM.
> >
> >
> > 2. Blast Radius (Poison Pill  Problem)
> > The Gap: If a source broker sends a malformed batch (e.g., bit rot), does
> > it crash the entire broker process? In MM2, this kills a single task. We
> > need confirmation that exceptions are isolated to the replication thread
> > pool and will not trigger a node-wide panic.
> >
> > 3. Control Plane Saturation
> > The Gap: How does the system handle a "link flap" event where 50,000
> > partitions transition states rapidly? We need to verify that the
> resulting
> > flood of metadata updates will not block the Controller from processing
> > critical ISR changes for local topics.
> >
> > 4. Transactional Integrity
> > "Byte-for-byte" replication copies transaction markers but not the
> > Coordinator’s state (PIDs).
> > The Gap: How does the destination broker validate an aborted transaction
> > without the source PID? We should avoid creating "zombie" transactions
> that
> > look valid but cannot be authoritatively managed.
> >
> > 5. Infinite Loop Prevention
> > Since byte-for-byte precludes injecting lineage headers e.g., dc-source,
> we
> > lose the standard mechanism for detecting loops in mesh topologies
> (A→B→A).
> > The Gap: Relying solely on topic naming conventions is operationally
> > fragile. What is the deterministic mechanism to prevent infinite
> recursion?
> >
> > 6. Data Divergence and Epoch Reconciliation
> > The current proposal explicitly excludes support for unclean leader
> > election because there is no mechanism for a "shared leader epoch"
> between
> > clusters.
> > The Gap: Without epoch reconciliation, if the source cluster experiences
> an
> > unclean election, the source and destination logs will diverge. If an
> > operator later attempts a failback (reverse mirroring), the clusters will
> > contain inconsistent data for the same offset, leading to potential
> silent
> > data corruption or permanent replication failure.
> >
> > 7. Tiered Storage Operational Gaps
> > The design states that Tiered Storage is not initially supported and
> that a
> > mirror follower encountering an OffsetMovedToTieredStorageException will
> > simply mark the partition as FAILED.
> > The Gap: For mission-critical clusters using Tiered Storage for long-term
> > retention, this creates an operational cliff. Mirroring will fail as soon
> > as the source cluster offloads data to remote storage. We need a roadmap
> > for how native mirroring will eventually interact with tiered segments
> > without failing the partition.
> >
> > 8. Transactional State and PID Mapping
> > While the KIP proposes a deterministic formula for rewriting Producer IDs
> > ,calculated as destinationProducerId= (sourceProducerId+2) it does not
> > replicate the transaction_state metadata.
> > The Gap: How does the destination broker authoritatively validate or
> expire
> > hanging transactions if the source PID state is rewritten but the
> > transaction coordinator state is missing?
> > We risk a scenario where consumers encounter zombie transactions that can
> > never be decided on the destination cluster.
> >
> > This is a big change to how our system is built. We need to make sure it
> > doesn't create a weak link that could bring the whole system down,We
> should
> > ensure it does not introduce a new single point of failure.
> >
> > Regards,
> > Viquar Khan
> > *Linkedin *-https://www.linkedin.com/in/vaquar-khan-b695577/
> > *Book *-
> >
> https://us.amazon.com/stores/Vaquar-Khan/author/B0DMJCG9W6?ref=ap_rdr&shoppingPortalEnabled=true
> > *GitBook*-
> https://vaquarkhan.github.io/microservices-recipes-a-free-gitbook/
> > *Stack *-https://stackoverflow.com/users/4812170/vaquar-khan
> > *github*-https://github.com/vaquarkhan
> >
> > On Sat, 14 Feb 2026 at 01:18, Federico Valeri <[email protected]>
> wrote:
> >
> >> Hi, we would like to start a discussion thread about KIP-1279: Cluster
> >> Mirroring.
> >>
> >> Cluster Mirroring is a new Kafka feature that enables native,
> >> broker-level topic replication across clusters. Unlike MirrorMaker 2
> >> (which runs as an external Connect-based tool), Cluster Mirroring is
> >> built into the broker itself, allowing tighter integration with the
> >> controller, coordinator, and partition lifecycle.
> >>
> >>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1279%3A+Cluster+Mirroring
> >>
> >> There are a few missing bits, but most of the design is there, so we
> >> think it is the right time to involve the community and get feedback.
> >> Please help validating our approach.
> >>
> >> Thanks
> >> Fede
> >>
>
>
