When the leader decides to commit a TX (of X msgs, a count known at this point), it writes an "intent to append X msgs" msg (a control msg?) followed by the X msgs. At this point it is the leader and therefore the point of sync, so this can be done with no "foreign" msgs in between. If there's a crash or change of leadership, the new leader can just roll back (remove whatever partial contents it has) if it sees the "intent" msg but doesn't see X msgs belonging to the TX after it. The watermark does not advance into the middle of a TX, so nothing is visible to any consumer until the whole thing is committed and replicated (or it crashes and is rolled back). Which means I don't think TX storage needs replication, and atomicity to consumers is retained.
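
To make the idea concrete, here is a rough sketch of the intent-marker scheme over a toy in-memory log. It is not Kafka code: `Record`, `append_tx`, `recover` and `visible_watermark` are all hypothetical names, and it assumes the leader writes each TX contiguously, so only the tail of the log can ever hold a partial TX.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Record:
    kind: str             # "intent" (control msg) or "data"
    tx_id: Optional[int]  # None for non-transactional msgs
    payload: object       # msg count for "intent", msg body for "data"


def append_tx(log: List[Record], tx_id: int, msgs: list) -> None:
    """Leader side: write the intent marker, then the X msgs back to back."""
    log.append(Record("intent", tx_id, len(msgs)))
    for m in msgs:
        log.append(Record("data", tx_id, m))


def recover(log: List[Record]) -> None:
    """New leader: if the last intent is not followed by all of its msgs,
    truncate the log back to (and including) that intent."""
    for i in range(len(log) - 1, -1, -1):
        if log[i].kind == "intent":
            expected = log[i].payload
            tail = log[i + 1 : i + 1 + expected]
            complete = len(tail) == expected and all(
                r.kind == "data" and r.tx_id == log[i].tx_id for r in tail
            )
            if not complete:
                del log[i:]  # roll back the partial TX
            return  # earlier TXs were written contiguously, so they are whole


def visible_watermark(log: List[Record], replicated_upto: int) -> int:
    """Number of records consumers may read: never stops inside a TX.
    `replicated_upto` is the count of records known to be replicated."""
    hw = 0
    i = 0
    while i < replicated_upto and i < len(log):
        r = log[i]
        if r.kind == "intent":
            end = i + 1 + r.payload  # first offset after the whole TX
            if end > replicated_upto or end > len(log):
                break  # TX not fully replicated yet: stay before the intent
            i = end
        else:
            i += 1
        hw = i
    return hw
```

With one plain msg followed by a 3-msg TX, the watermark stays at 1 until all five records are replicated, and a replica that only got the intent plus two msgs truncates back to length 1 on recovery.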
I can't argue with the latency argument, but:
1. If TXs can be done in-mem, maybe a TX per msg isn't that expensive?
2. I think a logical clock approach (with broker-side dedup based on the clock) could provide the same exactly-once semantics without requiring transactions at all?

However, I concede that as you describe it (long-running TXs where commits are actually "checkpoints" spaced to optimize overhead vs RPO/RTO) you would require read uncommitted to minimize latency.

On Tue, Dec 20, 2016 at 1:24 PM, Apurva Mehta <apu...@confluent.io> wrote:

> durably at the moment we enter the pre-commit phase. If we
> don't have durable persistence of these messages, we can't have idempotent
> and atomic copying into the main log, and your proposal to date does not
> show otherwise.
>