Thanks Dimitry and Jon, answers below

> 1) Is a single separate commit log expected to be created for all tables with
> the new replication type?

The plan is to still have a single commit log, but only index mutations with a mutation id.

> 2) What is a granularity of storing mutation ids in memtable, is it per cell?

It would be per-partition.

> 3) If we update the same row multiple times while it is in a memtable - are
> all mutation ids appended to a kind of collection?

They would, yes. We might be able to do something where we stop tracking mutations that have been superseded by newer mutations (same cells, higher timestamps), but I suspect that would be more trouble than it's worth and would be out of scope for v1.

> 4) What is the expected size of a single id?

It's currently 12 bytes: a 4-byte node id (from TCM) and an 8-byte HLC.

> 5) Do we plan to support multi-table batches (single or multi-partition) for
> this replication type?

This is intended to support all existing features. However, the tracking only happens at the mutation level, so the different mutations coming out of a multi-partition batch would all be tracked individually.

> So even without repair mucking things up, we're unable to fulfill this
> promise except under the specific, ideal circumstance of querying a partition
> with only 1 page.

It's true that we can't offer multi-page write atomicity without some sort of MVCC. There are a lot of common query patterns that don't involve paging though, so it's not like the benefit of fixing write atomicity would only apply to a small subset of carefully crafted queries or something.

Thanks,

Blake

> On Jan 8, 2025, at 12:23 PM, Jon Haddad <j...@rustyrazorblade.com> wrote:
>
> Very cool! I'll need to spend some time reading this over. One thing I did
> notice is this:
>
> > Cassandra promises partition level write atomicity. This means that,
> > although writes are eventually consistent, a given write will either be
> > visible or not visible. You're not supposed to see a partially applied
> > write. However, read repair and short read protection can both "tear"
> > mutations.
> > In the case of read repair, this is because the data resolver
> > only evaluates the data included in the client read. So if your read only
> > covers a portion of a write that didn't reach a quorum, only that portion
> > will be repaired, breaking write atomicity.
>
> Unfortunately there's more issues with this than just repair. Since we lack
> a consistency mechanism like MVCC while paginating, it's possible to do the
> following:
>
> thread A: reads a partition P with 10K rows, starts by reading the first page
> thread B: another thread writes a batch to 2 rows in partition P, one on page
> 1, another on page 2
> thread A: reads the second page of P which has the mutation.
>
> I've worked with users who have been surprised by this behavior, because
> pagination happens transparently.
>
> So even without repair mucking things up, we're unable to fulfill this
> promise except under the specific, ideal circumstance of querying a partition
> with only 1 page.
>
> Jon
>
> On Wed, Jan 8, 2025 at 11:21 AM Blake Eggleston <beggles...@apple.com> wrote:
>> Hello dev@,
>>
>> We'd like to propose CEP-45: Mutation Tracking for adoption by the
>> community. CEP-45 proposes adding a replication mechanism to track and
>> reconcile individual mutations, as well as processes to actively reconcile
>> missing mutations.
>>
>> For keyspaces with mutation tracking enabled, the immediate benefits of this
>> CEP are:
>> * reduced replication lag with a continuous background reconciliation process
>> * eliminate the disk load caused by repair merkle tree calculation
>> * eliminate repair overstreaming
>> * reduce disk load of reads on cluster to close to 1/CL
>> * fix longstanding mutation atomicity issues caused by read repair and short
>>   read protection
>>
>> Additionally, although it's outside the scope of this CEP, mutation tracking
>> would enable:
>> * completion of witness replicas / transient replication, making the feature
>>   usable for all workloads
>> * lightweight witness only datacenters
>>
>> The CEP is linked here:
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking,
>> but please keep the discussion on the dev list.
>>
>> Thanks!
>>
>> Blake Eggleston
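
For readers following the answers at the top of this thread, here is a rough sketch of what a 12-byte mutation id (a 4-byte node id plus an 8-byte HLC) and per-partition tracking of ids could look like. It is only an illustration under assumptions: the class names `MutationId` and `PartitionMutationIds`, the field order in the serialized form, and the use of a string partition key are all hypothetical and are not taken from the CEP or the Cassandra codebase.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical id type: 4-byte node id + 8-byte HLC = 12 bytes.
// The byte order and field order here are assumptions for illustration.
final class MutationId {
    static final int SERIALIZED_SIZE = Integer.BYTES + Long.BYTES; // 12

    final int nodeId; // node id (assigned via TCM, per the thread)
    final long hlc;   // hybrid logical clock value

    MutationId(int nodeId, long hlc) {
        this.nodeId = nodeId;
        this.hlc = hlc;
    }

    // Writes the id into a fresh 12-byte buffer, ready for reading.
    ByteBuffer serialize() {
        ByteBuffer buf = ByteBuffer.allocate(SERIALIZED_SIZE);
        buf.putInt(nodeId).putLong(hlc);
        buf.flip();
        return buf;
    }

    // Reads an id back from a buffer positioned at its first byte.
    static MutationId deserialize(ByteBuffer buf) {
        int nodeId = buf.getInt();
        long hlc = buf.getLong();
        return new MutationId(nodeId, hlc);
    }
}

// Hypothetical per-partition tracker: every mutation applied to a
// partition appends its id, with no pruning of superseded mutations.
final class PartitionMutationIds {
    private final Map<String, List<MutationId>> idsByPartition = new HashMap<>();

    void record(String partitionKey, MutationId id) {
        idsByPartition.computeIfAbsent(partitionKey, k -> new ArrayList<>()).add(id);
    }

    // Returns a snapshot copy of the ids recorded for the partition.
    List<MutationId> idsFor(String partitionKey) {
        List<MutationId> ids = idsByPartition.get(partitionKey);
        return ids == null ? Collections.<MutationId>emptyList() : new ArrayList<>(ids);
    }
}
```

Updating the same partition twice would leave two ids in its list, matching the "They would, yes" answer to question 3. A real memtable would presumably key on a token or decorated key rather than a string.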