Thanks Dimitry and Jon, answers below

> 1) Is a single separate commit log expected to be created for all tables with 
> the new replication type?

The plan is to still have a single commit log, but only index mutations with a 
mutation id. 

> 2) What is a granularity of storing mutation ids in memtable, is it per cell?

It would be per-partition.

> 3) If we update the same row multiple times while it is in a memtable - are 
> all mutation ids appended to a kind of collection?

They would, yes. We might be able to do something where we stop tracking 
mutations that have been superseded by newer mutations (same cells, higher 
timestamps), but I suspect that would be more trouble than it's worth and would 
be out of scope for v1.
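
To make the per-partition granularity concrete, here is a rough sketch of the idea, not the actual memtable code. The class and method names are hypothetical, and ids are shown as plain (node id, hlc) tuples for readability:

```python
from collections import defaultdict


class MemtableIdIndex:
    """Hypothetical per-partition index of mutation ids (illustration only)."""

    def __init__(self):
        # partition key -> ids of every tracked mutation applied to that partition
        self._ids = defaultdict(list)

    def record(self, partition_key, mutation_id):
        # Ids are appended, never replaced, even if the write touches the same row.
        self._ids[partition_key].append(mutation_id)

    def ids_for(self, partition_key):
        # Return a copy so callers cannot modify the index.
        return list(self._ids.get(partition_key, []))


index = MemtableIdIndex()
index.record("p1", (1, 100))  # first write to a row in p1
index.record("p1", (1, 101))  # second write to the same row
index.record("p2", (2, 50))
```

Here `index.ids_for("p1")` holds both ids, which is the "collection" behavior described above; pruning superseded ids would be the optional optimization.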

> 4) What is the expected size of a single id?

It's currently 12 bytes: a 4-byte node id (from TCM) and an 8-byte HLC.

> 5) Do we plan to support multi-table batches (single or multi-partition) for 
> this replication type?

This is intended to support all existing features, however the tracking only 
happens at the mutation level, so the different mutations coming out of a 
multi-partition batch would all be tracked individually.
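
Roughly, and only as an illustration: a batch gets split into one mutation per partition, and each of those gets its own id. The function name, the grouping key, and the counter standing in for the HLC are all hypothetical.

```python
from itertools import count


def track_batch(writes, node_id, hlc_source):
    """Group a batch's writes by partition key and give each group its own id."""
    mutations = {}
    for table, partition_key, row in writes:
        # Assumption: one mutation per partition key, possibly spanning tables.
        mutations.setdefault(partition_key, []).append((table, row))
    # One id per resulting mutation, not one id for the whole batch.
    return {
        key: ((node_id, next(hlc_source)), rows)
        for key, rows in mutations.items()
    }


batch = [
    ("users", "alice", {"email": "a@example.com"}),
    ("users", "bob", {"email": "b@example.com"}),
    ("profiles", "alice", {"name": "Alice"}),
]
tracked = track_batch(batch, node_id=1, hlc_source=count(1000))
```

The two resulting mutations (for "alice" and "bob") carry different ids and would be reconciled independently.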

> So even without repair mucking things up, we're unable to fulfill this 
> promise except under the specific, ideal circumstance of querying a partition 
> with only 1 page.


It's true that we can't offer multi-page write atomicity without some sort of 
MVCC. There are a lot of common query patterns that don't involve paging 
though, so it's not like the benefit of fixing write atomicity would only apply 
to a small subset of carefully crafted queries or something.
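
For anyone following along, the paging case Jon describes can be reproduced with a toy model. This is a plain in-memory dict, not Cassandra, and the helper name is made up:

```python
def read_page(partition, start, page_size):
    """Return a snapshot of one page of rows, as a reader without MVCC sees it."""
    keys = sorted(partition)[start:start + page_size]
    return {k: partition[k] for k in keys}


# Partition with 4 rows, paged 2 rows at a time; every row starts as "old".
partition = {row: "old" for row in range(4)}

page1 = read_page(partition, 0, 2)  # reader fetches page 1

# Writer applies one batch touching row 0 (page 1) and row 2 (page 2).
partition[0] = "new"
partition[2] = "new"

page2 = read_page(partition, 2, 2)  # reader fetches page 2

# The combined result contains only half of the batch.
result = {**page1, **page2}
```

The reader ends up with row 0 from before the batch and row 2 from after it. A query that fits in a single page never hits this window, which is the case mutation tracking can make atomic.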

Thanks,

Blake

> On Jan 8, 2025, at 12:23 PM, Jon Haddad <j...@rustyrazorblade.com> wrote:
> 
> Very cool!  I'll need to spend some time reading this over.  One thing I did 
> notice is this:
> 
> > Cassandra promises partition level write atomicity. This means that, 
> > although writes are eventually consistent, a given write will either be 
> > visible or not visible. You're not supposed to see a partially applied 
> > write. However, read repair and short read protection can both "tear" 
> > mutations. In the case of read repair, this is because the data resolver 
> > only evaluates the data included in the client read. So if your read only 
> > covers a portion of a write that didn't reach a quorum, only that portion 
> > will be repaired, breaking write atomicity.
> 
> Unfortunately there are more issues with this than just repair.  Since we lack 
> a consistency mechanism like MVCC while paginating, it's possible to do the 
> following:
> 
> thread A: reads a partition P with 10K rows, starts by reading the first page
> thread B: another thread writes a batch to 2 rows in partition P, one on page 
> 1, another on page 2
> thread A: reads the second page of P which has the mutation.
> 
> I've worked with users who have been surprised by this behavior, because 
> pagination happens transparently.
> 
> So even without repair mucking things up, we're unable to fulfill this 
> promise except under the specific, ideal circumstance of querying a partition 
> with only 1 page.
> 
> Jon
> 
> On Wed, Jan 8, 2025 at 11:21 AM Blake Eggleston <beggles...@apple.com> wrote:
>> Hello dev@,
>> 
>> We'd like to propose CEP-45: Mutation Tracking for adoption by the 
>> community. CEP-45 proposes adding a replication mechanism to track and 
>> reconcile individual mutations, as well as processes to actively reconcile 
>> missing mutations.
>> 
>> For keyspaces with mutation tracking enabled, the immediate benefits of this 
>> CEP are:
>> * reduced replication lag with a continuous background reconciliation process
>> * eliminate the disk load caused by repair merkle tree calculation
>> * eliminate repair overstreaming
>> * reduce disk load of reads on cluster to close to 1/CL
>> * fix longstanding mutation atomicity issues caused by read repair and short 
>> read protection
>> 
>> Additionally, although it's outside the scope of this CEP, mutation tracking 
>> would enable:
>> * completion of witness replicas / transient replication, making the feature 
>> usable for all workloads
>> * lightweight witness only datacenters
>> 
>> The CEP is linked here: 
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking,
>>  but please keep the discussion on the dev list.
>> 
>> Thanks!
>> 
>> Blake Eggleston
