Your pagination case is not a violation of any guarantees Cassandra makes. It has never made guarantees across multiple queries.
Trying to have MVCC/consistent data across multiple queries is a very different problem from the one this CEP addresses. If you want to have a discussion about MVCC, I suggest creating a new thread.

-Jeremiah

On Jan 8, 2025, at 3:47 PM, Jon Haddad <j...@rustyrazorblade.com> wrote:


> It's true that we can't offer multi-page write atomicity without some sort of MVCC. There are a lot of common query patterns that don't involve paging though, so it's not like the benefit of fixing write atomicity would only apply to a small subset of carefully crafted queries or something.

Sure, it'll work a lot, but we don't say "partition level write atomicity some of the time".  We say guarantee.  From the CEP:

> In the case of read repair, since we are only reading and correcting the parts of a partition that we're reading and not the entire contents of a partition on each read, read repair can break our guarantee on partition level write atomicity. This approach also prevents meeting the monotonic read requirement for witness replicas, which has significantly limited its usefulness.

I point this out because it's not well known, and we make a guarantee that isn't true. While the CEP will reduce the number of cases in which we violate the guarantee, we will still have known edge cases where it doesn't hold up. So we should stop saying it.




On Wed, Jan 8, 2025 at 1:30 PM Blake Eggleston <beggles...@apple.com> wrote:
Thanks Dimitry and Jon, answers below

1) Is a single separate commit log expected to be created for all tables with the new replication type?

The plan is to still have a single commit log, but only index mutations with a mutation id. 

2) What is the granularity of storing mutation ids in the memtable? Is it per cell?

It would be per-partition.

3) If we update the same row multiple times while it is in a memtable - are all mutation ids appended to a kind of collection?

They would, yes. We might be able to do something where we stop tracking mutations that have been superseded by newer mutations (same cells, higher timestamps), but I suspect that would be more trouble than it's worth and would be out of scope for v1.

4) What is the expected size of a single id?

It's currently 12 bytes: a 4-byte node id (from TCM) and an 8-byte HLC.

5) Do we plan to support multi-table batches (single or multi-partition) for this replication type?

This is intended to support all existing features. However, the tracking only happens at the mutation level, so the different mutations coming out of a multi-partition batch would all be tracked individually.

> So even without repair mucking things up, we're unable to fulfill this promise except under the specific, ideal circumstance of querying a partition with only 1 page.

It's true that we can't offer multi-page write atomicity without some sort of MVCC. There are a lot of common query patterns that don't involve paging though, so it's not like the benefit of fixing write atomicity would only apply to a small subset of carefully crafted queries or something.

Thanks,

Blake

On Jan 8, 2025, at 12:23 PM, Jon Haddad <j...@rustyrazorblade.com> wrote:

Very cool!  I'll need to spend some time reading this over.  One thing I did notice is this:

> Cassandra promises partition level write atomicity. This means that, although writes are eventually consistent, a given write will either be visible or not visible. You're not supposed to see a partially applied write. However, read repair and short read protection can both "tear" mutations. In the case of read repair, this is because the data resolver only evaluates the data included in the client read. So if your read only covers a portion of a write that didn't reach a quorum, only that portion will be repaired, breaking write atomicity.
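
The read repair case quoted above can be reproduced in a toy model. This is a sketch only: the replicas are plain dicts of row key to (timestamp, value), and `read_repair` is a hypothetical stand-in, not Cassandra's data resolver.

```python
# Toy model: each replica maps row key -> (timestamp, value).
def read_repair(replicas, keys):
    """Resolve only the requested keys and write the winners back to every replica."""
    result = {}
    for k in keys:
        # Highest timestamp wins, as in last-write-wins reconciliation.
        result[k] = max(r[k] for r in replicas if k in r)
    for r in replicas:
        for k, v in result.items():
            r[k] = v
    return result

replica_a = {1: (1, "old"), 2: (1, "old")}
replica_b = {1: (1, "old"), 2: (1, "old")}

# A single write to rows 1 and 2 reaches only replica A.
replica_a.update({1: (2, "new"), 2: (2, "new")})

# A read covering only row 1 repairs only row 1 on replica B.
read_repair([replica_a, replica_b], keys=[1])

# replica_b now holds half of the write: row 1 is new, row 2 is still old.
```

A later read served by `replica_b` alone would see the write partially applied, which is the "tearing" the CEP text describes.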

Unfortunately, there are more issues with this than just repair.  Since we lack a consistency mechanism like MVCC while paginating, it's possible to do the following:

thread A: reads a partition P with 10K rows, starts by reading the first page
thread B: another thread writes a batch to 2 rows in partition P, one on page 1, another on page 2
thread A: reads the second page of P which has the mutation.
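
The sequence above can be sketched in a few lines. This is a toy, single-process model: the partition is a dict, and `read_page` and `PAGE_SIZE` are hypothetical names, not driver or server APIs.

```python
# Toy model: one partition as a dict of row key -> value.
PAGE_SIZE = 5

def read_page(partition, start, size=PAGE_SIZE):
    """Return rows [start, start + size) in key order, as they are right now."""
    keys = sorted(partition)[start:start + size]
    return {k: partition[k] for k in keys}

partition = {k: "old" for k in range(10)}

# Thread A reads page 1 (rows 0-4).
page1 = read_page(partition, 0)

# Thread B applies one batch touching a row on each page.
partition.update({2: "new", 7: "new"})

# Thread A reads page 2 (rows 5-9) against the updated partition.
page2 = read_page(partition, 5)

# Thread A's combined view contains only half of thread B's batch.
seen = {**page1, **page2}
torn = seen[2] == "old" and seen[7] == "new"
```

Each page is internally consistent; only the combination across pages is torn, because nothing pins both reads to the same snapshot.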

I've worked with users who have been surprised by this behavior, because pagination happens transparently.

So even without repair mucking things up, we're unable to fulfill this promise except under the specific, ideal circumstance of querying a partition with only 1 page.

Jon





On Wed, Jan 8, 2025 at 11:21 AM Blake Eggleston <beggles...@apple.com> wrote:
Hello dev@,

We'd like to propose CEP-45: Mutation Tracking for adoption by the community. CEP-45 proposes adding a replication mechanism to track and reconcile individual mutations, as well as processes to actively reconcile missing mutations.

For keyspaces with mutation tracking enabled, the immediate benefits of this CEP are:
* reduced replication lag with a continuous background reconciliation process
* eliminate the disk load caused by repair merkle tree calculation
* eliminate repair overstreaming
* reduce disk load of reads on cluster to close to 1/CL
* fix longstanding mutation atomicity issues caused by read repair and short read protection

Additionally, although it's outside the scope of this CEP, mutation tracking would enable:
* completion of witness replicas / transient replication, making the feature usable for all workloads
* lightweight witness only datacenters

The CEP is linked here: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking, but please keep the discussion on the dev list.

Thanks!

Blake Eggleston
