> The other issue is that there isn’t a time bound on the paging payload, so if the application is taking long enough between pages that the log has been truncated, we’d have to throw an exception.

My hot take is that this relationship between how long you're taking to page, how much data you're processing / getting back, and ingest / flushing frequency all combining to produce unpredictable exceptions would be a bad default from a UX perspective compared to a default of "a single page of data has atomicity; multiple pages do not". Maybe it's just because that's been our default for so long.

The simplicity of having a flag that's "don't make my pages atomic and they always return" vs. "make my pages atomic and throw exceptions if the metadata I need is yoinked while I page" is pretty attractive to me.
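If I sketch what I mean in code (all of these names are invented for illustration; nothing here is from the CEP), it'd be something like:

    // Hypothetical paging-atomicity modes; names are illustrative only.
    enum PagingAtomicity {
        PAGE_LEVEL, // today's default: each page is internally atomic, the scan is not
        SCAN_LEVEL  // pin the whole scan to a mutation-id high water mark; may throw
    }

    class PagingSnapshotExpiredException extends RuntimeException {
        PagingSnapshotExpiredException(String message) { super(message); }
    }

    interface PagingState {
        long highWaterMark();        // highest mutation id seen at the last page boundary
        boolean logCovers(long hwm); // can the mutation log still reconstruct that point?
    }

    final class PagedRead {
        private final PagingAtomicity mode;

        PagedRead(PagingAtomicity mode) { this.mode = mode; }

        // Server-side decision when the client asks for the next page.
        void checkSnapshot(PagingState state) {
            if (mode == PagingAtomicity.SCAN_LEVEL && !state.logCovers(state.highWaterMark())) {
                // The application paused long enough that the log was truncated:
                // fail loudly instead of silently tearing the multi-page result.
                throw new PagingSnapshotExpiredException(
                    "mutation log no longer covers this scan's high water mark");
            }
            // PAGE_LEVEL mode never throws here; it just reads whatever is current.
        }
    }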
Really interesting thought, using these logs as "partial MVCC" while they're available, specifically for what could/should be a very tight-timeline use case (paging).

On Thu, Jan 16, 2025, at 12:41 PM, Jake Luciani wrote:
> This is very cool!
>
> I have done a POC that was similar but more akin to the Aurora paper, whereby the commitlog would proactively repair itself from peers using the seekable commitlog.
>
> Can you explain the reason you prefer to reconcile on read? Having a consistent commitlog would solve so many problems like CDC, PITR, MVs, etc.
>
> Jake
>
> On Thu, Jan 16, 2025 at 12:13 PM Blake Eggleston <beggles...@apple.com> wrote:
> >
> > I’ve been thinking about the paging atomicity issue. I think it could be fixed with mutation tracking and without having to support full-on MVCC.
> >
> > When we reach a page boundary, we can send the highest mutation id we’ve seen for the partition we reached the paging boundary on. When we request another page, we send that high water mark back as part of the paging request.
> >
> > Each sstable and memtable contributing to the read responses will know which mutations it has in each partition, so if we encounter one that has a higher id than we saw in the last page, we reconstitute its data from mutations in the log, excluding the newer mutations, or exclude it entirely if it only has newer mutations.
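> > Sketching that in code, roughly (every name here is invented for the sketch; the real plumbing would live in the read path rather than in a standalone class like this):
> >
> >     import java.util.Optional;
> >
> >     interface Partition {} // stand-in for a partition's contents
> >
> >     // The state the client echoes back with the next page request.
> >     final class PageBoundary {
> >         final byte[] partitionKey;
> >         final long highestMutationId; // highest id seen for this partition so far
> >
> >         PageBoundary(byte[] partitionKey, long highestMutationId) {
> >             this.partitionKey = partitionKey;
> >             this.highestMutationId = highestMutationId;
> >         }
> >     }
> >
> >     interface PartitionSource { // stand-in for an sstable or memtable
> >         long highestMutationId(byte[] partitionKey);
> >         Partition read(byte[] partitionKey);
> >         // Rebuild the partition's contribution from the mutation log,
> >         // keeping only mutations with id <= maxId (the expensive path).
> >         Optional<Partition> reconstituteUpTo(byte[] partitionKey, long maxId);
> >     }
> >
> >     final class PagedPartitionReader {
> >         Optional<Partition> readForPage(PartitionSource source, PageBoundary boundary) {
> >             if (source.highestMutationId(boundary.partitionKey) <= boundary.highestMutationId) {
> >                 // Nothing newer than what the previous page saw: read directly.
> >                 return Optional.of(source.read(boundary.partitionKey));
> >             }
> >             // The source holds mutations newer than the page snapshot: rebuild
> >             // its contribution from the log, excluding the newer mutations.
> >             // Returns empty if the source only holds newer mutations.
> >             return source.reconstituteUpTo(boundary.partitionKey, boundary.highestMutationId);
> >         }
> >     }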
> > This isn’t free of course. When paging through large partitions, each page request becomes more likely to encounter mutations it needs to exclude, and it’s unclear how expensive that will be. Obviously it’s more expensive to reconstitute than to read, but on the other hand, only a single replica will be reading any data, so on balance it would still probably be less work for the cluster than running the normal read path.
> >
> > The other issue is that there isn’t a time bound on the paging payload, so if the application is taking long enough between pages that the log has been truncated, we’d have to throw an exception.
> >
> > This is mostly just me brainstorming though, and wouldn’t be something that would be in a v1.
> >
> > On Jan 9, 2025, at 2:07 PM, Blake Eggleston <beggles...@apple.com> wrote:
> >
> > So the ids themselves are in the memtable and are accessible as soon as they’re written, and need to be for the read path to work.
> >
> > We’re not able to reconcile the ids until we can guarantee that they won’t be merged with unreconciled data; that’s why they’re flushed before reconciliation.
> >
> > On Jan 9, 2025, at 10:53 AM, Josh McKenzie <jmcken...@apache.org> wrote:
> >
> > > We also can't remove mutation ids until they've been reconciled, so in the simplest implementation, we'd need to flush a memtable before reconciling, and there would never be a situation where you have purgeable mutation ids in the memtable.
> >
> > Got it. So effectively that data would be unreconcilable until such time as it was flushed and you had those ids to work with in the sstable metadata, and the process can force a flush to reconcile in those cases where you have mutations in the MT/CL combo that are transiently not subject to the reconciliation process due to that log being purged. Or you flush before purging the log, assuming we're not changing MT data structures to store ids (don't recall if that's specified in the CEP...)
> >
> > Am I grokking that?
> >
> > On Thu, Jan 9, 2025, at 1:49 PM, Blake Eggleston wrote:
> >
> > Hi Josh,
> >
> > You can think of reconciliation as analogous to incremental repair. Like incremental repair, you can't mix reconciled/unreconciled data without causing problems. We also can't remove mutation ids until they've been reconciled, so in the simplest implementation, we'd need to flush a memtable before reconciling, and there would never be a situation where you have purgeable mutation ids in the memtable.
> >
> > The production version of this will be more sophisticated about how it keeps this data separate so it can reliably support automatic reconciliation cadences that are higher than what you can do with incremental repair today, but that’s the short answer.
> >
> > It's also likely that the concept of log truncation will be removed in favor of going straight to cohort reconciliation in longer outages.
> >
> > Thanks,
> >
> > Blake
> >
> > On Jan 9, 2025, at 8:27 AM, Josh McKenzie <jmcken...@apache.org> wrote:
> >
> > Question re: Log Truncation (emphasis mine):
> >
> > > When the cluster is operating normally, log entries can be discarded once they are older than the last reconciliation time of their respective ranges. To prevent unbounded log growth during outages however, logs are still deleted once they reach some configurable amount of time (maybe 2 hours by default?). From here, all reconciliation processes behave the same as before, but they use mutation ids stored in sstable metadata for listing mutation ids and transmitting missing partitions.
> >
> > What happens when / if we have data living in a memtable past that time threshold that hasn't yet been flushed to an sstable? i.e. a low-velocity table, or a really tightly configured "purge my mutation reconciliation logs at time bound X".
> >
> > On Thu, Jan 9, 2025, at 10:07 AM, Chris Lohfink wrote:
> >
> > Is this something we can disable? I can see scenarios where this would be strictly and severely worse than existing scenarios where we don't need repairs, i.e. short-time-window data, millions of writes a second that get thrown out after a few hours. If that data is small partitions, we are nearly doubling the disk use for things we don't care about.
> >
> > Chris
> >
> > On Wed, Jan 8, 2025 at 9:01 PM guo Maxwell <cclive1...@gmail.com> wrote:
> >
> > After a brief look, I have 2 questions. If I ask something inappropriate, please feel free to correct me:
> >
> > 1. Does it support changing a table to use mutation tracking through ALTER TABLE if it did not use mutation tracking before?
> >
> > 2. > Available options for tables are keyspace, legacy, and logged, with the default being keyspace, which inherits the keyspace setting
> >
> > Do you think that keyspace_inherit (or another keyword that clearly explains the behavior) is better than the name keyspace? In addition, is legacy appropriate? Because this is a new feature, there is only the behavior of turning it on and off, and turning it off means not using this feature. If the keyword legacy is used, from the user's perspective, does that suggest they are using an old version of mutation tracking? Similar to the relationship between SAI and native 2i.
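> > To make the ALTER TABLE question concrete, I imagine the CQL would look something like this (the property name "replication_type" is my guess, not the CEP's final syntax; the three values are the ones quoted above):
> >
> >     import com.datastax.oss.driver.api.core.CqlSession;
> >
> >     public class MutationTrackingOptionExample {
> >         public static void main(String[] args) {
> >             try (CqlSession session = CqlSession.builder().build()) {
> >                 // NOTE: the property name below is hypothetical; only the
> >                 // values (keyspace / legacy / logged) come from the CEP text.
> >                 // Inherit the keyspace setting (the proposed default):
> >                 session.execute("ALTER TABLE ks.events WITH replication_type = 'keyspace'");
> >                 // Opt a single table into mutation tracking:
> >                 session.execute("ALTER TABLE ks.events WITH replication_type = 'logged'");
> >                 // Keep classic replication for this table:
> >                 session.execute("ALTER TABLE ks.events WITH replication_type = 'legacy'");
> >             }
> >         }
> >     }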
> > On Thu, Jan 9, 2025 at 06:14, Jon Haddad <j...@rustyrazorblade.com> wrote:
> >
> > JD, the fact that pagination is implemented as multiple queries is a design choice. A user performs a query with a fetch size of 1 or 100 and they will get different behavior.
> >
> > I'm not asking for anyone to implement MVCC. I'm asking for the docs around this to be correct. We should not use the term guarantee here; it's best effort.
> >
> > On Wed, Jan 8, 2025 at 2:06 PM J. D. Jordan <jeremiah.jor...@gmail.com> wrote:
> >
> > Your pagination case is not a violation of any guarantees Cassandra makes. It has never made guarantees across multiple queries. Trying to have MVCC/consistent data across multiple queries is a very different issue/problem from this CEP. If you want to have a discussion about MVCC I suggest creating a new thread.
> >
> > -Jeremiah
> >
> > On Jan 8, 2025, at 3:47 PM, Jon Haddad <j...@rustyrazorblade.com> wrote:
> >
> > > It's true that we can't offer multi-page write atomicity without some sort of MVCC. There are a lot of common query patterns that don't involve paging though, so it's not like the benefit of fixing write atomicity would only apply to a small subset of carefully crafted queries or something.
> >
> > Sure, it'll work a lot of the time, but we don't say "partition level write atomicity some of the time". We say guarantee. From the CEP:
> >
> > > In the case of read repair, since we are only reading and correcting the parts of a partition that we're reading and not the entire contents of a partition on each read, read repair can break our guarantee on partition level write atomicity. This approach also prevents meeting the monotonic read requirement for witness replicas, which has significantly limited its usefulness.
> >
> > I point this out because it's not well known, and we make a guarantee that isn't true, and while the CEP will reduce the number of cases in which we violate the guarantee, we will still have known edge cases where it doesn't hold up. So we should stop saying it.
> >
> > On Wed, Jan 8, 2025 at 1:30 PM Blake Eggleston <beggles...@apple.com> wrote:
> >
> > Thanks Dimitry and Jon, answers below
> >
> > > 1) Is a single separate commit log expected to be created for all tables with the new replication type?
> >
> > The plan is to still have a single commit log, but to only index mutations that have a mutation id.
> >
> > > 2) What is the granularity of storing mutation ids in a memtable, is it per cell?
> >
> > It would be per-partition.
> >
> > > 3) If we update the same row multiple times while it is in a memtable, are all mutation ids appended to a kind of collection?
> >
> > They would be, yes. We might be able to do something where we stop tracking mutations that have been superseded by newer mutations (same cells, higher timestamps), but I suspect that would be more trouble than it's worth and would be out of scope for v1.
> >
> > > 4) What is the expected size of a single id?
> >
> > It's currently 12 bytes: a 4-byte node id (from TCM) and an 8-byte HLC.
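> > In other words, something like this (field names are illustrative, not the actual implementation):
> >
> >     // Sketch of the 12-byte mutation id described above: a 4-byte node id
> >     // (from transactional cluster metadata) plus an 8-byte hybrid logical clock.
> >     final class MutationId implements Comparable<MutationId> {
> >         final int nodeId; // 4 bytes, assigned via TCM
> >         final long hlc;   // 8 bytes, hybrid logical clock
> >
> >         MutationId(int nodeId, long hlc) {
> >             this.nodeId = nodeId;
> >             this.hlc = hlc;
> >         }
> >
> >         // The HLC gives a cluster-wide order that tracks causality; the
> >         // node id breaks ties so ids from different replicas never collide.
> >         @Override
> >         public int compareTo(MutationId other) {
> >             int cmp = Long.compare(hlc, other.hlc);
> >             return cmp != 0 ? cmp : Integer.compare(nodeId, other.nodeId);
> >         }
> >
> >         // 12 bytes on the wire / in sstable metadata.
> >         byte[] serialize() {
> >             return java.nio.ByteBuffer.allocate(12).putInt(nodeId).putLong(hlc).array();
> >         }
> >     }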
> > > 5) Do we plan to support multi-table batches (single or multi-partition) for this replication type?
> >
> > This is intended to support all existing features; however, the tracking only happens at the mutation level, so the different mutations coming out of a multi-partition batch would all be tracked individually.
> >
> > > So even without repair mucking things up, we're unable to fulfill this promise except under the specific, ideal circumstance of querying a partition with only 1 page.
> >
> > It's true that we can't offer multi-page write atomicity without some sort of MVCC. There are a lot of common query patterns that don't involve paging though, so it's not like the benefit of fixing write atomicity would only apply to a small subset of carefully crafted queries or something.
> >
> > Thanks,
> >
> > Blake
> >
> > On Jan 8, 2025, at 12:23 PM, Jon Haddad <j...@rustyrazorblade.com> wrote:
> >
> > Very cool! I'll need to spend some time reading this over. One thing I did notice is this:
> >
> > > Cassandra promises partition level write atomicity. This means that, although writes are eventually consistent, a given write will either be visible or not visible. You're not supposed to see a partially applied write. However, read repair and short read protection can both "tear" mutations. In the case of read repair, this is because the data resolver only evaluates the data included in the client read. So if your read only covers a portion of a write that didn't reach a quorum, only that portion will be repaired, breaking write atomicity.
> >
> > Unfortunately there are more issues with this than just repair. Since we lack a consistency mechanism like MVCC while paginating, it's possible to do the following:
> >
> > thread A: reads a partition P with 10K rows, starting with the first page
> > thread B: another thread writes a batch to 2 rows in partition P, one on page 1, another on page 2
> > thread A: reads the second page of P, which has the mutation
> >
> > I've worked with users who have been surprised by this behavior, because pagination happens transparently.
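> > From the application side there's nothing that even marks the page boundary; the driver fetches the next page behind a plain iterator. Roughly, with the 4.x Java driver (keyspace/table/column names invented):
> >
> >     import com.datastax.oss.driver.api.core.CqlSession;
> >     import com.datastax.oss.driver.api.core.cql.ResultSet;
> >     import com.datastax.oss.driver.api.core.cql.Row;
> >     import com.datastax.oss.driver.api.core.cql.SimpleStatement;
> >
> >     public class PagingTearExample {
> >         public static void main(String[] args) {
> >             try (CqlSession session = CqlSession.builder().build()) {
> >                 SimpleStatement stmt = SimpleStatement
> >                     .newInstance("SELECT ck, v FROM ks.wide_partition WHERE pk = 1")
> >                     .setPageSize(100);
> >                 ResultSet rs = session.execute(stmt);
> >                 for (Row row : rs) {
> >                     // Every 100th iteration silently fetches the next page.
> >                     // A batch applied between those fetches can be half-visible:
> >                     // its page-2 rows show up even though its page-1 rows were
> >                     // read before the batch was written.
> >                     process(row);
> >                 }
> >             }
> >         }
> >
> >         static void process(Row row) { /* application logic */ }
> >     }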
> > So even without repair mucking things up, we're unable to fulfill this promise except under the specific, ideal circumstance of querying a partition with only 1 page.
> >
> > Jon
> >
> > On Wed, Jan 8, 2025 at 11:21 AM Blake Eggleston <beggles...@apple.com> wrote:
> >
> > Hello dev@,
> >
> > We'd like to propose CEP-45: Mutation Tracking for adoption by the community. CEP-45 proposes adding a replication mechanism to track and reconcile individual mutations, as well as processes to actively reconcile missing mutations.
> >
> > For keyspaces with mutation tracking enabled, the immediate benefits of this CEP are:
> > * reduced replication lag, via a continuous background reconciliation process
> > * elimination of the disk load caused by repair merkle tree calculation
> > * elimination of repair overstreaming
> > * reduction of the disk load of reads on the cluster to close to 1/CL
> > * fixes for longstanding mutation atomicity issues caused by read repair and short read protection
> >
> > Additionally, although it's outside the scope of this CEP, mutation tracking would enable:
> > * completion of witness replicas / transient replication, making the feature usable for all workloads
> > * lightweight witness-only datacenters
> >
> > The CEP is linked here: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking, but please keep the discussion on the dev list.
> >
> > Thanks!
> >
> > Blake Eggleston
>
> --
> http://twitter.com/tjake