Re: [DISCUSS] CEP-45: Mutation Tracking

Blake Eggleston Thu, 16 Jan 2025 09:13:01 -0800

I’ve been thinking about the paging atomicity issue. I think it could be fixed 
with mutation tracking, without having to support full-on MVCC.


When we reach a page boundary, we can send the highest mutation id we’ve seen 
for the partition we reached the paging boundary on. When we request another 
page, we send that high water mark back as part of the paging request.

Each sstable and memtable contributing to the read responses will know which 
mutations it has in each partition, so if we encounter one that has a higher id 
than we saw in the last page, we either reconstitute its data from mutations in 
the log, excluding the newer mutations, or exclude it entirely if it only has 
newer mutations.
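
A minimal sketch of that per-source decision, not taken from the CEP. It
assumes mutation ids can be compared as plain integers (the real ids are a node
id plus an HLC, and how they would be ordered is not settled here). The names
Source, Action, plan_source and plan_page are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    READ = "read"                  # nothing newer than the high water mark
    RECONSTITUTE = "reconstitute"  # mixed: rebuild from the log, skipping newer ids
    EXCLUDE = "exclude"            # only newer mutations: skip the source entirely


@dataclass(frozen=True)
class Source:
    """Hypothetical stand-in for one sstable or memtable."""
    name: str
    mutation_ids: frozenset  # ids this source holds for the partition (ints here)


def plan_source(source, high_water_mark):
    """Decide how one source contributes to the next page."""
    newer = {m for m in source.mutation_ids if m > high_water_mark}
    if not newer:
        return Action.READ
    if newer == set(source.mutation_ids):
        return Action.EXCLUDE
    return Action.RECONSTITUTE


def plan_page(sources, high_water_mark):
    """Map each source name to the action for this page request."""
    return {s.name: plan_source(s, high_water_mark) for s in sources}
```

With a high water mark of 5, a source holding ids {1, 2, 3} is read as usual,
one holding {3, 7} is reconstituted without 7, and one holding {8, 9} is
skipped.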

This isn’t free of course. When paging through large partitions, each page 
request becomes more likely to encounter mutations it needs to exclude, and 
it’s unclear how expensive that will be. Obviously it’s more expensive to 
reconstitute vs read, but on the other hand, only a single replica will be 
reading any data, so on balance it would still probably be less work for the 
cluster than running the normal read path.

The other issue is that there isn’t a time bound on the paging payload, so if 
the application is taking long enough between pages that the log has been 
truncated, we’d have to throw an exception.
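
As an illustration only, the truncation failure could look something like the
sketch below. It assumes integer mutation ids and models the log as a dict from
mutation id to payload, where a missing id means the entry was truncated. The
names reconstitute and PagingSnapshotUnavailable are hypothetical.

```python
class PagingSnapshotUnavailable(Exception):
    """Raised when a mutation needed to rebuild the page is gone from the log."""


def reconstitute(source_ids, high_water_mark, log):
    """Return payloads of mutations at or below the high water mark, in id order.

    `log` maps mutation id -> payload and stands in for the mutation log;
    an id missing from it is treated as truncated.
    """
    wanted = sorted(m for m in source_ids if m <= high_water_mark)
    missing = [m for m in wanted if m not in log]
    if missing:
        raise PagingSnapshotUnavailable(
            f"log entries truncated for mutation ids {missing}"
        )
    return [log[m] for m in wanted]
```

The check only covers ids at or below the high water mark, since newer
mutations are excluded from the page anyway.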

This is mostly just me brainstorming though, and wouldn’t be something that 
would be in a v1.

> On Jan 9, 2025, at 2:07 PM, Blake Eggleston <[email protected]> wrote:
> 
> So the ids themselves are in the memtable and are accessible as soon as 
> they’re written, and need to be for the read path to work.
> 
> We’re not able to reconcile the ids until we can guarantee that they won’t be 
> merged with unreconciled data, that’s why they’re flushed before 
> reconciliation.
> 
> 
>> On Jan 9, 2025, at 10:53 AM, Josh McKenzie <[email protected]> wrote:
>> 
>>> We also can't remove mutation ids until they've been reconciled, so in the 
>>> simplest implementation, we'd need to flush a memtable before reconciling, 
>>> and there would never be a situation where you have purgeable mutation ids 
>>> in the memtable.
>> Got it. So effectively that data would be unreconcilable until such time as 
>> it was flushed and you had those id's to work with in the sstable metadata, 
>> and the process can force a flush to reconcile in those cases where you have 
>> mutations in the MT/CL combo that are transiently not subject to the 
>> reconciliation process due to that log being purged. Or you flush before 
>> purging the log, assuming we're not changing MT data structures to store id 
>> (don't recall if that's specified in the CEP...)
>> 
>> Am I grokking that?
>> 
>> 
>> On Thu, Jan 9, 2025, at 1:49 PM, Blake Eggleston wrote:
>>> Hi Josh,
>>> 
>>> You can think of reconciliation as analogous to incremental repair. Like 
>>> incremental repair, you can't mix reconciled/unreconciled data without 
>>> causing problems. We also can't remove mutation ids until they've been 
>>> reconciled, so in the simplest implementation, we'd need to flush a 
>>> memtable before reconciling, and there would never be a situation where you 
>>> have purgeable mutation ids in the memtable.
>>> 
>>> The production version of this will be more sophisticated about how it 
>>> keeps this data separate so it can reliably support automatic 
>>> reconciliation cadences that are higher than what you can do with 
>>> incremental repair today, but that’s the short answer.
>>> 
>>> It's also likely that the concept of log truncation will be removed in 
>>> favor of going straight to cohort reconciliation in longer outages.
>>> 
>>> Thanks,
>>> 
>>> Blake
>>> 
>>>> On Jan 9, 2025, at 8:27 AM, Josh McKenzie <[email protected]> wrote:
>>>> 
>>>> Question re: Log Truncation 
>>>> <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=337676893#CEP45:MutationTracking-Logtruncation>
>>>>  (emphasis mine):
>>>> 
>>>>> When the cluster is operating normally, logs entries can be discarded 
>>>>> once they are older than the last reconciliation time of their respective 
>>>>> ranges. To prevent unbounded log growth during outages however, logs are 
>>>>> still deleted once they reach some configurable amount of time (maybe 2 
>>>>> hours by default?). From here, all reconciliation processes behave the 
>>>>> same as before, but they use mutation ids stored in sstable metadata for 
>>>>> listing mutation ids and transmit missing partitions.
>>>> 
>>>> What happens when / if we have data living in a memtable past that time 
>>>> threshold that hasn't yet been flushed to an sstable? i.e. low velocity 
>>>> table or a really tightly configured "purge my mutation reconciliation 
>>>> logs at time bound X".
>>>> 
>>>> On Thu, Jan 9, 2025, at 10:07 AM, Chris Lohfink wrote:
>>>>> Is this something we can disable? I can see scenarios where this would be 
>>>>> strictly and severely worse than existing scenarios where we don't need 
>>>>> repairs, i.e. short time window data, millions of writes a second that get 
>>>>> thrown out after a few hours. If that data is small partitions we are 
>>>>> nearly doubling the disk use for things we don't care about.
>>>>> 
>>>>> Chris
>>>>> 
>>>>> On Wed, Jan 8, 2025 at 9:01 PM guo Maxwell <[email protected] 
>>>>> <mailto:[email protected]>> wrote:
>>>>> After a quick read, I have 2 questions. If I ask something 
>>>>> inappropriate, please feel free to correct me:
>>>>> 
>>>>> 1. Can a table that did not use mutation tracking before be changed to 
>>>>> use it through ALTER TABLE?
>>>>> 2.
>>>>> Available options for tables are keyspace, legacy, and logged, with the 
>>>>> default being keyspace, which inherits the keyspace setting
>>>>>  
>>>>> Do you think that keyspace_inherit (or another keyword that clearly 
>>>>> explains the behavior) is better than the name keyspace?
>>>>> In addition, is legacy appropriate? Because this is a new feature, there 
>>>>> is only the behavior of turning it on and off. Turning it off means not 
>>>>> using this feature. 
>>>>> If the keyword legacy is used, would the user think they are using an 
>>>>> old version of mutation tracking? Similar to the relationship 
>>>>> between SAI and native2i.
>>>>> 
>>>>> On Thu, Jan 9, 2025 at 6:14 AM Jon Haddad <[email protected] 
>>>>> <mailto:[email protected]>> wrote:
>>>>> JD, the fact that pagination is implemented as multiple queries is a 
>>>>> design choice.  A user performs a query with fetch size 1 or 100 and they 
>>>>> will get different behavior. 
>>>>> 
>>>>> I'm not asking for anyone to implement MVCC.  I'm asking for the docs 
>>>>> around this to be correct.  We should not use the term guarantee here, 
>>>>> it's best effort.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Wed, Jan 8, 2025 at 2:06 PM J. D. Jordan <[email protected] 
>>>>> <mailto:[email protected]>> wrote:
>>>>> 
>>>>> Your pagination case is not a violation of any guarantees Cassandra 
>>>>> makes. It has never made guarantees across multiple queries.
>>>>> Trying to have MVCC/consistent data across multiple queries is a very 
>>>>> different issue/problem from this CEP.  If you want to have a discussion 
>>>>> about MVCC I suggest creating a new thread.
>>>>> 
>>>>> -Jeremiah
>>>>> 
>>>>>> On Jan 8, 2025, at 3:47 PM, Jon Haddad <[email protected] 
>>>>>> <mailto:[email protected]>> wrote:
>>>>>> 
>>>>>> > It's true that we can't offer multi-page write atomicity without some 
>>>>>> > sort of MVCC. There are a lot of common query patterns that don't 
>>>>>> > involve paging though, so it's not like the benefit of fixing write 
>>>>>> > atomicity would only apply to a small subset of carefully crafted 
>>>>>> > queries or something.
>>>>>> 
>>>>>> Sure, it'll work a lot, but we don't say "partition level write 
>>>>>> atomicity some of the time".  We say guarantee.  From the CEP:
>>>>>> 
>>>>>> > In the case of read repair, since we are only reading and correcting 
>>>>>> > the parts of a partition that we're reading and not the entire 
>>>>>> > contents of a partition on each read, read repair can break our 
>>>>>> > guarantee on partition level write atomicity. This approach also 
>>>>>> > prevents meeting the monotonic read requirement for witness replicas, 
>>>>>> > which has significantly limited its usefulness.
>>>>>> 
>>>>>> I point this out because it's not well known, and we make a guarantee 
>>>>>> that isn't true, and while the CEP will reduce the number of cases in 
>>>>>> which we violate the guarantee, we will still have known edge cases where 
>>>>>> it doesn't hold up.  So we should stop saying it. 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Wed, Jan 8, 2025 at 1:30 PM Blake Eggleston <[email protected] 
>>>>>> <mailto:[email protected]>> wrote:
>>>>>> Thanks Dimitry and Jon, answers below
>>>>>> 
>>>>>>> 1) Is a single separate commit log expected to be created for all 
>>>>>>> tables with the new replication type?
>>>>>> 
>>>>>> The plan is to still have a single commit log, but only index mutations 
>>>>>> with a mutation id. 
>>>>>> 
>>>>>>> 2) What is a granularity of storing mutation ids in memtable, is it per 
>>>>>>> cell?
>>>>>> 
>>>>>> It would be per-partition
>>>>>> 
>>>>>>> 3) If we update the same row multiple times while it is in a memtable - 
>>>>>>> are all mutation ids appended to a kind of collection?
>>>>>> 
>>>>>> They would, yes. We might be able to do something where we stop tracking 
>>>>>> mutations that have been superseded by newer mutations (same cells, 
>>>>>> higher timestamps), but I suspect that would be more trouble than it's 
>>>>>> worth and would be out of scope for v1.
>>>>>> 
>>>>>>> 4) What is the expected size of a single id?
>>>>>> 
>>>>>> It's currently 12 bytes: a 4 byte node id (from tcm) and an 8 byte hlc
>>>>>> 
>>>>>>> 5) Do we plan to support multi-table batches (single or 
>>>>>>> multi-partition) for this replication type?
>>>>>> 
>>>>>> 
>>>>>> This is intended to support all existing features, however the tracking 
>>>>>> only happens at the mutation level, so the different mutations coming 
>>>>>> out of a multi-partition batch would all be tracked individually
>>>>>> 
>>>>>>> So even without repair mucking things up, we're unable to fulfill this 
>>>>>>> promise except under the specific, ideal circumstance of querying a 
>>>>>>> partition with only 1 page.
>>>>>> 
>>>>>> 
>>>>>> It's true that we can't offer multi-page write atomicity without some 
>>>>>> sort of MVCC. There are a lot of common query patterns that don't 
>>>>>> involve paging though, so it's not like the benefit of fixing write 
>>>>>> atomicity would only apply to a small subset of carefully crafted 
>>>>>> queries or something.
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> Blake
>>>>>> 
>>>>>>> On Jan 8, 2025, at 12:23 PM, Jon Haddad <[email protected] 
>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>> 
>>>>>>> Very cool!  I'll need to spend some time reading this over.  One thing 
>>>>>>> I did notice is this:
>>>>>>> 
>>>>>>> > Cassandra promises partition level write atomicity. This means that, 
>>>>>>> > although writes are eventually consistent, a given write will either 
>>>>>>> > be visible or not visible. You're not supposed to see a partially 
>>>>>>> > applied write. However, read repair and short read protection can 
>>>>>>> > both "tear" mutations. In the case of read repair, this is because 
>>>>>>> > the data resolver only evaluates the data included in the client 
>>>>>>> > read. So if your read only covers a portion of a write that didn't 
>>>>>>> > reach a quorum, only that portion will be repaired, breaking write 
>>>>>>> > atomicity.
>>>>>>> 
>>>>>>> Unfortunately there are more issues with this than just repair.  Since we 
>>>>>>> lack a consistency mechanism like MVCC while paginating, it's possible 
>>>>>>> to do the following:
>>>>>>> 
>>>>>>> thread A: reads a partition P with 10K rows, starts by reading the 
>>>>>>> first page
>>>>>>> thread B: another thread writes a batch to 2 rows in partition P, one 
>>>>>>> on page 1, another on page 2
>>>>>>> thread A: reads the second page of P which has the mutation.
>>>>>>> 
>>>>>>> I've worked with users who have been surprised by this behavior, 
>>>>>>> because pagination happens transparently.
>>>>>>> 
>>>>>>> So even without repair mucking things up, we're unable to fulfill this 
>>>>>>> promise except under the specific, ideal circumstance of querying a 
>>>>>>> partition with only 1 page.
>>>>>>> 
>>>>>>> Jon
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Wed, Jan 8, 2025 at 11:21 AM Blake Eggleston <[email protected] 
>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>> Hello dev@,
>>>>>>> 
>>>>>>> We'd like to propose CEP-45: Mutation Tracking for adoption by the 
>>>>>>> community. CEP-45 proposes adding a replication mechanism to track and 
>>>>>>> reconcile individual mutations, as well as processes to actively 
>>>>>>> reconcile missing mutations.
>>>>>>> 
>>>>>>> For keyspaces with mutation tracking enabled, the immediate benefits of 
>>>>>>> this CEP are:
>>>>>>> * reduced replication lag with a continuous background reconciliation 
>>>>>>> process
>>>>>>> * eliminate the disk load caused by repair merkle tree calculation
>>>>>>> * eliminate repair overstreaming
>>>>>>> * reduce disk load of reads on cluster to close to 1/CL
>>>>>>> * fix longstanding mutation atomicity issues caused by read repair and 
>>>>>>> short read protection
>>>>>>> 
>>>>>>> Additionally, although it's outside the scope of this CEP, mutation 
>>>>>>> tracking would enable:
>>>>>>> * completion of witness replicas / transient replication, making the 
>>>>>>> feature usable for all workloads
>>>>>>> * lightweight witness only datacenters
>>>>>>> 
>>>>>>> The CEP is linked here: 
>>>>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking,
>>>>>>>  but please keep the discussion on the dev list.
>>>>>>> 
>>>>>>> Thanks!
>>>>>>> 
>>>>>>> Blake Eggleston
>
