Hi dev@,

Looks like it's been about 10 days since the last message here. Are there any other comments before I put it up for a vote?
Thanks, Blake > On Jan 18, 2025, at 12:33 PM, Blake Eggleston <beggles...@apple.com> wrote: > > That's an interesting idea. Basically allow for a window of uncertainty > between the memtable and log and merge mutations within that window directly > into the response. It sounds like something that could work. > > I'll have to think about how not embedding id info into the storage layer > might interact with the sstable silo requirements we have for reconciliation > (basically the same thing we do for incremental repair), but it's likely you > could do something similar there as well. > >> On Jan 18, 2025, at 12:16 PM, Benedict <bened...@apache.org> wrote: >> >> That’s great to hear, I had thought the goal for embedding this information >> in sstables was that the log could be truncated. If not, is the below >> snippet the main motivation? >> >>> For the nodes returning data _and_ mutation ids, the data and mutation ids >>> need to describe each other exactly. If the data returned is missing data >>> the mutation ids say are there, or has data the mutation ids say aren't, >>> you'll have a read correctness issue. >> >> >> If so, I don’t think this is really a problem and we should perhaps >> reconsider. With the magic of LSM we can merge redundant information from >> the log into our read response, so we only need to be sure we know a point >> in the log before which data must be in memtables. Anything after that point >> might or might not be, and can simply be merged into the read response >> (potentially redundantly). >> >> This would seem to fall neatly into the reconciliation read path anyway; we >> are looking for any data in a (local or remote) journal that we haven’t >> written to the data store yet. If it isn’t known to be durable at a majority >> then we have to perform a distributed write of the mutation. It doesn’t seem >> like we need to do anything particularly special? >> >> We can wait until we have the total set of mutations to merge and then we >> have our complete and consistent read response >> >>> wouldn't be a bad idea to write most recent mutation id to a table every >>> few seconds asynchronously >> >> For accord we will write reservation records in advance so we can guarantee >> we don’t go backwards. That is, we will periodically declare a point eg 10s >> in the future that on restart we will have to first let elapse if we’re >> behind. >> >>> On 18 Jan 2025, at 18:31, Blake Eggleston <beggles...@apple.com> wrote: >>> >>> No, mutations are kept intact. If a node is missing a multi-table >>> mutation, it will receive the entire mutation on reconciliation. >>> >>> Regarding HLCs, I vaguely remember hearing about a paxos outage maybe 9-10 >>> years ago that was related to a leap hour or leap second or something >>> causing clocks to not behave as expected and ballots to be created slightly >>> in the past. There may be some rare edge cases we're not thinking about and >>> it wouldn't be a bad idea to write most recent mutation id to a table every >>> few seconds asynchronously so we don't create a giant mess if we restart >>> during them. >>> >>>> On Jan 18, 2025, at 2:18 AM, Benedict <bened...@apache.org> wrote: >>>> >>>> Does this approach potentially fail to guarantee multi table atomicity? If >>>> we’re reconciling mutation ids separately per table, an atomic batch write >>>> might get reconciled for one table but not another? 
I know that atomic >>>> batch updates on a single partition key to multiple tables is an important >>>> property for some users (though, read repair suffers this same problem - >>>> but it would be a real shame not to close this gap while we’re fixing our >>>> semantics, so we’re left only with paging isolation to contend with in >>>> future) >>>> >>>> Regarding unique HLCs Jon, before we go to prod in any cluster we’ll want >>>> Accord to guarantee HLCs are unique, so we’ll probably have a journal >>>> record reserve a batch of HLCs in advance, so we know what HLC it is safe >>>> to reset to on restart. I’m sure this work can use the same feature, >>>> though I agree with Blake it’s likely an unrealistic case in anything but >>>> adversarial test scenarios. >>>> >>>>> On 17 Jan 2025, at 22:52, Blake Eggleston <beggles...@apple.com> wrote: >>>>> >>>>> >>>>> Hi Jon, thanks for the excellent questions, answers below >>>>> >>>>>> Write Path - for recovery, how does a node safely recover the highest >>>>>> hybrid logical clock it has issued? Checking the last entry in the >>>>>> addressable log is insufficient unless we ensure every individual update >>>>>> is durable, rather than batched/periodic. Something like leasing to an >>>>>> upper bound could work. >>>>> >>>>> It doesn't. We assume that the time it takes to restart will prevent >>>>> issuing ids from the (logical) past. The HLC currently uses time in >>>>> milliseconds, and multiplies that into microseconds. So as long as a >>>>> given node is coordinating less than 1,000,000 writes a second and takes >>>>> more than a second to startup, that shouldn't be possible. >>>>> >>>>>> SSTable Metadata - is this just a simple set of mutation ids, or do they >>>>>> map to mutated partitions, or is it a multimap of partitions to mutation >>>>>> id? (question is motivated by not understanding how they are used after >>>>>> log truncation and during bootstrap). >>>>> >>>>> It's basically a map of partition keys to a set of mutation ids that are >>>>> represented by that sstable. Mutation ids can't belong to more than a >>>>> single partition key per table, so no multimap. After full reconciliation >>>>> / log truncation, the ids are not used and can be removed on compaction. >>>>> The non-reconciled log truncation idea discussed in the CEP seems like it >>>>> will go away in favor of partial/cohort reconciliations. They're included >>>>> in the sstable in lieu of including a second log index mapping keys to >>>>> mutations ids on the log, although it may have other uses, such as fixing >>>>> mutation atomicity across pages. >>>>> >>>>> What's not stated explicitly in the CEP (since I only realized it once I >>>>> started prototyping) is that embedding the mutation ids in the storage >>>>> layer solves a concurrency issue on the read path. For the nodes >>>>> returning data _and_ mutation ids, the data and mutation ids need to >>>>> describe each other exactly. If the data returned is missing data the >>>>> mutation ids say are there, or has data the mutation ids say aren't, >>>>> you'll have a read correctness issue. Since appending to the commit log >>>>> and updating the memtable aren't really synchronized from the perspective >>>>> of read visibility, putting the ids in the memtable on write solves this >>>>> issue while preventing having to change how commit log / memtable >>>>> concurrency works. Including the ids in the sstable isn't strictly >>>>> necessary to fix the concurrency issue, but is convenient. 
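(An illustrative aside on the answer above: below is a minimal Java sketch of the idea that a memtable entry keeps a partition's rows and the ids of the mutations that produced them in a single snapshot, so a read can never observe data without the matching ids or vice versa. All class, field and method names are made up for illustration and are not from the CEP or the Cassandra codebase; mutation ids are simplified to longs.)

    import java.util.*;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative only: a partition's rows and the ids of the mutations that
    // produced them live in one immutable snapshot, replaced atomically on write.
    final class TrackedPartition {
        final NavigableMap<String, String> rows; // clustering -> value (simplified)
        final Set<Long> mutationIds;             // ids represented by this data

        TrackedPartition(NavigableMap<String, String> rows, Set<Long> mutationIds) {
            this.rows = rows;
            this.mutationIds = mutationIds;
        }

        // Applying a write produces a new snapshot containing both the new rows
        // and the new mutation id, so readers see the pair together or not at all.
        TrackedPartition apply(long mutationId, Map<String, String> update) {
            NavigableMap<String, String> newRows = new TreeMap<>(rows);
            newRows.putAll(update);
            Set<Long> newIds = new HashSet<>(mutationIds);
            newIds.add(mutationId);
            return new TrackedPartition(Collections.unmodifiableNavigableMap(newRows),
                                        Collections.unmodifiableSet(newIds));
        }
    }

    // Illustrative only: per-partition lookup; flushing this structure is what would
    // carry the partition key -> mutation id set map into sstable metadata.
    final class TrackedMemtable {
        private final ConcurrentHashMap<String, TrackedPartition> partitions = new ConcurrentHashMap<>();

        void write(String partitionKey, long mutationId, Map<String, String> update) {
            partitions.compute(partitionKey, (k, existing) -> {
                TrackedPartition base = existing != null
                        ? existing
                        : new TrackedPartition(Collections.emptyNavigableMap(), Collections.emptySet());
                return base.apply(mutationId, update);
            });
        }

        // A read returns one snapshot, so the data and the mutation ids it reports
        // always describe each other exactly.
        TrackedPartition read(String partitionKey) {
            return partitions.get(partitionKey);
        }
    }
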
>>>>> >>>>>> Log Reconciliation - how is this scheduled within a replica group? Are >>>>>> there any interactions/commonality with CEP-37 the unified repair >>>>>> scheduler? >>>>> >>>>> It's kind of hand wavy at the moment tbh. If CEP-37 meets our scheduling >>>>> needs and is ready in time, it would be great to not have to reinvent it. >>>>> However, the read path will be a lot more sensitive to unreconciled data >>>>> that it is to unrepaired data, so the 2 systems may end up having >>>>> different enough requirements that we have to do something separate. >>>>> >>>>>> Cohort reconciliation >>>>>> - Are the cohorts ad-hoc for each partial reconciliation, are there >>>>>> restrictions about how many cohorts an instance belongs to (one at a >>>>>> time)? What determines the membership of a cohort, is it agreed as part >>>>>> of running the partial reconciliation? Other members of the cohort may >>>>>> be able to see a different subset of the nodes e.g. network >>>>>> misconfiguration with three DCs where one DC is missing routing to >>>>>> another. >>>>>> - I assume the cohort reconciliation id is reused for subsequent partial >>>>>> reconciliations only if the cohort members remain the same. >>>>> >>>>> Cohorts are basically the nodes that can talk to each other. The cohort >>>>> reconciliation has the same sort of mutation low bound logic as full >>>>> reconciliations, so a given node/range combo can only belong to a single >>>>> cohort at a time, and that's determined as part of the reconciliation >>>>> setup process. The cohort id is reused for subsequent partial >>>>> reconciliations so long as the members remain the same. This lets us >>>>> compact data from the cohort together. >>>>> >>>>>> - Are the reconciled mutations in the addressable log rewritten under >>>>>> the cohort reconciliation id, or is the reference to them updated? >>>>> >>>>> So for basic mutation tracking, the log entries are removed and you're >>>>> left with an sstable silo, like pending repairs. For instance, in cases >>>>> where you have a node down for a week, you don't want to accumulate a >>>>> weeks worth of data and a weeks worth of logs. In the future, for >>>>> witnesses where you don't have sstables, it's less clear. Maybe it will >>>>> be better to keep a weeks worth of logs around, maybe it will be better >>>>> to periodically materialize the cohort log data into sstables. >>>>> >>>>>> - When the partition heals, if you process a read request that contains >>>>>> a cohort reconciliation id, is there a risk that you have to transfer >>>>>> large amounts of data before you can process, or does the addressable >>>>>> log allow filtering by partition? >>>>> >>>>> Yeah that's a risk. We could probably determine during the read that a >>>>> given cohort does not contain a key for a read, but if it does, you'll >>>>> have to wait. The reads themselves shouldn't be initiating reconciliation >>>>> for the cohorts though, nodes will start exchanging cohort data as soon >>>>> as they're able to connect to a previously unreachable node. I think read >>>>> speculation will help here, and we may also be able to do something where >>>>> we pull in just the data we need to the read to minimize impact on >>>>> availability while maintaining read monotonicity. >>>>> >>>>>> - Should the code be structured that cohort reconciliations are the >>>>>> expected case, and as an optimization if all replicas are in the cohort, >>>>>> then they >>>>>> can bump the lower id. 
>>>>> >>>>> That's not a bad idea, both processes will have a lot in common. >>>>> >>>>>> - Are cohort ids issued the same way as regular mutation ids issued by a >>>>>> single host (the initiator?) or do they have a different structure? >>>>> >>>>> >>>>> I'm not sure, I'd kind of assumed we'd just call UUID.randomUUID >>>>> everytime the cohort changed. >>>>> >>>>>> Log truncation - if log truncation occurs and mutations come from >>>>>> sstables, will the mutations be the actual logged mutation (seems >>>>>> unlikely), or will Cassandra have to construct pseudo-mutations that >>>>>> represent the current state in the sstable? If so, would the inclusion >>>>>> of later mutations after the mutation/cohort id in that partition cause >>>>>> any issues with reconciliation? (I see there's a hint about this in the >>>>>> bootstrap section below) >>>>> >>>>> So the log truncation stuff will probably go away in favor of cohort >>>>> reconciliation. The idea though was that yeah, you'd have a sort of >>>>> pseudo multi-mutation (assuming there are multiple ids represented) >>>>> created from the sstable partition. Inclusion of later mutations >>>>> shouldn't cause any problems. Everything should be commutative so long as >>>>> we're not purging tombstones (which we won't if the data isn't fully >>>>> reconciled). >>>>> >>>>>> Repair - Will sstables still be split into repaired/pending/unrepaired? >>>>>> Preserving that would make it possible to switch between strategies, >>>>>> it doesn't seem that complex, but maybe I'm missing something. >>>>> >>>>> >>>>> Yes, keeping that consistent and easy to migrate is a goal. >>>>> >>>>>> Bootstrap/topology changes - what about RF changes. I don't think TCM >>>>>> currently handles that. Would it need to be added to make mutation >>>>>> tracking work? Where would the metadata be stored to indicate preferred >>>>>> sources for missing mutations? Would that also extend to nodes that have >>>>>> had to perform log truncation? >>>>> >>>>> >>>>> That's a really good question, I hadn't thought of that. It would be nice >>>>> if RF changes got the same pending/streaming treatment that token range >>>>> changes did. Not sure how difficult it would be to add that for at least >>>>> tables that are using mutation tracking. Using the normal add/repair >>>>> workflow we do now would probably be workable though, and would have the >>>>> advantage of the coordinator being able to detect and exclude nodes that >>>>> haven't received data for their new ranges though. >>>>> >>>>>> Compaction - how are the mutation ids in sstable metadata handled when >>>>>> multiple sstables are compacted, particularly with something like >>>>>> range aware writers or when splitting the output over multiple >>>>>> size-bounded sstables. A simple union could expand the number >>>>>> of sstables to consider after log truncation. >>>>> >>>>> On compaction the ids for a partition would be merged, but ids that have >>>>> been reconciled are also removed. I'm not sure if we split partitions >>>>> across multiple sstables on compaction though. I suppose it's possible, >>>>> though I don't know if it would have an impact if the log truncation part >>>>> of the CEP ends up going away. >>>>> >>>>> Thanks, >>>>> >>>>> Blake >>>>> >>>>>> On Jan 17, 2025, at 9:27 AM, Jon Meredith <jonmered...@apache.org> wrote: >>>>>> >>>>>> I had another read through for the CEP and had some follow up >>>>>> questions/thoughts. 
>>>>>> >>>>>> Write Path - for recovery, how does a node safely recover the highest >>>>>> hybrid logical clock it has issued? Checking the last entry in the >>>>>> addressable log is insufficient unless we ensure every individual update >>>>>> is durable, rather than batched/periodic. Something like leasing to an >>>>>> upper bound could work. >>>>>> >>>>>> SSTable Metadata - is this just a simple set of mutation ids, or do they >>>>>> map to mutated partitions, or is it a multimap of partitions to mutation >>>>>> id? (question is motivated by not understanding how they are used after >>>>>> log truncation and during bootstrap). >>>>>> >>>>>> Log Reconciliation - how is this scheduled within a replica group? Are >>>>>> there any interactions/commonality with CEP-37 the unified repair >>>>>> scheduler? >>>>>> >>>>>> Cohort reconciliation >>>>>> - Are the cohorts ad-hoc for each partial reconciliation, are there >>>>>> restrictions about how many cohorts an instance belongs to (one at a >>>>>> time)? What determines the membership of a cohort, is it agreed as part >>>>>> of running the partial reconciliation? Other members of the cohort may >>>>>> be able to see a different subset of the nodes e.g. network >>>>>> misconfiguration with three DCs where one DC is missing routing to >>>>>> another. >>>>>> - I assume the cohort reconciliation id is reused for subsequent partial >>>>>> reconciliations only if the cohort members remain the same. >>>>>> - Are the reconciled mutations in the addressable log rewritten under >>>>>> the cohort reconciliation id, or is the reference to them updated? >>>>>> - When the partition heals, if you process a read request that contains >>>>>> a cohort reconciliation id, is there a risk that you have to transfer >>>>>> large amounts of data before you can process, or does the addressable >>>>>> log allow filtering by partition? >>>>>> - Should the code be structured that cohort reconciliations are the >>>>>> expected case, and as an optimization if all replicas are in the cohort, >>>>>> then they >>>>>> can bump the lower id. >>>>>> - Are cohort ids issued the same way as regular mutation ids issued by a >>>>>> single host (the initiator?) or do they have a different structure? >>>>>> >>>>>> Log truncation - if log truncation occurs and mutations come from >>>>>> sstables, will the mutations be the actual logged mutation (seems >>>>>> unlikely), or will Cassandra have to construct pseudo-mutations that >>>>>> represent the current state in the sstable? If so, would the inclusion >>>>>> of later mutations after the mutation/cohort id in that partition cause >>>>>> any issues with reconciliation? (I see there's a hint about this in the >>>>>> bootstrap section below) >>>>>> >>>>>> Repair - Will sstables still be split into repaired/pending/unrepaired? >>>>>> Preserving that would make it possible to switch between strategies, >>>>>> it doesn't seem that complex, but maybe I'm missing something. >>>>>> >>>>>> Bootstrap/topology changes - what about RF changes. I don't think TCM >>>>>> currently handles that. Would it need to be added to make mutation >>>>>> tracking work? Where would the metadata be stored to indicate preferred >>>>>> sources for missing mutations? Would that also extend to nodes that have >>>>>> had to perform log truncation? 
>>>>>> >>>>>> Additional concerns >>>>>> >>>>>> Compaction - how are the mutation ids in sstable metadata handled when >>>>>> multiple sstables are compacted, particularly with something like >>>>>> range aware writers or when splitting the output over multiple >>>>>> size-bounded sstables. A simple union could expand the number >>>>>> of sstables to consider after log truncation. >>>>>> >>>>>> Thanks! >>>>>> Jon >>>>>> >>>>>> On Thu, Jan 16, 2025 at 11:51 AM Blake Eggleston <beggles...@apple.com >>>>>> <mailto:beggles...@apple.com>> wrote: >>>>>>> I’m not sure Josh. Jon brought up paging and the documentation around >>>>>>> it because our docs say we provide mutation level atomicity, but we >>>>>>> also provide drivers that page transparently. So from the user’s >>>>>>> perspective, a single “query” breaks this guarantee unpredictably. >>>>>>> Occasional exceptions with a clear message explaining what is >>>>>>> happening, why, and how to fix it is going to be less confusing that >>>>>>> tracking down application misbehavior caused by this. >>>>>>> >>>>>>> It would also be easy to make the time horizon for paging constant and >>>>>>> configurable (keep at least 20 minutes of logs, for instance), that >>>>>>> would at least provide a floor of predictability. >>>>>>> >>>>>>>> On Jan 16, 2025, at 10:08 AM, Josh McKenzie <jmcken...@apache.org >>>>>>>> <mailto:jmcken...@apache.org>> wrote: >>>>>>>> >>>>>>>>> The other issue is that there isn’t a time bound on the paging >>>>>>>>> payload, so if the application is taking long enough between pages >>>>>>>>> that the log has been truncated, we’d have to throw an exception. >>>>>>>> My hot-take is that this relationship between how long you're taking >>>>>>>> to page, how much data you're processing / getting back, and ingest / >>>>>>>> flushing frequency all combined leading to unpredictable exceptions >>>>>>>> would be a bad default from a UX perspective compared to a default of >>>>>>>> "a single page of data has atomicity; multiple pages do not". Maybe >>>>>>>> it's just because that's been our default for so long. >>>>>>>> >>>>>>>> The simplicity of having a flag that's "don't make my pages atomic and >>>>>>>> they always return vs. make my pages atomic and throw exceptions if >>>>>>>> the metadata I need is yoinked while I page" is pretty attractive to >>>>>>>> me. >>>>>>>> >>>>>>>> Really interesting thought, using these logs as "partial MVCC" while >>>>>>>> they're available specifically for what could/should be a very tight >>>>>>>> timeline use-case (paging). >>>>>>>> >>>>>>>> On Thu, Jan 16, 2025, at 12:41 PM, Jake Luciani wrote: >>>>>>>>> This is very cool! >>>>>>>>> >>>>>>>>> I have done a POC that was similar but more akin to Aurora paper >>>>>>>>> whereby the commitlog itself would repair itself from peers >>>>>>>>> proactively using the seekable commitlog. >>>>>>>>> >>>>>>>>> Can you explain the reason you prefer to reconcile on read? Having a >>>>>>>>> consistent commitlog would solve so many problems like CDC, PITR, MVs >>>>>>>>> etc. >>>>>>>>> >>>>>>>>> Jake >>>>>>>>> >>>>>>>>> On Thu, Jan 16, 2025 at 12:13 PM Blake Eggleston >>>>>>>>> <beggles...@apple.com <mailto:beggles...@apple.com>> wrote: >>>>>>>>> > >>>>>>>>> > I’ve been thinking about the paging atomicity issue. I think it >>>>>>>>> > could be fixed with mutation tracking and without having to support >>>>>>>>> > full on MVCC. 
>>>>>>>>> > >>>>>>>>> > When we reach a page boundary, we can send the highest mutation id >>>>>>>>> > we’ve seen for the partition we reached the paging boundary on. >>>>>>>>> > When we request another page, we send that high water mark back as >>>>>>>>> > part of the paging request. >>>>>>>>> > >>>>>>>>> > Each sstable and memtable contributing to the read responses will >>>>>>>>> > know which mutations it has in each partition, so if we encounter >>>>>>>>> > one that has a higher id than we saw in the last page, we >>>>>>>>> > reconstitute its data from mutations in the log, excluding the >>>>>>>>> > newer mutations., or exclude it entirely if it only has newer >>>>>>>>> > mutations. >>>>>>>>> > >>>>>>>>> > This isn’t free of course. When paging through large partitions, >>>>>>>>> > each page request becomes more likely to encounter mutations it >>>>>>>>> > needs to exclude, and it’s unclear how expensive that will be. >>>>>>>>> > Obviously it’s more expensive to reconstitute vs read, but on the >>>>>>>>> > other hand, only a single replica will be reading any data, so on >>>>>>>>> > balance it would still probably be less work for the cluster than >>>>>>>>> > running the normal read path. >>>>>>>>> > >>>>>>>>> > The other issue is that there isn’t a time bound on the paging >>>>>>>>> > payload, so if the application is taking long enough between pages >>>>>>>>> > that the log has been truncated, we’d have to throw an exception. >>>>>>>>> > >>>>>>>>> > This is mostly just me brainstorming though, and wouldn’t be >>>>>>>>> > something that would be in a v1. >>>>>>>>> > >>>>>>>>> > On Jan 9, 2025, at 2:07 PM, Blake Eggleston <beggles...@apple.com >>>>>>>>> > <mailto:beggles...@apple.com>> wrote: >>>>>>>>> > >>>>>>>>> > So the ids themselves are in the memtable and are accessible as >>>>>>>>> > soon as they’re written, and need to be for the read path to work. >>>>>>>>> > >>>>>>>>> > We’re not able to reconcile the ids until we can guarantee that >>>>>>>>> > they won’t be merged with unreconciled data, that’s why they’re >>>>>>>>> > flushed before reconciliation. >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > On Jan 9, 2025, at 10:53 AM, Josh McKenzie <jmcken...@apache.org >>>>>>>>> > <mailto:jmcken...@apache.org>> wrote: >>>>>>>>> > >>>>>>>>> > We also can't remove mutation ids until they've been reconciled, so >>>>>>>>> > in the simplest implementation, we'd need to flush a memtable >>>>>>>>> > before reconciling, and there would never be a situation where you >>>>>>>>> > have purgeable mutation ids in the memtable. >>>>>>>>> > >>>>>>>>> > Got it. So effectively that data would be unreconcilable until such >>>>>>>>> > time as it was flushed and you had those id's to work with in the >>>>>>>>> > sstable metadata, and the process can force a flush to reconcile in >>>>>>>>> > those cases where you have mutations in the MT/CL combo that are >>>>>>>>> > transiently not subject to the reconciliation process due to that >>>>>>>>> > log being purged. Or you flush before purging the log, assuming >>>>>>>>> > we're not changing MT data structures to store id (don't recall if >>>>>>>>> > that's specified in the CEP...) >>>>>>>>> > >>>>>>>>> > Am I grokking that? >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > On Thu, Jan 9, 2025, at 1:49 PM, Blake Eggleston wrote: >>>>>>>>> > >>>>>>>>> > Hi Josh, >>>>>>>>> > >>>>>>>>> > You can think of reconciliation as analogous to incremental repair. >>>>>>>>> > Like incremental repair, you can't mix reconciled/unreconciled data >>>>>>>>> > without causing problem. 
We also can't remove mutation ids until >>>>>>>>> > they've been reconciled, so in the simplest implementation, we'd >>>>>>>>> > need to flush a memtable before reconciling, and there would never >>>>>>>>> > be a situation where you have purgeable mutation ids in the >>>>>>>>> > memtable. >>>>>>>>> > >>>>>>>>> > The production version of this will be more sophisticated about how >>>>>>>>> > it keeps this data separate to it can reliably support automatic >>>>>>>>> > reconciliation cadences that are higher than what you can do with >>>>>>>>> > incremental repair today, but that’s the short answer. >>>>>>>>> > >>>>>>>>> > It's also likely that the concept of log truncation will be removed >>>>>>>>> > in favor of going straight to cohort reconciliation in longer >>>>>>>>> > outages. >>>>>>>>> > >>>>>>>>> > Thanks, >>>>>>>>> > >>>>>>>>> > Blake >>>>>>>>> > >>>>>>>>> > On Jan 9, 2025, at 8:27 AM, Josh McKenzie <jmcken...@apache.org >>>>>>>>> > <mailto:jmcken...@apache.org>> wrote: >>>>>>>>> > >>>>>>>>> > Question re: Log Truncation (emphasis mine): >>>>>>>>> > >>>>>>>>> > When the cluster is operating normally, logs entries can be >>>>>>>>> > discarded once they are older than the last reconciliation time of >>>>>>>>> > their respective ranges. To prevent unbounded log growth during >>>>>>>>> > outages however, logs are still deleted once they reach some >>>>>>>>> > configurable amount of time (maybe 2 hours by default?). From here, >>>>>>>>> > all reconciliation processes behave the same as before, but they >>>>>>>>> > use mutation ids stored in sstable metadata for listing mutation >>>>>>>>> > ids and transmit missing partitions. >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > What happens when / if we have data living in a memtable past that >>>>>>>>> > time threshold that hasn't yet been flushed to an sstable? i.e. low >>>>>>>>> > velocity table or a really tightly configured "purge my mutation >>>>>>>>> > reconciliation logs at time bound X". >>>>>>>>> > >>>>>>>>> > On Thu, Jan 9, 2025, at 10:07 AM, Chris Lohfink wrote: >>>>>>>>> > >>>>>>>>> > Is this something we can disable? I can see scenarios where this >>>>>>>>> > would be strictly and severely worse then existing scenarios where >>>>>>>>> > we don't need repairs. ie short time window data, millions of >>>>>>>>> > writes a second that get thrown out after a few hours. If that data >>>>>>>>> > is small partitions we are nearly doubling the disk use for things >>>>>>>>> > we don't care about. >>>>>>>>> > >>>>>>>>> > Chris >>>>>>>>> > >>>>>>>>> > On Wed, Jan 8, 2025 at 9:01 PM guo Maxwell <cclive1...@gmail.com >>>>>>>>> > <mailto:cclive1...@gmail.com>> wrote: >>>>>>>>> > >>>>>>>>> > After a brief understanding, there are 2 questions from me, If I >>>>>>>>> > ask something inappropriate, please feel free to correct me : >>>>>>>>> > >>>>>>>>> > 1、 Does it support changing the table to support mutation tracking >>>>>>>>> > through ALTER TABLE if it does not support mutation tracking before? >>>>>>>>> > 2、 >>>>>>>>> > >>>>>>>>> > Available options for tables are keyspace, legacy, and logged, with >>>>>>>>> > the default being keyspace, which inherits the keyspace setting >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > Do you think that keyspace_inherit (or other keywords that clearly >>>>>>>>> > explain the behavior ) is better than name keyspace ? >>>>>>>>> > In addition, is legacy appropriate? Because this is a new feature, >>>>>>>>> > there is only the behavior of turning it on and off. Turning it off >>>>>>>>> > means not using this feature. 
>>>>>>>>> > If the keyword legacy is used, from the user's perspective, is it >>>>>>>>> > using an old version of the mutation tracking? Similar to the >>>>>>>>> > relationship between SAI and native2i. >>>>>>>>> > >>>>>>>>> > Jon Haddad <j...@rustyrazorblade.com >>>>>>>>> > <mailto:j...@rustyrazorblade.com>> 于2025年1月9日周四 06:14写道: >>>>>>>>> > >>>>>>>>> > JD, the fact that pagination is implemented as multiple queries is >>>>>>>>> > a design choice. A user performs a query with fetch size 1 or 100 >>>>>>>>> > and they will get different behavior. >>>>>>>>> > >>>>>>>>> > I'm not asking for anyone to implement MVCC. I'm asking for the >>>>>>>>> > docs around this to be correct. We should not use the term >>>>>>>>> > guarantee here, it's best effort. >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > On Wed, Jan 8, 2025 at 2:06 PM J. D. Jordan >>>>>>>>> > <jeremiah.jor...@gmail.com <mailto:jeremiah.jor...@gmail.com>> >>>>>>>>> > wrote: >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > Your pagination case is not a violation of any guarantees Cassandra >>>>>>>>> > makes. It has never made guarantees across multiple queries. >>>>>>>>> > Trying to have MVCC/consistent data across multiple queries is a >>>>>>>>> > very different issue/problem from this CEP. If you want to have a >>>>>>>>> > discussion about MVCC I suggest creating a new thread. >>>>>>>>> > >>>>>>>>> > -Jeremiah >>>>>>>>> > >>>>>>>>> > On Jan 8, 2025, at 3:47 PM, Jon Haddad <j...@rustyrazorblade.com >>>>>>>>> > <mailto:j...@rustyrazorblade.com>> wrote: >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > > It's true that we can't offer multi-page write atomicity without >>>>>>>>> > > some sort of MVCC. There are a lot of common query patterns that >>>>>>>>> > > don't involve paging though, so it's not like the benefit of >>>>>>>>> > > fixing write atomicity would only apply to a small subset of >>>>>>>>> > > carefully crafted queries or something. >>>>>>>>> > >>>>>>>>> > Sure, it'll work a lot, but we don't say "partition level write >>>>>>>>> > atomicity some of the time". We say guarantee. From the CEP: >>>>>>>>> > >>>>>>>>> > > In the case of read repair, since we are only reading and >>>>>>>>> > > correcting the parts of a partition that we're reading and not >>>>>>>>> > > the entire contents of a partition on each read, read repair can >>>>>>>>> > > break our guarantee on partition level write atomicity. This >>>>>>>>> > > approach also prevents meeting the monotonic read requirement for >>>>>>>>> > > witness replicas, which has significantly limited its usefulness. >>>>>>>>> > >>>>>>>>> > I point this out because it's not well known, and we make a >>>>>>>>> > guarantee that isn't true, and while the CEP will reduce the number >>>>>>>>> > of cases in which we violate the guarantee, we will still have >>>>>>>>> > known edge cases that it doesn't hold up. So we should stop saying >>>>>>>>> > it. >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > On Wed, Jan 8, 2025 at 1:30 PM Blake Eggleston >>>>>>>>> > <beggles...@apple.com <mailto:beggles...@apple.com>> wrote: >>>>>>>>> > >>>>>>>>> > Thanks Dimitry and Jon, answers below >>>>>>>>> > >>>>>>>>> > 1) Is a single separate commit log expected to be created for all >>>>>>>>> > tables with the new replication type? >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > The plan is to still have a single commit log, but only index >>>>>>>>> > mutations with a mutation id. >>>>>>>>> > >>>>>>>>> > 2) What is a granularity of storing mutation ids in memtable, is it >>>>>>>>> > per cell? 
>>>>>>>>> > >>>>>>>>> > >>>>>>>>> > It would be per-partition >>>>>>>>> > >>>>>>>>> > 3) If we update the same row multiple times while it is in a >>>>>>>>> > memtable - are all mutation ids appended to a kind of collection? >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > They would yes. We might be able to do something where we stop >>>>>>>>> > tracking mutations that have been superseded by newer mutations >>>>>>>>> > (same cells, higher timestamps), but I suspect that would be more >>>>>>>>> > trouble than it's worth and would be out of scope for v1. >>>>>>>>> > >>>>>>>>> > 4) What is the expected size of a single id? >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > It's currently 12bytes, a 4 byte node id (from tcm), and an 8 byte >>>>>>>>> > hlc >>>>>>>>> > >>>>>>>>> > 5) Do we plan to support multi-table batches (single or >>>>>>>>> > multi-partition) for this replication type? >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > This is intended to support all existing features, however the >>>>>>>>> > tracking only happens at the mutation level, so the different >>>>>>>>> > mutations coming out of a multi-partition batch would all be >>>>>>>>> > tracked individually >>>>>>>>> > >>>>>>>>> > So even without repair mucking things up, we're unable to fulfill >>>>>>>>> > this promise except under the specific, ideal circumstance of >>>>>>>>> > querying a partition with only 1 page. >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > It's true that we can't offer multi-page write atomicity without >>>>>>>>> > some sort of MVCC. There are a lot of common query patterns that >>>>>>>>> > don't involve paging though, so it's not like the benefit of fixing >>>>>>>>> > write atomicity would only apply to a small subset of carefully >>>>>>>>> > crafted queries or something. >>>>>>>>> > >>>>>>>>> > Thanks, >>>>>>>>> > >>>>>>>>> > Blake >>>>>>>>> > >>>>>>>>> > On Jan 8, 2025, at 12:23 PM, Jon Haddad <j...@rustyrazorblade.com >>>>>>>>> > <mailto:j...@rustyrazorblade.com>> wrote: >>>>>>>>> > >>>>>>>>> > Very cool! I'll need to spent some time reading this over. One >>>>>>>>> > thing I did notice is this: >>>>>>>>> > >>>>>>>>> > > Cassandra promises partition level write atomicity. This means >>>>>>>>> > > that, although writes are eventually consistent, a given write >>>>>>>>> > > will either be visible or not visible. You're not supposed to see >>>>>>>>> > > a partially applied write. However, read repair and short read >>>>>>>>> > > protection can both "tear" mutations. In the case of read repair, >>>>>>>>> > > this is because the data resolver only evaluates the data >>>>>>>>> > > included in the client read. So if your read only covers a >>>>>>>>> > > portion of a write that didn't reach a quorum, only that portion >>>>>>>>> > > will be repaired, breaking write atomicity. >>>>>>>>> > >>>>>>>>> > Unfortunately there's more issues with this than just repair. >>>>>>>>> > Since we lack a consistency mechanism like MVCC while paginating, >>>>>>>>> > it's possible to do the following: >>>>>>>>> > >>>>>>>>> > thread A: reads a partition P with 10K rows, starts by reading the >>>>>>>>> > first page >>>>>>>>> > thread B: another thread writes a batch to 2 rows in partition P, >>>>>>>>> > one on page 1, another on page 2 >>>>>>>>> > thread A: reads the second page of P which has the mutation. >>>>>>>>> > >>>>>>>>> > I've worked with users who have been surprised by this behavior, >>>>>>>>> > because pagination happens transparently. 
>>>>>>>>> > >>>>>>>>> > So even without repair mucking things up, we're unable to fulfill >>>>>>>>> > this promise except under the specific, ideal circumstance of >>>>>>>>> > querying a partition with only 1 page. >>>>>>>>> > >>>>>>>>> > Jon >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > On Wed, Jan 8, 2025 at 11:21 AM Blake Eggleston >>>>>>>>> > <beggles...@apple.com <mailto:beggles...@apple.com>> wrote: >>>>>>>>> > >>>>>>>>> > Hello dev@, >>>>>>>>> > >>>>>>>>> > We'd like to propose CEP-45: Mutation Tracking for adoption by the >>>>>>>>> > community. CEP-45 proposes adding a replication mechanism to track >>>>>>>>> > and reconcile individual mutations, as well as processes to >>>>>>>>> > actively reconcile missing mutations. >>>>>>>>> > >>>>>>>>> > For keyspaces with mutation tracking enabled, the immediate >>>>>>>>> > benefits of this CEP are: >>>>>>>>> > * reduced replication lag with a continuous background >>>>>>>>> > reconciliation process >>>>>>>>> > * eliminate the disk load caused by repair merkle tree calculation >>>>>>>>> > * eliminate repair overstreaming >>>>>>>>> > * reduce disk load of reads on cluster to close to 1/CL >>>>>>>>> > * fix longstanding mutation atomicity issues caused by read repair >>>>>>>>> > and short read protection >>>>>>>>> > >>>>>>>>> > Additionally, although it's outside the scope of this CEP, mutation >>>>>>>>> > tracking would enable: >>>>>>>>> > * completion of witness replicas / transient replication, making >>>>>>>>> > the feature usable for all workloads >>>>>>>>> > * lightweight witness only datacenters >>>>>>>>> > >>>>>>>>> > The CEP is linked here: >>>>>>>>> > https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking, >>>>>>>>> > but please keep the discussion on the dev list. >>>>>>>>> > >>>>>>>>> > Thanks! >>>>>>>>> > >>>>>>>>> > Blake Eggleston >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> >>>>>>>>> >>>>>>>>> -- >>>>>>>>> http://twitter.com/tjake >>>>>>> >>>>> >>> >
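
(A closing illustration: earlier in the thread the mutation id is described as 12 bytes, a 4 byte node id from TCM plus an 8 byte HLC, with the HLC built from wall-clock milliseconds scaled to microseconds. The sketch below is one plausible Java encoding of that description; the class name, method names, and the exact HLC construction are assumptions for illustration, not taken from the CEP.)

    import java.nio.ByteBuffer;

    // Illustrative only: a 12-byte mutation id = 4-byte node id + 8-byte HLC.
    final class MutationId {
        final int nodeId; // per the thread: a 4-byte node id issued via TCM
        final long hlc;   // per the thread: an 8-byte hybrid logical clock value

        MutationId(int nodeId, long hlc) {
            this.nodeId = nodeId;
            this.hlc = hlc;
        }

        ByteBuffer serialize() {
            return (ByteBuffer) ByteBuffer.allocate(12).putInt(nodeId).putLong(hlc).flip();
        }

        static MutationId deserialize(ByteBuffer in) {
            return new MutationId(in.getInt(), in.getLong());
        }

        // Assumed construction, consistent with the restart reasoning earlier in the
        // thread: millis scaled to micros, bumped past the last issued value. A node
        // coordinating fewer than 1,000,000 writes a second and taking more than a
        // second to restart should then not re-issue an id from before the restart.
        static long nextHlc(long lastIssuedHlc) {
            return Math.max(lastIssuedHlc + 1, System.currentTimeMillis() * 1000L);
        }
    }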