That's an interesting idea. Basically, allow a window of uncertainty between 
the memtable and the log, and merge mutations within that window directly into 
the response. It sounds like something that could work.

I'll have to think about how not embedding mutation id info into the storage 
layer might interact with the sstable silo requirements we have for 
reconciliation (basically the same thing we do for incremental repair), but 
it's likely you could do something similar there as well.
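To make the idea concrete for myself, here's a rough sketch of the safe-point 
bookkeeping (all names are made up for illustration, nothing from an actual 
patch):

    import java.util.Collection;
    import java.util.NavigableMap;
    import java.util.TreeMap;

    // Hypothetical sketch: track a point in the log before which everything
    // is known to be visible in the memtable/sstables; anything after it may
    // or may not be, and gets merged (possibly redundantly) into the read
    // response.
    class UncertainWindowReader
    {
        // log entries keyed by HLC
        private final NavigableMap<Long, String> log = new TreeMap<>();
        private volatile long safePointHlc;

        Collection<String> uncertainTail()
        {
            // everything strictly after the safe point gets merged into the read
            return log.tailMap(safePointHlc, false).values();
        }

        void advanceSafePoint(long visibleThroughHlc)
        {
            // called once we know the memtable reflects everything up to this HLC
            safePointHlc = Math.max(safePointHlc, visibleThroughHlc);
        }
    }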

> On Jan 18, 2025, at 12:16 PM, Benedict <bened...@apache.org> wrote:
> 
> That’s great to hear; I had thought the goal for embedding this information 
> in sstables was that the log could be truncated. If not, is the below snippet 
> the main motivation?
> 
>> For the nodes returning data _and_ mutation ids, the data and mutation ids 
>> need to describe each other exactly. If the data returned is missing data 
>> the mutation ids say are there, or has data the mutation ids say aren't, 
>> you'll have a read correctness issue.
> 
> 
> If so, I don’t think this is really a problem and we should perhaps 
> reconsider. With the magic of LSM we can merge redundant information from the 
> log into our read response, so we only need to be sure we know a point in the 
> log before which data must be in memtables. Anything after that point might 
> or might not be, and can simply be merged into the read response (potentially 
> redundantly).
> 
> This would seem to fall neatly into the reconciliation read path anyway; we 
> are looking for any data in a (local or remote) journal that we haven’t 
> written to the data store yet. If it isn’t known to be durable at a majority 
> then we have to perform a distributed write of the mutation. It doesn’t seem 
> like we need to do anything particularly special? 
> 
> We can wait until we have the total set of mutations to merge, and then we 
> have our complete and consistent read response.
> 
>> wouldn't be a bad idea to write the most recent mutation id to a table every 
>> few seconds asynchronously
> 
> For Accord we will write reservation records in advance so we can guarantee 
> we don’t go backwards. That is, we will periodically declare a point, e.g. 10s 
> in the future, that on restart we will have to first let elapse if we’re 
> behind.
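> 
> Concretely, something shaped roughly like this (a sketch only, with made-up 
> names, not the actual Accord code):
> 
>     // Hypothetical sketch of HLC reservations: persist an upper bound we
>     // promise not to exceed, and on restart wait out any remaining window.
>     class HlcReservation
>     {
>         static final long RESERVE_AHEAD_MILLIS = 10_000; // e.g. 10s
> 
>         interface Journal { void writeDurably(long hlcUpperBoundMillis); }
> 
>         // periodically persisted before issuing HLCs near the current bound
>         long reserve(long nowMillis, Journal journal)
>         {
>             long reservedUpTo = nowMillis + RESERVE_AHEAD_MILLIS;
>             journal.writeDurably(reservedUpTo);
>             return reservedUpTo;
>         }
> 
>         // on restart: if the wall clock is behind the last reservation,
>         // let that window elapse so issued HLCs can never go backwards
>         void awaitReservation(long lastReservedUpTo) throws InterruptedException
>         {
>             long behindBy = lastReservedUpTo - System.currentTimeMillis();
>             if (behindBy > 0)
>                 Thread.sleep(behindBy);
>         }
>     }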
> 
>> On 18 Jan 2025, at 18:31, Blake Eggleston <beggles...@apple.com> wrote:
>> 
>> No, mutations are kept intact. If a node is missing a multi-table mutation, 
>> it will receive the entire mutation on reconciliation.
>> 
>> Regarding HLCs, I vaguely remember hearing about a Paxos outage maybe 9-10 
>> years ago that was related to a leap hour or leap second or something 
>> causing clocks to not behave as expected and ballots to be created slightly 
>> in the past. There may be some rare edge cases we're not thinking about, and 
>> it wouldn't be a bad idea to write the most recent mutation id to a table 
>> every few seconds asynchronously so we don't create a giant mess if we 
>> restart during them.
>> 
>>> On Jan 18, 2025, at 2:18 AM, Benedict <bened...@apache.org> wrote:
>>> 
>>> Does this approach potentially fail to guarantee multi-table atomicity? If 
>>> we’re reconciling mutation ids separately per table, an atomic batch write 
>>> might get reconciled for one table but not another? I know that atomic 
>>> batch updates on a single partition key to multiple tables are an important 
>>> property for some users (though read repair suffers this same problem - 
>>> but it would be a real shame not to close this gap while we’re fixing our 
>>> semantics, so we’re left only with paging isolation to contend with in 
>>> future).
>>> 
>>> Regarding unique HLCs Jon, before we go to prod in any cluster we’ll want 
>>> Accord to guarantee HLCs are unique, so we’ll probably have a journal 
>>> record reserve a batch of HLCs in advance, so we know what HLC it is safe 
>>> to reset to on restart. I’m sure this work can use the same feature, though 
>>> I agree with Blake it’s likely an unrealistic case in anything but 
>>> adversarial test scenarios.
>>> 
>>>> On 17 Jan 2025, at 22:52, Blake Eggleston <beggles...@apple.com> wrote:
>>>> 
>>>> 
>>>> Hi Jon, thanks for the excellent questions. Answers below.
>>>> 
>>>>> Write Path - for recovery, how does a node safely recover the highest 
>>>>> hybrid logical clock it has issued? Checking the last entry in the
>>>>> addressable log is insufficient unless we ensure every individual update 
>>>>> is durable, rather than batched/periodic. Something like leasing to an 
>>>>> upper bound could work.
>>>> 
>>>> It doesn't. We assume that the time it takes to restart will prevent 
>>>> issuing ids from the (logical) past. The HLC currently uses time in 
>>>> milliseconds, and multiplies that into microseconds. So as long as a given 
>>>> node is coordinating less than 1,000,000 writes a second and takes more 
>>>> than a second to start up, that shouldn't be possible.
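>>>> 
>>>> As a rough illustration of that scheme (just a sketch, names invented, not 
>>>> the prototype code):
>>>> 
>>>>     import java.util.concurrent.atomic.AtomicLong;
>>>> 
>>>>     // Hypothetical HLC sketch: millisecond wall clock widened to
>>>>     // microseconds, leaving ~1,000 logical increments per millisecond
>>>>     // before the clock runs ahead of real time.
>>>>     class HybridLogicalClock
>>>>     {
>>>>         private final AtomicLong last = new AtomicLong();
>>>> 
>>>>         long nextId()
>>>>         {
>>>>             long physicalMicros = System.currentTimeMillis() * 1000;
>>>>             // monotonic: never reissue an id or go backwards
>>>>             return last.updateAndGet(prev -> Math.max(physicalMicros, prev + 1));
>>>>         }
>>>>     }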
>>>> 
>>>>> SSTable Metadata - is this just a simple set of mutation ids, or do they 
>>>>> map to mutated partitions, or is it a multimap of partitions to mutation 
>>>>> id? (question is motivated by not understanding how they are used after 
>>>>> log truncation and during bootstrap).
>>>> 
>>>> It's basically a map of partition keys to a set of mutation ids that are 
>>>> represented by that sstable. Mutation ids can't belong to more than a 
>>>> single partition key per table, so no multimap. After full reconciliation 
>>>> / log truncation, the ids are not used and can be removed on compaction. 
>>>> The non-reconciled log truncation idea discussed in the CEP seems like it 
>>>> will go away in favor of partial/cohort reconciliations. They're included 
>>>> in the sstable in lieu of including a second log index mapping keys to 
>>>> mutation ids on the log, although they may have other uses, such as fixing 
>>>> mutation atomicity across pages.
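>>>> 
>>>> Roughly this shape, as a sketch (types are placeholders, not what the 
>>>> prototype actually uses):
>>>> 
>>>>     import java.util.HashMap;
>>>>     import java.util.HashSet;
>>>>     import java.util.Map;
>>>>     import java.util.Set;
>>>> 
>>>>     // Hypothetical sketch of the per-sstable metadata: partition key ->
>>>>     // mutation ids represented by that sstable for that key.
>>>>     class SSTableMutationIds
>>>>     {
>>>>         private final Map<String, Set<Long>> idsByPartition = new HashMap<>();
>>>> 
>>>>         void record(String partitionKey, long mutationId)
>>>>         {
>>>>             idsByPartition.computeIfAbsent(partitionKey, k -> new HashSet<>()).add(mutationId);
>>>>         }
>>>> 
>>>>         // on compaction, inputs are merged and fully reconciled ids are dropped
>>>>         void removeReconciled(Set<Long> reconciledIds)
>>>>         {
>>>>             idsByPartition.values().forEach(ids -> ids.removeAll(reconciledIds));
>>>>             idsByPartition.values().removeIf(Set::isEmpty);
>>>>         }
>>>>     }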
>>>> 
>>>> What's not stated explicitly in the CEP (since I only realized it once I 
>>>> started prototyping) is that embedding the mutation ids in the storage 
>>>> layer solves a concurrency issue on the read path. For the nodes returning 
>>>> data _and_ mutation ids, the data and mutation ids need to describe each 
>>>> other exactly. If the data returned is missing data the mutation ids say 
>>>> are there, or has data the mutation ids say aren't, you'll have a read 
>>>> correctness issue. Since appending to the commit log and updating the 
>>>> memtable aren't really synchronized from the perspective of read 
>>>> visibility, putting the ids in the memtable on write solves this issue 
>>>> without having to change how commit log / memtable concurrency works. 
>>>> Including the ids in the sstable isn't strictly necessary to fix the 
>>>> concurrency issue, but is convenient.
>>>> 
>>>>> Log Reconciliation - how is this scheduled within a replica group? Are 
>>>>> there any interactions/commonality with CEP-37 the unified repair 
>>>>> scheduler?
>>>> 
>>>> It's kind of hand-wavy at the moment, tbh. If CEP-37 meets our scheduling 
>>>> needs and is ready in time, it would be great to not have to reinvent it. 
>>>> However, the read path will be a lot more sensitive to unreconciled data 
>>>> than it is to unrepaired data, so the two systems may end up having 
>>>> different enough requirements that we have to do something separate.
>>>> 
>>>>> Cohort reconciliation
>>>>> - Are the cohorts ad-hoc for each partial reconciliation, are there 
>>>>> restrictions about how many cohorts an instance belongs to (one at a 
>>>>> time)? What determines the membership of a cohort, is it agreed as part 
>>>>> of running the partial reconciliation? Other members of the cohort may be 
>>>>> able to see a different subset of the nodes e.g. network misconfiguration 
>>>>> with three DCs where one DC is missing routing to another.
>>>>> - I assume the cohort reconciliation id is reused for subsequent partial 
>>>>> reconciliations only if the cohort members remain the same.
>>>> 
>>>> Cohorts are basically the nodes that can talk to each other. Cohort 
>>>> reconciliation has the same sort of mutation lower-bound logic as full 
>>>> reconciliation, so a given node/range combo can only belong to a single 
>>>> cohort at a time, and that's determined as part of the reconciliation 
>>>> setup process. The cohort id is reused for subsequent partial 
>>>> reconciliations so long as the members remain the same. This lets us 
>>>> compact data from the cohort together.
>>>> 
>>>>> - Are the reconciled mutations in the addressable log rewritten under the 
>>>>> cohort reconciliation id, or is the reference to them updated?
>>>> 
>>>> So for basic mutation tracking, the log entries are removed and you're 
>>>> left with an sstable silo, like pending repairs. For instance, in cases 
>>>> where you have a node down for a week, you don't want to accumulate a 
>>>> week's worth of data and a week's worth of logs. In the future, for 
>>>> witnesses where you don't have sstables, it's less clear. Maybe it will be 
>>>> better to keep a week's worth of logs around, maybe it will be better to 
>>>> periodically materialize the cohort log data into sstables.
>>>> 
>>>>> - When the partition heals, if you process a read request that contains a 
>>>>> cohort reconciliation id, is there a risk that you have to transfer large 
>>>>> amounts of data before you can process, or does the addressable log allow 
>>>>> filtering by partition?
>>>> 
>>>> Yeah, that's a risk. We could probably determine during the read that a 
>>>> given cohort does not contain the key being read, but if it does, you'll 
>>>> have to wait. The reads themselves shouldn't be initiating reconciliation 
>>>> for the cohorts though; nodes will start exchanging cohort data as soon as 
>>>> they're able to connect to a previously unreachable node. I think read 
>>>> speculation will help here, and we may also be able to do something where 
>>>> we pull in just the data we need for the read to minimize impact on 
>>>> availability while maintaining read monotonicity.
>>>> 
>>>>> - Should the code be structured so that cohort reconciliations are the 
>>>>> expected case, with an optimization that, if all replicas are in the 
>>>>> cohort, they can bump the lower id?
>>>> 
>>>> That's not a bad idea; both processes will have a lot in common.
>>>> 
>>>>> - Are cohort ids issued the same way as regular mutation ids issued by a 
>>>>> single host (the initiator?) or do they have a different structure?
>>>> 
>>>> 
>>>> I'm not sure; I'd kind of assumed we'd just call UUID.randomUUID every 
>>>> time the cohort changed.
>>>> 
>>>>> Log truncation - if log truncation occurs and mutations come from 
>>>>> sstables, will the mutations be the actual logged mutation (seems 
>>>>> unlikely), or will Cassandra have to construct pseudo-mutations that 
>>>>> represent the current state in the sstable? If so, would the inclusion of 
>>>>> later mutations after the mutation/cohort id in that partition cause any 
>>>>> issues with reconciliation? (I see there's a hint about this in the 
>>>>> bootstrap section below)
>>>> 
>>>> So the log truncation stuff will probably go away in favor of cohort 
>>>> reconciliation. The idea though was that yeah, you'd have a sort of pseudo 
>>>> multi-mutation (assuming there are multiple ids represented) created from 
>>>> the sstable partition. Inclusion of later mutations shouldn't cause any 
>>>> problems. Everything should be commutative so long as we're not purging 
>>>> tombstones (which we won't if the data isn't fully reconciled).
>>>> 
>>>>> Repair - Will sstables still be split into repaired/pending/unrepaired? 
>>>>> Preserving that would make it possible to switch between strategies,
>>>>> it doesn't seem that complex, but maybe I'm missing something.
>>>> 
>>>> 
>>>> Yes, keeping that consistent and easy to migrate is a goal.
>>>> 
>>>>> Bootstrap/topology changes - what about RF changes. I don't think TCM 
>>>>> currently handles that. Would it need to be added to make mutation 
>>>>> tracking work? Where would the metadata be stored to indicate preferred 
>>>>> sources for missing mutations? Would that also extend to nodes that have 
>>>>> had to perform log truncation?
>>>> 
>>>> 
>>>> That's a really good question; I hadn't thought of that. It would be nice 
>>>> if RF changes got the same pending/streaming treatment that token range 
>>>> changes did. Not sure how difficult it would be to add that for at least 
>>>> tables that are using mutation tracking. Using the normal add/repair 
>>>> workflow we do now would probably be workable though, and would have the 
>>>> advantage of the coordinator being able to detect and exclude nodes that 
>>>> haven't received data for their new ranges.
>>>> 
>>>>> Compaction - how are the mutation ids in sstable metadata handled when 
>>>>> multiple sstables are compacted, particularly with something like
>>>>> range aware writers or when splitting the output over multiple 
>>>>> size-bounded sstables.  A simple union could expand the number
>>>>> of sstables to consider after log truncation.
>>>> 
>>>> On compaction the ids for a partition would be merged, but ids that have 
>>>> been reconciled are also removed. I'm not sure if we split partitions 
>>>> across multiple sstables on compaction though. I suppose it's possible, 
>>>> though I don't know if it would have an impact if the log truncation part 
>>>> of the CEP ends up going away.
>>>> 
>>>> Thanks,
>>>> 
>>>> Blake
>>>> 
>>>>> On Jan 17, 2025, at 9:27 AM, Jon Meredith <jonmered...@apache.org> wrote:
>>>>> 
>>>>> I had another read through for the CEP and had some follow up 
>>>>> questions/thoughts.
>>>>> 
>>>>> Write Path - for recovery, how does a node safely recover the highest 
>>>>> hybrid logical clock it has issued? Checking the last entry in the
>>>>> addressable log is insufficient unless we ensure every individual update 
>>>>> is durable, rather than batched/periodic. Something like leasing to an 
>>>>> upper bound could work.
>>>>> 
>>>>> SSTable Metadata - is this just a simple set of mutation ids, or do they 
>>>>> map to mutated partitions, or is it a multimap of partitions to mutation 
>>>>> id? (question is motivated by not understanding how they are used after 
>>>>> log truncation and during bootstrap).
>>>>> 
>>>>> Log Reconciliation - how is this scheduled within a replica group? Are 
>>>>> there any interactions/commonality with CEP-37 the unified repair 
>>>>> scheduler?
>>>>> 
>>>>> Cohort reconciliation
>>>>> - Are the cohorts ad-hoc for each partial reconciliation, are there 
>>>>> restrictions about how many cohorts an instance belongs to (one at a 
>>>>> time)? What determines the membership of a cohort, is it agreed as part 
>>>>> of running the partial reconciliation? Other members of the cohort may be 
>>>>> able to see a different subset of the nodes e.g. network misconfiguration 
>>>>> with three DCs where one DC is missing routing to another.
>>>>> - I assume the cohort reconciliation id is reused for subsequent partial 
>>>>> reconciliations only if the cohort members remain the same.
>>>>> - Are the reconciled mutations in the addressable log rewritten under the 
>>>>> cohort reconciliation id, or is the reference to them updated?
>>>>> - When the partition heals, if you process a read request that contains a 
>>>>> cohort reconciliation id, is there a risk that you have to transfer large 
>>>>> amounts of data before you can process, or does the addressable log allow 
>>>>> filtering by partition?
>>>>> - Should the code be structured so that cohort reconciliations are the 
>>>>> expected case, with an optimization that, if all replicas are in the 
>>>>> cohort, they can bump the lower id?
>>>>> - Are cohort ids issued the same way as regular mutation ids issued by a 
>>>>> single host (the initiator?) or do they have a different structure?
>>>>> 
>>>>> Log truncation - if log truncation occurs and mutations come from 
>>>>> sstables, will the mutations be the actual logged mutation (seems 
>>>>> unlikely), or will Cassandra have to construct pseudo-mutations that 
>>>>> represent the current state in the sstable? If so, would the inclusion of 
>>>>> later mutations after the mutation/cohort id in that partition cause any 
>>>>> issues with reconciliation? (I see there's a hint about this in the 
>>>>> bootstrap section below)
>>>>> 
>>>>> Repair - Will sstables still be split into repaired/pending/unrepaired? 
>>>>> Preserving that would make it possible to switch between strategies,
>>>>> it doesn't seem that complex, but maybe I'm missing something.
>>>>> 
>>>>> Bootstrap/topology changes - what about RF changes. I don't think TCM 
>>>>> currently handles that. Would it need to be added to make mutation 
>>>>> tracking work? Where would the metadata be stored to indicate preferred 
>>>>> sources for missing mutations? Would that also extend to nodes that have 
>>>>> had to perform log truncation?
>>>>> 
>>>>> Additional concerns
>>>>> 
>>>>> Compaction - how are the mutation ids in sstable metadata handled when 
>>>>> multiple sstables are compacted, particularly with something like
>>>>> range aware writers or when splitting the output over multiple 
>>>>> size-bounded sstables.  A simple union could expand the number
>>>>> of sstables to consider after log truncation.
>>>>> 
>>>>> Thanks!
>>>>> Jon
>>>>> 
>>>>> On Thu, Jan 16, 2025 at 11:51 AM Blake Eggleston <beggles...@apple.com 
>>>>> <mailto:beggles...@apple.com>> wrote:
>>>>>> I’m not sure, Josh. Jon brought up paging and the documentation around it 
>>>>>> because our docs say we provide mutation level atomicity, but we also 
>>>>>> provide drivers that page transparently. So from the user’s perspective, 
>>>>>> a single “query” breaks this guarantee unpredictably. Occasional 
>>>>>> exceptions with a clear message explaining what is happening, why, and 
>>>>>> how to fix it are going to be less confusing than tracking down 
>>>>>> application misbehavior caused by this.
>>>>>> 
>>>>>> It would also be easy to make the time horizon for paging constant and 
>>>>>> configurable (keep at least 20 minutes of logs, for instance), which 
>>>>>> would at least provide a floor of predictability.
>>>>>> 
>>>>>>> On Jan 16, 2025, at 10:08 AM, Josh McKenzie <jmcken...@apache.org 
>>>>>>> <mailto:jmcken...@apache.org>> wrote:
>>>>>>> 
>>>>>>>> The other issue is that there isn’t a time bound on the paging 
>>>>>>>> payload, so if the application is taking long enough between pages 
>>>>>>>> that the log has been truncated, we’d have to throw an exception.
>>>>>>> My hot-take is that this relationship between how long you're taking to 
>>>>>>> page, how much data you're processing / getting back, and ingest / 
>>>>>>> flushing frequency, all combining to produce unpredictable exceptions, 
>>>>>>> would be a bad default from a UX perspective compared to a default of 
>>>>>>> "a single page of data has atomicity; multiple pages do not". Maybe 
>>>>>>> it's just because that's been our default for so long.
>>>>>>> 
>>>>>>> The simplicity of having a flag that's "don't make my pages atomic and 
>>>>>>> they always return vs. make my pages atomic and throw exceptions if the 
>>>>>>> metadata I need is yoinked while I page" is pretty attractive to me.
>>>>>>> 
>>>>>>> Really interesting thought, using these logs as "partial MVCC" while 
>>>>>>> they're available specifically for what could/should be a very tight 
>>>>>>> timeline use-case (paging).
>>>>>>> 
>>>>>>> On Thu, Jan 16, 2025, at 12:41 PM, Jake Luciani wrote:
>>>>>>>> This is very cool!
>>>>>>>> 
>>>>>>>> I have done a POC that was similar but more akin to Aurora paper
>>>>>>>> whereby the commitlog itself would repair itself from peers
>>>>>>>> proactively using the seekable commitlog.
>>>>>>>> 
>>>>>>>> Can you explain the reason you prefer to reconcile on read?  Having a
>>>>>>>> consistent commitlog would solve so many problems like CDC, PITR, MVs
>>>>>>>> etc.
>>>>>>>> 
>>>>>>>> Jake
>>>>>>>> 
>>>>>>>> On Thu, Jan 16, 2025 at 12:13 PM Blake Eggleston <beggles...@apple.com 
>>>>>>>> <mailto:beggles...@apple.com>> wrote:
>>>>>>>> >
>>>>>>>> > I’ve been thinking about the paging atomicity issue. I think it 
>>>>>>>> > could be fixed with mutation tracking and without having to support 
>>>>>>>> > full on MVCC.
>>>>>>>> >
>>>>>>>> > When we reach a page boundary, we can send the highest mutation id 
>>>>>>>> > we’ve seen for the partition we reached the paging boundary on. When 
>>>>>>>> > we request another page, we send that high water mark back as part 
>>>>>>>> > of the paging request.
>>>>>>>> >
>>>>>>>> > Each sstable and memtable contributing to the read responses will 
>>>>>>>> > know which mutations it has in each partition, so if we encounter 
>>>>>>>> > one that has a higher id than we saw in the last page, we 
>>>>>>>> > reconstitute its data from mutations in the log, excluding the newer 
>>>>>>>> > mutations, or exclude it entirely if it only has newer mutations.
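>>>>>>>> >
>>>>>>>> > In rough pseudo-Java (purely illustrative; ids simplified to longs 
>>>>>>>> > ordered by HLC, names invented):
>>>>>>>> >
>>>>>>>> >     import java.util.Set;
>>>>>>>> >
>>>>>>>> >     // Hypothetical sketch of the page-boundary filter described above.
>>>>>>>> >     class PagingHighWaterMark
>>>>>>>> >     {
>>>>>>>> >         enum Source { USE_AS_IS, RECONSTITUTE_FROM_LOG, SKIP }
>>>>>>>> >
>>>>>>>> >         // decide how to treat one memtable/sstable for the partition,
>>>>>>>> >         // given the highest mutation id seen on the previous page
>>>>>>>> >         static Source classify(Set<Long> idsInSource, long highWaterMark)
>>>>>>>> >         {
>>>>>>>> >             boolean hasNewer = idsInSource.stream().anyMatch(id -> id > highWaterMark);
>>>>>>>> >             boolean hasOlder = idsInSource.stream().anyMatch(id -> id <= highWaterMark);
>>>>>>>> >             if (!hasNewer) return Source.USE_AS_IS;      // nothing newer than last page
>>>>>>>> >             if (!hasOlder) return Source.SKIP;           // only newer mutations
>>>>>>>> >             return Source.RECONSTITUTE_FROM_LOG;         // mixed: rebuild, excluding newer ids
>>>>>>>> >         }
>>>>>>>> >     }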
>>>>>>>> >
>>>>>>>> > This isn’t free of course. When paging through large partitions, 
>>>>>>>> > each page request becomes more likely to encounter mutations it 
>>>>>>>> > needs to exclude, and it’s unclear how expensive that will be. 
>>>>>>>> > Obviously it’s more expensive to reconstitute vs read, but on the 
>>>>>>>> > other hand, only a single replica will be reading any data, so on 
>>>>>>>> > balance it would still probably be less work for the cluster than 
>>>>>>>> > running the normal read path.
>>>>>>>> >
>>>>>>>> > The other issue is that there isn’t a time bound on the paging 
>>>>>>>> > payload, so if the application is taking long enough between pages 
>>>>>>>> > that the log has been truncated, we’d have to throw an exception.
>>>>>>>> >
>>>>>>>> > This is mostly just me brainstorming though, and wouldn’t be 
>>>>>>>> > something that would be in a v1.
>>>>>>>> >
>>>>>>>> > On Jan 9, 2025, at 2:07 PM, Blake Eggleston <beggles...@apple.com 
>>>>>>>> > <mailto:beggles...@apple.com>> wrote:
>>>>>>>> >
>>>>>>>> > So the ids themselves are in the memtable and are accessible as soon 
>>>>>>>> > as they’re written, and need to be for the read path to work.
>>>>>>>> >
>>>>>>>> > We’re not able to reconcile the ids until we can guarantee that they 
>>>>>>>> > won’t be merged with unreconciled data, that’s why they’re flushed 
>>>>>>>> > before reconciliation.
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > On Jan 9, 2025, at 10:53 AM, Josh McKenzie <jmcken...@apache.org 
>>>>>>>> > <mailto:jmcken...@apache.org>> wrote:
>>>>>>>> >
>>>>>>>> > We also can't remove mutation ids until they've been reconciled, so 
>>>>>>>> > in the simplest implementation, we'd need to flush a memtable before 
>>>>>>>> > reconciling, and there would never be a situation where you have 
>>>>>>>> > purgeable mutation ids in the memtable.
>>>>>>>> >
>>>>>>>> > Got it. So effectively that data would be unreconcilable until such 
>>>>>>>> > time as it was flushed and you had those id's to work with in the 
>>>>>>>> > sstable metadata, and the process can force a flush to reconcile in 
>>>>>>>> > those cases where you have mutations in the MT/CL combo that are 
>>>>>>>> > transiently not subject to the reconciliation process due to that 
>>>>>>>> > log being purged. Or you flush before purging the log, assuming 
>>>>>>>> > we're not changing MT data structures to store id (don't recall if 
>>>>>>>> > that's specified in the CEP...)
>>>>>>>> >
>>>>>>>> > Am I grokking that?
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > On Thu, Jan 9, 2025, at 1:49 PM, Blake Eggleston wrote:
>>>>>>>> >
>>>>>>>> > Hi Josh,
>>>>>>>> >
>>>>>>>> > You can think of reconciliation as analogous to incremental repair. 
>>>>>>>> > Like incremental repair, you can't mix reconciled/unreconciled data 
>>>>>>>> > without causing problems. We also can't remove mutation ids until 
>>>>>>>> > they've been reconciled, so in the simplest implementation, we'd 
>>>>>>>> > need to flush a memtable before reconciling, and there would never 
>>>>>>>> > be a situation where you have purgeable mutation ids in the memtable.
>>>>>>>> >
>>>>>>>> > The production version of this will be more sophisticated about how 
>>>>>>>> > it keeps this data separate so it can reliably support automatic 
>>>>>>>> > reconciliation cadences that are higher than what you can do with 
>>>>>>>> > incremental repair today, but that’s the short answer.
>>>>>>>> >
>>>>>>>> > It's also likely that the concept of log truncation will be removed 
>>>>>>>> > in favor of going straight to cohort reconciliation in longer 
>>>>>>>> > outages.
>>>>>>>> >
>>>>>>>> > Thanks,
>>>>>>>> >
>>>>>>>> > Blake
>>>>>>>> >
>>>>>>>> > On Jan 9, 2025, at 8:27 AM, Josh McKenzie <jmcken...@apache.org 
>>>>>>>> > <mailto:jmcken...@apache.org>> wrote:
>>>>>>>> >
>>>>>>>> > Question re: Log Truncation (emphasis mine):
>>>>>>>> >
>>>>>>>> > When the cluster is operating normally, log entries can be 
>>>>>>>> > discarded once they are older than the last reconciliation time of 
>>>>>>>> > their respective ranges. To prevent unbounded log growth during 
>>>>>>>> > outages however, logs are still deleted once they reach some 
>>>>>>>> > configurable amount of time (maybe 2 hours by default?). From here, 
>>>>>>>> > all reconciliation processes behave the same as before, but they use 
>>>>>>>> > mutation ids stored in sstable metadata for listing mutation ids and 
>>>>>>>> > transmitting missing partitions.
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > What happens when / if we have data living in a memtable past that 
>>>>>>>> > time threshold that hasn't yet been flushed to an sstable? e.g. a low 
>>>>>>>> > velocity table or a really tightly configured "purge my mutation 
>>>>>>>> > reconciliation logs at time bound X".
>>>>>>>> >
>>>>>>>> > On Thu, Jan 9, 2025, at 10:07 AM, Chris Lohfink wrote:
>>>>>>>> >
>>>>>>>> > Is this something we can disable? I can see scenarios where this 
>>>>>>>> > would be strictly and severely worse than existing scenarios where 
>>>>>>>> > we don't need repairs, e.g. short time window data, millions of 
>>>>>>>> > writes a second that get thrown out after a few hours. If that data 
>>>>>>>> > is small partitions we are nearly doubling the disk use for things 
>>>>>>>> > we don't care about.
>>>>>>>> >
>>>>>>>> > Chris
>>>>>>>> >
>>>>>>>> > On Wed, Jan 8, 2025 at 9:01 PM guo Maxwell <cclive1...@gmail.com 
>>>>>>>> > <mailto:cclive1...@gmail.com>> wrote:
>>>>>>>> >
>>>>>>>> > After a brief read-through, I have 2 questions. If I ask 
>>>>>>>> > something inappropriate, please feel free to correct me:
>>>>>>>> >
>>>>>>>> > 1. Does it support changing a table that did not previously use 
>>>>>>>> > mutation tracking to enable it through ALTER TABLE?
>>>>>>>> > 2.
>>>>>>>> >
>>>>>>>> > Available options for tables are keyspace, legacy, and logged, with 
>>>>>>>> > the default being keyspace, which inherits the keyspace setting
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > Do you think that keyspace_inherit (or another keyword that clearly 
>>>>>>>> > explains the behavior) would be better than the name keyspace?
>>>>>>>> > In addition, is legacy appropriate? Because this is a new feature, 
>>>>>>>> > there is only the behavior of turning it on and off, and turning it 
>>>>>>>> > off means not using this feature.
>>>>>>>> > If the keyword legacy is used, from the user's perspective it sounds 
>>>>>>>> > as though they are using an old version of mutation tracking, similar 
>>>>>>>> > to the relationship between SAI and native2i.
>>>>>>>> >
>>>>>>>> > Jon Haddad <j...@rustyrazorblade.com 
>>>>>>>> > <mailto:j...@rustyrazorblade.com>> wrote on Thu, Jan 9, 2025 at 06:14:
>>>>>>>> >
>>>>>>>> > JD, the fact that pagination is implemented as multiple queries is a 
>>>>>>>> > design choice.  A user performs a query with fetch size 1 or 100 and 
>>>>>>>> > they will get different behavior.
>>>>>>>> >
>>>>>>>> > I'm not asking for anyone to implement MVCC.  I'm asking for the 
>>>>>>>> > docs around this to be correct.  We should not use the term 
>>>>>>>> > guarantee here, it's best effort.
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > On Wed, Jan 8, 2025 at 2:06 PM J. D. Jordan 
>>>>>>>> > <jeremiah.jor...@gmail.com <mailto:jeremiah.jor...@gmail.com>> wrote:
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > Your pagination case is not a violation of any guarantees Cassandra 
>>>>>>>> > makes. It has never made guarantees across multiple queries.
>>>>>>>> > Trying to have MVCC/consistent data across multiple queries is a 
>>>>>>>> > very different issue/problem from this CEP.  If you want to have a 
>>>>>>>> > discussion about MVCC I suggest creating a new thread.
>>>>>>>> >
>>>>>>>> > -Jeremiah
>>>>>>>> >
>>>>>>>> > On Jan 8, 2025, at 3:47 PM, Jon Haddad <j...@rustyrazorblade.com 
>>>>>>>> > <mailto:j...@rustyrazorblade.com>> wrote:
>>>>>>>> >
>>>>>>>> > 
>>>>>>>> > > It's true that we can't offer multi-page write atomicity without 
>>>>>>>> > > some sort of MVCC. There are a lot of common query patterns that 
>>>>>>>> > > don't involve paging though, so it's not like the benefit of 
>>>>>>>> > > fixing write atomicity would only apply to a small subset of 
>>>>>>>> > > carefully crafted queries or something.
>>>>>>>> >
>>>>>>>> > Sure, it'll work a lot, but we don't say "partition level write 
>>>>>>>> > atomicity some of the time".  We say guarantee.  From the CEP:
>>>>>>>> >
>>>>>>>> > > In the case of read repair, since we are only reading and 
>>>>>>>> > > correcting the parts of a partition that we're reading and not the 
>>>>>>>> > > entire contents of a partition on each read, read repair can break 
>>>>>>>> > > our guarantee on partition level write atomicity. This approach 
>>>>>>>> > > also prevents meeting the monotonic read requirement for witness 
>>>>>>>> > > replicas, which has significantly limited its usefulness.
>>>>>>>> >
>>>>>>>> > I point this out because it's not well known, and we make a 
>>>>>>>> > guarantee that isn't true, and while the CEP will reduce the number 
>>>>>>>> > of cases in which we violate the guarantee, we will still have known 
>>>>>>>> > edge cases where it doesn't hold up. So we should stop saying it.
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > On Wed, Jan 8, 2025 at 1:30 PM Blake Eggleston <beggles...@apple.com 
>>>>>>>> > <mailto:beggles...@apple.com>> wrote:
>>>>>>>> >
>>>>>>>> > Thanks Dimitry and Jon, answers below
>>>>>>>> >
>>>>>>>> > 1) Is a single separate commit log expected to be created for all 
>>>>>>>> > tables with the new replication type?
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > The plan is to still have a single commit log, but only index 
>>>>>>>> > mutations with a mutation id.
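>>>>>>>> >
>>>>>>>> > i.e. something like this, as a sketch (names are invented, not the 
>>>>>>>> > actual design):
>>>>>>>> >
>>>>>>>> >     import java.util.concurrent.ConcurrentSkipListMap;
>>>>>>>> >
>>>>>>>> >     // Hypothetical sketch: the existing commit log stays as-is; tracked
>>>>>>>> >     // mutations additionally get an index entry pointing back into it.
>>>>>>>> >     class AddressableLogIndex
>>>>>>>> >     {
>>>>>>>> >         record Position(long segmentId, int offset) {}
>>>>>>>> >
>>>>>>>> >         // mutation id (simplified here to a long HLC) -> commit log position
>>>>>>>> >         private final ConcurrentSkipListMap<Long, Position> index = new ConcurrentSkipListMap<>();
>>>>>>>> >
>>>>>>>> >         void onAppend(long mutationId, long segmentId, int offset)
>>>>>>>> >         {
>>>>>>>> >             // only mutations written with tracking enabled are indexed
>>>>>>>> >             index.put(mutationId, new Position(segmentId, offset));
>>>>>>>> >         }
>>>>>>>> >
>>>>>>>> >         Position lookup(long mutationId)
>>>>>>>> >         {
>>>>>>>> >             return index.get(mutationId);
>>>>>>>> >         }
>>>>>>>> >     }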
>>>>>>>> >
>>>>>>>> > 2) What is a granularity of storing mutation ids in memtable, is it 
>>>>>>>> > per cell?
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > It would be per-partition
>>>>>>>> >
>>>>>>>> > 3) If we update the same row multiple times while it is in a 
>>>>>>>> > memtable - are all mutation ids appended to a kind of collection?
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > They would, yes. We might be able to do something where we stop 
>>>>>>>> > tracking mutations that have been superseded by newer mutations 
>>>>>>>> > (same cells, higher timestamps), but I suspect that would be more 
>>>>>>>> > trouble than it's worth and would be out of scope for v1.
>>>>>>>> >
>>>>>>>> > 4) What is the expected size of a single id?
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > It's currently 12 bytes: a 4-byte node id (from TCM) and an 8-byte 
>>>>>>>> > HLC.
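>>>>>>>> >
>>>>>>>> > Roughly (a layout sketch only; the type and field names are made up):
>>>>>>>> >
>>>>>>>> >     import java.nio.ByteBuffer;
>>>>>>>> >
>>>>>>>> >     // Hypothetical sketch of the 12-byte mutation id layout.
>>>>>>>> >     record MutationId(int nodeId, long hlc)
>>>>>>>> >     {
>>>>>>>> >         static final int SERIALIZED_SIZE = Integer.BYTES + Long.BYTES; // 12 bytes
>>>>>>>> >
>>>>>>>> >         ByteBuffer serialize()
>>>>>>>> >         {
>>>>>>>> >             return ByteBuffer.allocate(SERIALIZED_SIZE)
>>>>>>>> >                              .putInt(nodeId)  // 4-byte node id from TCM
>>>>>>>> >                              .putLong(hlc)    // 8-byte hybrid logical clock
>>>>>>>> >                              .flip();
>>>>>>>> >         }
>>>>>>>> >     }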
>>>>>>>> >
>>>>>>>> > 5) Do we plan to support multi-table batches (single or 
>>>>>>>> > multi-partition) for this replication type?
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > This is intended to support all existing features; however, the 
>>>>>>>> > tracking only happens at the mutation level, so the different 
>>>>>>>> > mutations coming out of a multi-partition batch would all be tracked 
>>>>>>>> > individually.
>>>>>>>> >
>>>>>>>> > So even without repair mucking things up, we're unable to fulfill 
>>>>>>>> > this promise except under the specific, ideal circumstance of 
>>>>>>>> > querying a partition with only 1 page.
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > It's true that we can't offer multi-page write atomicity without 
>>>>>>>> > some sort of MVCC. There are a lot of common query patterns that 
>>>>>>>> > don't involve paging though, so it's not like the benefit of fixing 
>>>>>>>> > write atomicity would only apply to a small subset of carefully 
>>>>>>>> > crafted queries or something.
>>>>>>>> >
>>>>>>>> > Thanks,
>>>>>>>> >
>>>>>>>> > Blake
>>>>>>>> >
>>>>>>>> > On Jan 8, 2025, at 12:23 PM, Jon Haddad <j...@rustyrazorblade.com 
>>>>>>>> > <mailto:j...@rustyrazorblade.com>> wrote:
>>>>>>>> >
>>>>>>>> > Very cool!  I'll need to spend some time reading this over.  One 
>>>>>>>> > thing I did notice is this:
>>>>>>>> >
>>>>>>>> > > Cassandra promises partition level write atomicity. This means 
>>>>>>>> > > that, although writes are eventually consistent, a given write 
>>>>>>>> > > will either be visible or not visible. You're not supposed to see 
>>>>>>>> > > a partially applied write. However, read repair and short read 
>>>>>>>> > > protection can both "tear" mutations. In the case of read repair, 
>>>>>>>> > > this is because the data resolver only evaluates the data included 
>>>>>>>> > > in the client read. So if your read only covers a portion of a 
>>>>>>>> > > write that didn't reach a quorum, only that portion will be 
>>>>>>>> > > repaired, breaking write atomicity.
>>>>>>>> >
>>>>>>>> > Unfortunately there are more issues with this than just repair.  Since 
>>>>>>>> > we lack a consistency mechanism like MVCC while paginating, it's 
>>>>>>>> > possible to do the following:
>>>>>>>> >
>>>>>>>> > thread A: reads a partition P with 10K rows, starts by reading the 
>>>>>>>> > first page
>>>>>>>> > thread B: another thread writes a batch to 2 rows in partition P, 
>>>>>>>> > one on page 1, another on page 2
>>>>>>>> > thread A: reads the second page of P which has the mutation.
>>>>>>>> >
>>>>>>>> > I've worked with users who have been surprised by this behavior, 
>>>>>>>> > because pagination happens transparently.
>>>>>>>> >
>>>>>>>> > So even without repair mucking things up, we're unable to fulfill 
>>>>>>>> > this promise except under the specific, ideal circumstance of 
>>>>>>>> > querying a partition with only 1 page.
>>>>>>>> >
>>>>>>>> > Jon
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > On Wed, Jan 8, 2025 at 11:21 AM Blake Eggleston 
>>>>>>>> > <beggles...@apple.com <mailto:beggles...@apple.com>> wrote:
>>>>>>>> >
>>>>>>>> > Hello dev@,
>>>>>>>> >
>>>>>>>> > We'd like to propose CEP-45: Mutation Tracking for adoption by the 
>>>>>>>> > community. CEP-45 proposes adding a replication mechanism to track 
>>>>>>>> > and reconcile individual mutations, as well as processes to actively 
>>>>>>>> > reconcile missing mutations.
>>>>>>>> >
>>>>>>>> > For keyspaces with mutation tracking enabled, the immediate benefits 
>>>>>>>> > of this CEP are:
>>>>>>>> > * reduced replication lag with a continuous background 
>>>>>>>> > reconciliation process
>>>>>>>> > * eliminate the disk load caused by repair merkle tree calculation
>>>>>>>> > * eliminate repair overstreaming
>>>>>>>> > * reduce disk load of reads on cluster to close to 1/CL
>>>>>>>> > * fix longstanding mutation atomicity issues caused by read repair 
>>>>>>>> > and short read protection
>>>>>>>> >
>>>>>>>> > Additionally, although it's outside the scope of this CEP, mutation 
>>>>>>>> > tracking would enable:
>>>>>>>> > * completion of witness replicas / transient replication, making the 
>>>>>>>> > feature usable for all workloads
>>>>>>>> > * lightweight witness only datacenters
>>>>>>>> >
>>>>>>>> > The CEP is linked here: 
>>>>>>>> > https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking,
>>>>>>>> >  but please keep the discussion on the dev list.
>>>>>>>> >
>>>>>>>> > Thanks!
>>>>>>>> >
>>>>>>>> > Blake Eggleston
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> 
>>>>>>>> 
>>>>>>>> -- 
>>>>>>>> http://twitter.com/tjake
>>>>>> 
>>>> 
>> 
