Hi dev@,

Looks like it's been about 10 days since the last message here. Are there any other comments before I put it up for a vote?
Thanks, Blake > On Jan 18, 2025, at 12:33 PM, Blake Eggleston <beggles...@apple.com> wrote: > > That's an interesting idea. Basically allow for a window of uncertainty > between the memtable and log and merge mutations within that window directly > into the response. It sounds like something that could work. > > I'll have to think about how not embedding id info into the storage layer > might interact with the sstable silo requirements we have for reconciliation > (basically the same thing we do for incremental repair), but it's likely you > could do something similar there as well. > >> On Jan 18, 2025, at 12:16 PM, Benedict <bened...@apache.org> wrote: >> >> That’s great to hear, I had thought the goal for embedding this information >> in sstables was that the log could be truncated. If not, is the below >> snippet the main motivation? >> >>> For the nodes returning data _and_ mutation ids, the data and mutation ids >>> need to describe each other exactly. If the data returned is missing data >>> the mutation ids say are there, or has data the mutation ids say aren't, >>> you'll have a read correctness issue. >> >> >> If so, I don’t think this is really a problem and we should perhaps >> reconsider. With the magic of LSM we can merge redundant information from >> the log into our read response, so we only need to be sure we know a point >> in the log before which data must be in memtables. Anything after that point >> might or might not be, and can simply be merged into the read response >> (potentially redundantly). >> >> This would seem to fall neatly into the reconciliation read path anyway; we >> are looking for any data in a (local or remote) journal that we haven’t >> written to the data store yet. If it isn’t known to be durable at a majority >> then we have to perform a distributed write of the mutation. It doesn’t seem >> like we need to do anything particularly special? >> >> We can wait until we have the total set of mutations to merge and then we >> have our complete and consistent read response >> >>> wouldn't be a bad idea to write most recent mutation id to a table every >>> few seconds asynchronously >> >> For accord we will write reservation records in advance so we can guarantee >> we don’t go backwards. That is, we will periodically declare a point eg 10s >> in the future that on restart we will have to first let elapse if we’re >> behind. >> >>> On 18 Jan 2025, at 18:31, Blake Eggleston <beggles...@apple.com> wrote: >>> >>> No, mutations are kept intact. If a node is missing a multi-table >>> mutation, it will receive the entire mutation on reconciliation. >>> >>> Regarding HLCs, I vaguely remember hearing about a paxos outage maybe 9-10 >>> years ago that was related to a leap hour or leap second or something >>> causing clocks to not behave as expected and ballots to be created slightly >>> in the past. There may be some rare edge cases we're not thinking about and >>> it wouldn't be a bad idea to write most recent mutation id to a table every >>> few seconds asynchronously so we don't create a giant mess if we restart >>> during them. >>> >>>> On Jan 18, 2025, at 2:18 AM, Benedict <bened...@apache.org> wrote: >>>> >>>> Does this approach potentially fail to guarantee multi table atomicity? If >>>> we’re reconciling mutation ids separately per table, an atomic batch write >>>> might get reconciled for one table but not another? 
I know that atomic >>>> batch updates on a single partition key to multiple tables is an important >>>> property for some users (though, read repair suffers this same problem - >>>> but it would be a real shame not to close this gap while we’re fixing our >>>> semantics, so we’re left only with paging isolation to contend with in >>>> future) >>>> >>>> Regarding unique HLCs Jon, before we go to prod in any cluster we’ll want >>>> Accord to guarantee HLCs are unique, so we’ll probably have a journal >>>> record reserve a batch of HLCs in advance, so we know what HLC it is safe >>>> to reset to on restart. I’m sure this work can use the same feature, >>>> though I agree with Blake it’s likely an unrealistic case in anything but >>>> adversarial test scenarios. >>>> >>>>> On 17 Jan 2025, at 22:52, Blake Eggleston <beggles...@apple.com> wrote: >>>>> >>>>> >>>>> Hi Jon, thanks for the excellent questions, answers below >>>>> >>>>>> Write Path - for recovery, how does a node safely recover the highest >>>>>> hybrid logical clock it has issued? Checking the last entry in the >>>>>> addressable log is insufficient unless we ensure every individual update >>>>>> is durable, rather than batched/periodic. Something like leasing to an >>>>>> upper bound could work. >>>>> >>>>> It doesn't. We assume that the time it takes to restart will prevent >>>>> issuing ids from the (logical) past. The HLC currently uses time in >>>>> milliseconds, and multiplies that into microseconds. So as long as a >>>>> given node is coordinating less than 1,000,000 writes a second and takes >>>>> more than a second to startup, that shouldn't be possible. >>>>> >>>>>> SSTable Metadata - is this just a simple set of mutation ids, or do they >>>>>> map to mutated partitions, or is it a multimap of partitions to mutation >>>>>> id? (question is motivated by not understanding how they are used after >>>>>> log truncation and during bootstrap). >>>>> >>>>> It's basically a map of partition keys to a set of mutation ids that are >>>>> represented by that sstable. Mutation ids can't belong to more than a >>>>> single partition key per table, so no multimap. After full reconciliation >>>>> / log truncation, the ids are not used and can be removed on compaction. >>>>> The non-reconciled log truncation idea discussed in the CEP seems like it >>>>> will go away in favor of partial/cohort reconciliations. They're included >>>>> in the sstable in lieu of including a second log index mapping keys to >>>>> mutations ids on the log, although it may have other uses, such as fixing >>>>> mutation atomicity across pages. >>>>> >>>>> What's not stated explicitly in the CEP (since I only realized it once I >>>>> started prototyping) is that embedding the mutation ids in the storage >>>>> layer solves a concurrency issue on the read path. For the nodes >>>>> returning data _and_ mutation ids, the data and mutation ids need to >>>>> describe each other exactly. If the data returned is missing data the >>>>> mutation ids say are there, or has data the mutation ids say aren't, >>>>> you'll have a read correctness issue. Since appending to the commit log >>>>> and updating the memtable aren't really synchronized from the perspective >>>>> of read visibility, putting the ids in the memtable on write solves this >>>>> issue while preventing having to change how commit log / memtable >>>>> concurrency works. Including the ids in the sstable isn't strictly >>>>> necessary to fix the concurrency issue, but is convenient. 
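(An illustrative aside on the answer above: below is a minimal Java sketch of the idea that a memtable entry keeps a partition's rows and the ids of the mutations that produced them in a single snapshot, so a read can never observe data without the matching ids or vice versa. All class, field and method names are made up for illustration and are not from the CEP or the Cassandra codebase; mutation ids are simplified to longs.)

    import java.util.*;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative only: a partition's rows and the ids of the mutations that
    // produced them live in one immutable snapshot, replaced atomically on write.
    final class TrackedPartition {
        final NavigableMap<String, String> rows; // clustering -> value (simplified)
        final Set<Long> mutationIds;             // ids represented by this data

        TrackedPartition(NavigableMap<String, String> rows, Set<Long> mutationIds) {
            this.rows = rows;
            this.mutationIds = mutationIds;
        }

        // Applying a write produces a new snapshot containing both the new rows
        // and the new mutation id, so readers see the pair together or not at all.
        TrackedPartition apply(long mutationId, Map<String, String> update) {
            NavigableMap<String, String> newRows = new TreeMap<>(rows);
            newRows.putAll(update);
            Set<Long> newIds = new HashSet<>(mutationIds);
            newIds.add(mutationId);
            return new TrackedPartition(Collections.unmodifiableNavigableMap(newRows),
                                        Collections.unmodifiableSet(newIds));
        }
    }

    // Illustrative only: per-partition lookup; flushing this structure is what would
    // carry the partition key -> mutation id set map into sstable metadata.
    final class TrackedMemtable {
        private final ConcurrentHashMap<String, TrackedPartition> partitions = new ConcurrentHashMap<>();

        void write(String partitionKey, long mutationId, Map<String, String> update) {
            partitions.compute(partitionKey, (k, existing) -> {
                TrackedPartition base = existing != null
                        ? existing
                        : new TrackedPartition(Collections.emptyNavigableMap(), Collections.emptySet());
                return base.apply(mutationId, update);
            });
        }

        // A read returns one snapshot, so the data and the mutation ids it reports
        // always describe each other exactly.
        TrackedPartition read(String partitionKey) {
            return partitions.get(partitionKey);
        }
    }
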
>>>>> >>>>>> Log Reconciliation - how is this scheduled within a replica group? Are >>>>>> there any interactions/commonality with CEP-37 the unified repair >>>>>> scheduler? >>>>> >>>>> It's kind of hand wavy at the moment tbh. If CEP-37 meets our scheduling >>>>> needs and is ready in time, it would be great to not have to reinvent it. >>>>> However, the read path will be a lot more sensitive to unreconciled data >>>>> that it is to unrepaired data, so the 2 systems may end up having >>>>> different enough requirements that we have to do something separate. >>>>> >>>>>> Cohort reconciliation >>>>>> - Are the cohorts ad-hoc for each partial reconciliation, are there >>>>>> restrictions about how many cohorts an instance belongs to (one at a >>>>>> time)? What determines the membership of a cohort, is it agreed as part >>>>>> of running the partial reconciliation? Other members of the cohort may >>>>>> be able to see a different subset of the nodes e.g. network >>>>>> misconfiguration with three DCs where one DC is missing routing to >>>>>> another. >>>>>> - I assume the cohort reconciliation id is reused for subsequent partial >>>>>> reconciliations only if the cohort members remain the same. >>>>> >>>>> Cohorts are basically the nodes that can talk to each other. The cohort >>>>> reconciliation has the same sort of mutation low bound logic as full >>>>> reconciliations, so a given node/range combo can only belong to a single >>>>> cohort at a time, and that's determined as part of the reconciliation >>>>> setup process. The cohort id is reused for subsequent partial >>>>> reconciliations so long as the members remain the same. This lets us >>>>> compact data from the cohort together. >>>>> >>>>>> - Are the reconciled mutations in the addressable log rewritten under >>>>>> the cohort reconciliation id, or is the reference to them updated? >>>>> >>>>> So for basic mutation tracking, the log entries are removed and you're >>>>> left with an sstable silo, like pending repairs. For instance, in cases >>>>> where you have a node down for a week, you don't want to accumulate a >>>>> weeks worth of data and a weeks worth of logs. In the future, for >>>>> witnesses where you don't have sstables, it's less clear. Maybe it will >>>>> be better to keep a weeks worth of logs around, maybe it will be better >>>>> to periodically materialize the cohort log data into sstables. >>>>> >>>>>> - When the partition heals, if you process a read request that contains >>>>>> a cohort reconciliation id, is there a risk that you have to transfer >>>>>> large amounts of data before you can process, or does the addressable >>>>>> log allow filtering by partition? >>>>> >>>>> Yeah that's a risk. We could probably determine during the read that a >>>>> given cohort does not contain a key for a read, but if it does, you'll >>>>> have to wait. The reads themselves shouldn't be initiating reconciliation >>>>> for the cohorts though, nodes will start exchanging cohort data as soon >>>>> as they're able to connect to a previously unreachable node. I think read >>>>> speculation will help here, and we may also be able to do something where >>>>> we pull in just the data we need to the read to minimize impact on >>>>> availability while maintaining read monotonicity. >>>>> >>>>>> - Should the code be structured that cohort reconciliations are the >>>>>> expected case, and as an optimization if all replicas are in the cohort, >>>>>> then they >>>>>> can bump the lower id. 
>>>>> >>>>> That's not a bad idea, both processes will have a lot in common. >>>>> >>>>>> - Are cohort ids issued the same way as regular mutation ids issued by a >>>>>> single host (the initiator?) or do they have a different structure? >>>>> >>>>> >>>>> I'm not sure, I'd kind of assumed we'd just call UUID.randomUUID >>>>> everytime the cohort changed. >>>>> >>>>>> Log truncation - if log truncation occurs and mutations come from >>>>>> sstables, will the mutations be the actual logged mutation (seems >>>>>> unlikely), or will Cassandra have to construct pseudo-mutations that >>>>>> represent the current state in the sstable? If so, would the inclusion >>>>>> of later mutations after the mutation/cohort id in that partition cause >>>>>> any issues with reconciliation? (I see there's a hint about this in the >>>>>> bootstrap section below) >>>>> >>>>> So the log truncation stuff will probably go away in favor of cohort >>>>> reconciliation. The idea though was that yeah, you'd have a sort of >>>>> pseudo multi-mutation (assuming there are multiple ids represented) >>>>> created from the sstable partition. Inclusion of later mutations >>>>> shouldn't cause any problems. Everything should be commutative so long as >>>>> we're not purging tombstones (which we won't if the data isn't fully >>>>> reconciled). >>>>> >>>>>> Repair - Will sstables still be split into repaired/pending/unrepaired? >>>>>> Preserving that would make it possible to switch between strategies, >>>>>> it doesn't seem that complex, but maybe I'm missing something. >>>>> >>>>> >>>>> Yes, keeping that consistent and easy to migrate is a goal. >>>>> >>>>>> Bootstrap/topology changes - what about RF changes. I don't think TCM >>>>>> currently handles that. Would it need to be added to make mutation >>>>>> tracking work? Where would the metadata be stored to indicate preferred >>>>>> sources for missing mutations? Would that also extend to nodes that have >>>>>> had to perform log truncation? >>>>> >>>>> >>>>> That's a really good question, I hadn't thought of that. It would be nice >>>>> if RF changes got the same pending/streaming treatment that token range >>>>> changes did. Not sure how difficult it would be to add that for at least >>>>> tables that are using mutation tracking. Using the normal add/repair >>>>> workflow we do now would probably be workable though, and would have the >>>>> advantage of the coordinator being able to detect and exclude nodes that >>>>> haven't received data for their new ranges though. >>>>> >>>>>> Compaction - how are the mutation ids in sstable metadata handled when >>>>>> multiple sstables are compacted, particularly with something like >>>>>> range aware writers or when splitting the output over multiple >>>>>> size-bounded sstables. A simple union could expand the number >>>>>> of sstables to consider after log truncation. >>>>> >>>>> On compaction the ids for a partition would be merged, but ids that have >>>>> been reconciled are also removed. I'm not sure if we split partitions >>>>> across multiple sstables on compaction though. I suppose it's possible, >>>>> though I don't know if it would have an impact if the log truncation part >>>>> of the CEP ends up going away. >>>>> >>>>> Thanks, >>>>> >>>>> Blake >>>>> >>>>>> On Jan 17, 2025, at 9:27 AM, Jon Meredith <jonmered...@apache.org> wrote: >>>>>> >>>>>> I had another read through for the CEP and had some follow up >>>>>> questions/thoughts. 
>>>>>> >>>>>> Write Path - for recovery, how does a node safely recover the highest >>>>>> hybrid logical clock it has issued? Checking the last entry in the >>>>>> addressable log is insufficient unless we ensure every individual update >>>>>> is durable, rather than batched/periodic. Something like leasing to an >>>>>> upper bound could work. >>>>>> >>>>>> SSTable Metadata - is this just a simple set of mutation ids, or do they >>>>>> map to mutated partitions, or is it a multimap of partitions to mutation >>>>>> id? (question is motivated by not understanding how they are used after >>>>>> log truncation and during bootstrap). >>>>>> >>>>>> Log Reconciliation - how is this scheduled within a replica group? Are >>>>>> there any interactions/commonality with CEP-37 the unified repair >>>>>> scheduler? >>>>>> >>>>>> Cohort reconciliation >>>>>> - Are the cohorts ad-hoc for each partial reconciliation, are there >>>>>> restrictions about how many cohorts an instance belongs to (one at a >>>>>> time)? What determines the membership of a cohort, is it agreed as part >>>>>> of running the partial reconciliation? Other members of the cohort may >>>>>> be able to see a different subset of the nodes e.g. network >>>>>> misconfiguration with three DCs where one DC is missing routing to >>>>>> another. >>>>>> - I assume the cohort reconciliation id is reused for subsequent partial >>>>>> reconciliations only if the cohort members remain the same. >>>>>> - Are the reconciled mutations in the addressable log rewritten under >>>>>> the cohort reconciliation id, or is the reference to them updated? >>>>>> - When the partition heals, if you process a read request that contains >>>>>> a cohort reconciliation id, is there a risk that you have to transfer >>>>>> large amounts of data before you can process, or does the addressable >>>>>> log allow filtering by partition? >>>>>> - Should the code be structured that cohort reconciliations are the >>>>>> expected case, and as an optimization if all replicas are in the cohort, >>>>>> then they >>>>>> can bump the lower id. >>>>>> - Are cohort ids issued the same way as regular mutation ids issued by a >>>>>> single host (the initiator?) or do they have a different structure? >>>>>> >>>>>> Log truncation - if log truncation occurs and mutations come from >>>>>> sstables, will the mutations be the actual logged mutation (seems >>>>>> unlikely), or will Cassandra have to construct pseudo-mutations that >>>>>> represent the current state in the sstable? If so, would the inclusion >>>>>> of later mutations after the mutation/cohort id in that partition cause >>>>>> any issues with reconciliation? (I see there's a hint about this in the >>>>>> bootstrap section below) >>>>>> >>>>>> Repair - Will sstables still be split into repaired/pending/unrepaired? >>>>>> Preserving that would make it possible to switch between strategies, >>>>>> it doesn't seem that complex, but maybe I'm missing something. >>>>>> >>>>>> Bootstrap/topology changes - what about RF changes. I don't think TCM >>>>>> currently handles that. Would it need to be added to make mutation >>>>>> tracking work? Where would the metadata be stored to indicate preferred >>>>>> sources for missing mutations? Would that also extend to nodes that have >>>>>> had to perform log truncation? 
>>>>>> >>>>>> Additional concerns >>>>>> >>>>>> Compaction - how are the mutation ids in sstable metadata handled when >>>>>> multiple sstables are compacted, particularly with something like >>>>>> range aware writers or when splitting the output over multiple >>>>>> size-bounded sstables. A simple union could expand the number >>>>>> of sstables to consider after log truncation. >>>>>> >>>>>> Thanks! >>>>>> Jon >>>>>> >>>>>> On Thu, Jan 16, 2025 at 11:51 AM Blake Eggleston <beggles...@apple.com >>>>>> <mailto:beggles...@apple.com>> wrote: >>>>>>> I’m not sure Josh. Jon brought up paging and the documentation around >>>>>>> it because our docs say we provide mutation level atomicity, but we >>>>>>> also provide drivers that page transparently. So from the user’s >>>>>>> perspective, a single “query” breaks this guarantee unpredictably. >>>>>>> Occasional exceptions with a clear message explaining what is >>>>>>> happening, why, and how to fix it is going to be less confusing that >>>>>>> tracking down application misbehavior caused by this. >>>>>>> >>>>>>> It would also be easy to make the time horizon for paging constant and >>>>>>> configurable (keep at least 20 minutes of logs, for instance), that >>>>>>> would at least provide a floor of predictability. >>>>>>> >>>>>>>> On Jan 16, 2025, at 10:08 AM, Josh McKenzie <jmcken...@apache.org >>>>>>>> <mailto:jmcken...@apache.org>> wrote: >>>>>>>> >>>>>>>>> The other issue is that there isn’t a time bound on the paging >>>>>>>>> payload, so if the application is taking long enough between pages >>>>>>>>> that the log has been truncated, we’d have to throw an exception. >>>>>>>> My hot-take is that this relationship between how long you're taking >>>>>>>> to page, how much data you're processing / getting back, and ingest / >>>>>>>> flushing frequency all combined leading to unpredictable exceptions >>>>>>>> would be a bad default from a UX perspective compared to a default of >>>>>>>> "a single page of data has atomicity; multiple pages do not". Maybe >>>>>>>> it's just because that's been our default for so long. >>>>>>>> >>>>>>>> The simplicity of having a flag that's "don't make my pages atomic and >>>>>>>> they always return vs. make my pages atomic and throw exceptions if >>>>>>>> the metadata I need is yoinked while I page" is pretty attractive to >>>>>>>> me. >>>>>>>> >>>>>>>> Really interesting thought, using these logs as "partial MVCC" while >>>>>>>> they're available specifically for what could/should be a very tight >>>>>>>> timeline use-case (paging). >>>>>>>> >>>>>>>> On Thu, Jan 16, 2025, at 12:41 PM, Jake Luciani wrote: >>>>>>>>> This is very cool! >>>>>>>>> >>>>>>>>> I have done a POC that was similar but more akin to Aurora paper >>>>>>>>> whereby the commitlog itself would repair itself from peers >>>>>>>>> proactively using the seekable commitlog. >>>>>>>>> >>>>>>>>> Can you explain the reason you prefer to reconcile on read? Having a >>>>>>>>> consistent commitlog would solve so many problems like CDC, PITR, MVs >>>>>>>>> etc. >>>>>>>>> >>>>>>>>> Jake >>>>>>>>> >>>>>>>>> On Thu, Jan 16, 2025 at 12:13 PM Blake Eggleston >>>>>>>>> <beggles...@apple.com <mailto:beggles...@apple.com>> wrote: >>>>>>>>> > >>>>>>>>> > I’ve been thinking about the paging atomicity issue. I think it >>>>>>>>> > could be fixed with mutation tracking and without having to support >>>>>>>>> > full on MVCC. 
>>>>>>>>> > >>>>>>>>> > When we reach a page boundary, we can send the highest mutation id >>>>>>>>> > we’ve seen for the partition we reached the paging boundary on. >>>>>>>>> > When we request another page, we send that high water mark back as >>>>>>>>> > part of the paging request. >>>>>>>>> > >>>>>>>>> > Each sstable and memtable contributing to the read responses will >>>>>>>>> > know which mutations it has in each partition, so if we encounter >>>>>>>>> > one that has a higher id than we saw in the last page, we >>>>>>>>> > reconstitute its data from mutations in the log, excluding the >>>>>>>>> > newer mutations., or exclude it entirely if it only has newer >>>>>>>>> > mutations. >>>>>>>>> > >>>>>>>>> > This isn’t free of course. When paging through large partitions, >>>>>>>>> > each page request becomes more likely to encounter mutations it >>>>>>>>> > needs to exclude, and it’s unclear how expensive that will be. >>>>>>>>> > Obviously it’s more expensive to reconstitute vs read, but on the >>>>>>>>> > other hand, only a single replica will be reading any data, so on >>>>>>>>> > balance it would still probably be less work for the cluster than >>>>>>>>> > running the normal read path. >>>>>>>>> > >>>>>>>>> > The other issue is that there isn’t a time bound on the paging >>>>>>>>> > payload, so if the application is taking long enough between pages >>>>>>>>> > that the log has been truncated, we’d have to throw an exception. >>>>>>>>> > >>>>>>>>> > This is mostly just me brainstorming though, and wouldn’t be >>>>>>>>> > something that would be in a v1. >>>>>>>>> > >>>>>>>>> > On Jan 9, 2025, at 2:07 PM, Blake Eggleston <beggles...@apple.com >>>>>>>>> > <mailto:beggles...@apple.com>> wrote: >>>>>>>>> > >>>>>>>>> > So the ids themselves are in the memtable and are accessible as >>>>>>>>> > soon as they’re written, and need to be for the read path to work. >>>>>>>>> > >>>>>>>>> > We’re not able to reconcile the ids until we can guarantee that >>>>>>>>> > they won’t be merged with unreconciled data, that’s why they’re >>>>>>>>> > flushed before reconciliation. >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > On Jan 9, 2025, at 10:53 AM, Josh McKenzie <jmcken...@apache.org >>>>>>>>> > <mailto:jmcken...@apache.org>> wrote: >>>>>>>>> > >>>>>>>>> > We also can't remove mutation ids until they've been reconciled, so >>>>>>>>> > in the simplest implementation, we'd need to flush a memtable >>>>>>>>> > before reconciling, and there would never be a situation where you >>>>>>>>> > have purgeable mutation ids in the memtable. >>>>>>>>> > >>>>>>>>> > Got it. So effectively that data would be unreconcilable until such >>>>>>>>> > time as it was flushed and you had those id's to work with in the >>>>>>>>> > sstable metadata, and the process can force a flush to reconcile in >>>>>>>>> > those cases where you have mutations in the MT/CL combo that are >>>>>>>>> > transiently not subject to the reconciliation process due to that >>>>>>>>> > log being purged. Or you flush before purging the log, assuming >>>>>>>>> > we're not changing MT data structures to store id (don't recall if >>>>>>>>> > that's specified in the CEP...) >>>>>>>>> > >>>>>>>>> > Am I grokking that? >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > On Thu, Jan 9, 2025, at 1:49 PM, Blake Eggleston wrote: >>>>>>>>> > >>>>>>>>> > Hi Josh, >>>>>>>>> > >>>>>>>>> > You can think of reconciliation as analogous to incremental repair. >>>>>>>>> > Like incremental repair, you can't mix reconciled/unreconciled data >>>>>>>>> > without causing problem. 
We also can't remove mutation ids until >>>>>>>>> > they've been reconciled, so in the simplest implementation, we'd >>>>>>>>> > need to flush a memtable before reconciling, and there would never >>>>>>>>> > be a situation where you have purgeable mutation ids in the >>>>>>>>> > memtable. >>>>>>>>> > >>>>>>>>> > The production version of this will be more sophisticated about how >>>>>>>>> > it keeps this data separate to it can reliably support automatic >>>>>>>>> > reconciliation cadences that are higher than what you can do with >>>>>>>>> > incremental repair today, but that’s the short answer. >>>>>>>>> > >>>>>>>>> > It's also likely that the concept of log truncation will be removed >>>>>>>>> > in favor of going straight to cohort reconciliation in longer >>>>>>>>> > outages. >>>>>>>>> > >>>>>>>>> > Thanks, >>>>>>>>> > >>>>>>>>> > Blake >>>>>>>>> > >>>>>>>>> > On Jan 9, 2025, at 8:27 AM, Josh McKenzie <jmcken...@apache.org >>>>>>>>> > <mailto:jmcken...@apache.org>> wrote: >>>>>>>>> > >>>>>>>>> > Question re: Log Truncation (emphasis mine): >>>>>>>>> > >>>>>>>>> > When the cluster is operating normally, logs entries can be >>>>>>>>> > discarded once they are older than the last reconciliation time of >>>>>>>>> > their respective ranges. To prevent unbounded log growth during >>>>>>>>> > outages however, logs are still deleted once they reach some >>>>>>>>> > configurable amount of time (maybe 2 hours by default?). From here, >>>>>>>>> > all reconciliation processes behave the same as before, but they >>>>>>>>> > use mutation ids stored in sstable metadata for listing mutation >>>>>>>>> > ids and transmit missing partitions. >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > What happens when / if we have data living in a memtable past that >>>>>>>>> > time threshold that hasn't yet been flushed to an sstable? i.e. low >>>>>>>>> > velocity table or a really tightly configured "purge my mutation >>>>>>>>> > reconciliation logs at time bound X". >>>>>>>>> > >>>>>>>>> > On Thu, Jan 9, 2025, at 10:07 AM, Chris Lohfink wrote: >>>>>>>>> > >>>>>>>>> > Is this something we can disable? I can see scenarios where this >>>>>>>>> > would be strictly and severely worse then existing scenarios where >>>>>>>>> > we don't need repairs. ie short time window data, millions of >>>>>>>>> > writes a second that get thrown out after a few hours. If that data >>>>>>>>> > is small partitions we are nearly doubling the disk use for things >>>>>>>>> > we don't care about. >>>>>>>>> > >>>>>>>>> > Chris >>>>>>>>> > >>>>>>>>> > On Wed, Jan 8, 2025 at 9:01 PM guo Maxwell <cclive1...@gmail.com >>>>>>>>> > <mailto:cclive1...@gmail.com>> wrote: >>>>>>>>> > >>>>>>>>> > After a brief understanding, there are 2 questions from me, If I >>>>>>>>> > ask something inappropriate, please feel free to correct me : >>>>>>>>> > >>>>>>>>> > 1、 Does it support changing the table to support mutation tracking >>>>>>>>> > through ALTER TABLE if it does not support mutation tracking before? >>>>>>>>> > 2、 >>>>>>>>> > >>>>>>>>> > Available options for tables are keyspace, legacy, and logged, with >>>>>>>>> > the default being keyspace, which inherits the keyspace setting >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > Do you think that keyspace_inherit (or other keywords that clearly >>>>>>>>> > explain the behavior ) is better than name keyspace ? >>>>>>>>> > In addition, is legacy appropriate? Because this is a new feature, >>>>>>>>> > there is only the behavior of turning it on and off. Turning it off >>>>>>>>> > means not using this feature. 
>>>>>>>>> > If the keyword legacy is used, from the user's perspective, is it >>>>>>>>> > using an old version of the mutation tracking? Similar to the >>>>>>>>> > relationship between SAI and native2i. >>>>>>>>> > >>>>>>>>> > Jon Haddad <j...@rustyrazorblade.com >>>>>>>>> > <mailto:j...@rustyrazorblade.com>> 于2025年1月9日周四 06:14写道: >>>>>>>>> > >>>>>>>>> > JD, the fact that pagination is implemented as multiple queries is >>>>>>>>> > a design choice. A user performs a query with fetch size 1 or 100 >>>>>>>>> > and they will get different behavior. >>>>>>>>> > >>>>>>>>> > I'm not asking for anyone to implement MVCC. I'm asking for the >>>>>>>>> > docs around this to be correct. We should not use the term >>>>>>>>> > guarantee here, it's best effort. >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > On Wed, Jan 8, 2025 at 2:06 PM J. D. Jordan >>>>>>>>> > <jeremiah.jor...@gmail.com <mailto:jeremiah.jor...@gmail.com>> >>>>>>>>> > wrote: >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > Your pagination case is not a violation of any guarantees Cassandra >>>>>>>>> > makes. It has never made guarantees across multiple queries. >>>>>>>>> > Trying to have MVCC/consistent data across multiple queries is a >>>>>>>>> > very different issue/problem from this CEP. If you want to have a >>>>>>>>> > discussion about MVCC I suggest creating a new thread. >>>>>>>>> > >>>>>>>>> > -Jeremiah >>>>>>>>> > >>>>>>>>> > On Jan 8, 2025, at 3:47 PM, Jon Haddad <j...@rustyrazorblade.com >>>>>>>>> > <mailto:j...@rustyrazorblade.com>> wrote: >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > > It's true that we can't offer multi-page write atomicity without >>>>>>>>> > > some sort of MVCC. There are a lot of common query patterns that >>>>>>>>> > > don't involve paging though, so it's not like the benefit of >>>>>>>>> > > fixing write atomicity would only apply to a small subset of >>>>>>>>> > > carefully crafted queries or something. >>>>>>>>> > >>>>>>>>> > Sure, it'll work a lot, but we don't say "partition level write >>>>>>>>> > atomicity some of the time". We say guarantee. From the CEP: >>>>>>>>> > >>>>>>>>> > > In the case of read repair, since we are only reading and >>>>>>>>> > > correcting the parts of a partition that we're reading and not >>>>>>>>> > > the entire contents of a partition on each read, read repair can >>>>>>>>> > > break our guarantee on partition level write atomicity. This >>>>>>>>> > > approach also prevents meeting the monotonic read requirement for >>>>>>>>> > > witness replicas, which has significantly limited its usefulness. >>>>>>>>> > >>>>>>>>> > I point this out because it's not well known, and we make a >>>>>>>>> > guarantee that isn't true, and while the CEP will reduce the number >>>>>>>>> > of cases in which we violate the guarantee, we will still have >>>>>>>>> > known edge cases that it doesn't hold up. So we should stop saying >>>>>>>>> > it. >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > On Wed, Jan 8, 2025 at 1:30 PM Blake Eggleston >>>>>>>>> > <beggles...@apple.com <mailto:beggles...@apple.com>> wrote: >>>>>>>>> > >>>>>>>>> > Thanks Dimitry and Jon, answers below >>>>>>>>> > >>>>>>>>> > 1) Is a single separate commit log expected to be created for all >>>>>>>>> > tables with the new replication type? >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > The plan is to still have a single commit log, but only index >>>>>>>>> > mutations with a mutation id. >>>>>>>>> > >>>>>>>>> > 2) What is a granularity of storing mutation ids in memtable, is it >>>>>>>>> > per cell? 
>>>>>>>>> > >>>>>>>>> > >>>>>>>>> > It would be per-partition >>>>>>>>> > >>>>>>>>> > 3) If we update the same row multiple times while it is in a >>>>>>>>> > memtable - are all mutation ids appended to a kind of collection? >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > They would yes. We might be able to do something where we stop >>>>>>>>> > tracking mutations that have been superseded by newer mutations >>>>>>>>> > (same cells, higher timestamps), but I suspect that would be more >>>>>>>>> > trouble than it's worth and would be out of scope for v1. >>>>>>>>> > >>>>>>>>> > 4) What is the expected size of a single id? >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > It's currently 12bytes, a 4 byte node id (from tcm), and an 8 byte >>>>>>>>> > hlc >>>>>>>>> > >>>>>>>>> > 5) Do we plan to support multi-table batches (single or >>>>>>>>> > multi-partition) for this replication type? >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > This is intended to support all existing features, however the >>>>>>>>> > tracking only happens at the mutation level, so the different >>>>>>>>> > mutations coming out of a multi-partition batch would all be >>>>>>>>> > tracked individually >>>>>>>>> > >>>>>>>>> > So even without repair mucking things up, we're unable to fulfill >>>>>>>>> > this promise except under the specific, ideal circumstance of >>>>>>>>> > querying a partition with only 1 page. >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > It's true that we can't offer multi-page write atomicity without >>>>>>>>> > some sort of MVCC. There are a lot of common query patterns that >>>>>>>>> > don't involve paging though, so it's not like the benefit of fixing >>>>>>>>> > write atomicity would only apply to a small subset of carefully >>>>>>>>> > crafted queries or something. >>>>>>>>> > >>>>>>>>> > Thanks, >>>>>>>>> > >>>>>>>>> > Blake >>>>>>>>> > >>>>>>>>> > On Jan 8, 2025, at 12:23 PM, Jon Haddad <j...@rustyrazorblade.com >>>>>>>>> > <mailto:j...@rustyrazorblade.com>> wrote: >>>>>>>>> > >>>>>>>>> > Very cool! I'll need to spent some time reading this over. One >>>>>>>>> > thing I did notice is this: >>>>>>>>> > >>>>>>>>> > > Cassandra promises partition level write atomicity. This means >>>>>>>>> > > that, although writes are eventually consistent, a given write >>>>>>>>> > > will either be visible or not visible. You're not supposed to see >>>>>>>>> > > a partially applied write. However, read repair and short read >>>>>>>>> > > protection can both "tear" mutations. In the case of read repair, >>>>>>>>> > > this is because the data resolver only evaluates the data >>>>>>>>> > > included in the client read. So if your read only covers a >>>>>>>>> > > portion of a write that didn't reach a quorum, only that portion >>>>>>>>> > > will be repaired, breaking write atomicity. >>>>>>>>> > >>>>>>>>> > Unfortunately there's more issues with this than just repair. >>>>>>>>> > Since we lack a consistency mechanism like MVCC while paginating, >>>>>>>>> > it's possible to do the following: >>>>>>>>> > >>>>>>>>> > thread A: reads a partition P with 10K rows, starts by reading the >>>>>>>>> > first page >>>>>>>>> > thread B: another thread writes a batch to 2 rows in partition P, >>>>>>>>> > one on page 1, another on page 2 >>>>>>>>> > thread A: reads the second page of P which has the mutation. >>>>>>>>> > >>>>>>>>> > I've worked with users who have been surprised by this behavior, >>>>>>>>> > because pagination happens transparently. 
>>>>>>>>> > >>>>>>>>> > So even without repair mucking things up, we're unable to fulfill >>>>>>>>> > this promise except under the specific, ideal circumstance of >>>>>>>>> > querying a partition with only 1 page. >>>>>>>>> > >>>>>>>>> > Jon >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > On Wed, Jan 8, 2025 at 11:21 AM Blake Eggleston >>>>>>>>> > <beggles...@apple.com <mailto:beggles...@apple.com>> wrote: >>>>>>>>> > >>>>>>>>> > Hello dev@, >>>>>>>>> > >>>>>>>>> > We'd like to propose CEP-45: Mutation Tracking for adoption by the >>>>>>>>> > community. CEP-45 proposes adding a replication mechanism to track >>>>>>>>> > and reconcile individual mutations, as well as processes to >>>>>>>>> > actively reconcile missing mutations. >>>>>>>>> > >>>>>>>>> > For keyspaces with mutation tracking enabled, the immediate >>>>>>>>> > benefits of this CEP are: >>>>>>>>> > * reduced replication lag with a continuous background >>>>>>>>> > reconciliation process >>>>>>>>> > * eliminate the disk load caused by repair merkle tree calculation >>>>>>>>> > * eliminate repair overstreaming >>>>>>>>> > * reduce disk load of reads on cluster to close to 1/CL >>>>>>>>> > * fix longstanding mutation atomicity issues caused by read repair >>>>>>>>> > and short read protection >>>>>>>>> > >>>>>>>>> > Additionally, although it's outside the scope of this CEP, mutation >>>>>>>>> > tracking would enable: >>>>>>>>> > * completion of witness replicas / transient replication, making >>>>>>>>> > the feature usable for all workloads >>>>>>>>> > * lightweight witness only datacenters >>>>>>>>> > >>>>>>>>> > The CEP is linked here: >>>>>>>>> > https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking, >>>>>>>>> > but please keep the discussion on the dev list. >>>>>>>>> > >>>>>>>>> > Thanks! >>>>>>>>> > >>>>>>>>> > Blake Eggleston >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> >>>>>>>>> >>>>>>>>> -- >>>>>>>>> http://twitter.com/tjake >>>>>>> >>>>> >>> >
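
(A closing illustration: earlier in the thread the mutation id is described as 12 bytes, a 4 byte node id from TCM plus an 8 byte HLC, with the HLC built from wall-clock milliseconds scaled to microseconds. The sketch below is one plausible Java encoding of that description; the class name, method names, and the exact HLC construction are assumptions for illustration, not taken from the CEP.)

    import java.nio.ByteBuffer;

    // Illustrative only: a 12-byte mutation id = 4-byte node id + 8-byte HLC.
    final class MutationId {
        final int nodeId; // per the thread: a 4-byte node id issued via TCM
        final long hlc;   // per the thread: an 8-byte hybrid logical clock value

        MutationId(int nodeId, long hlc) {
            this.nodeId = nodeId;
            this.hlc = hlc;
        }

        ByteBuffer serialize() {
            return (ByteBuffer) ByteBuffer.allocate(12).putInt(nodeId).putLong(hlc).flip();
        }

        static MutationId deserialize(ByteBuffer in) {
            return new MutationId(in.getInt(), in.getLong());
        }

        // Assumed construction, consistent with the restart reasoning earlier in the
        // thread: millis scaled to micros, bumped past the last issued value. A node
        // coordinating fewer than 1,000,000 writes a second and taking more than a
        // second to restart should then not re-issue an id from before the restart.
        static long nextHlc(long lastIssuedHlc) {
            return Math.max(lastIssuedHlc + 1, System.currentTimeMillis() * 1000L);
        }
    }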