> The other issue is that there isn’t a time bound on the paging payload, so if the application is taking long enough between pages that the log has been truncated, we’d have to throw an exception.

My hot take is that this relationship between how long you're taking to page, how much data you're processing / getting back, and ingest / flushing frequency all combining to produce unpredictable exceptions would be a bad default from a UX perspective compared to a default of "a single page of data has atomicity; multiple pages do not". Maybe it's just because that's been our default for so long.

The simplicity of having a flag that's "don't make my pages atomic and they always return" vs. "make my pages atomic and throw exceptions if the metadata I need is yoinked while I page" is pretty attractive to me.
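If I sketch what I mean in code (all of these names are invented for illustration; nothing here is from the CEP), it'd be something like:

    // Hypothetical paging-atomicity modes; names are illustrative only.
    enum PagingAtomicity {
        PAGE_LEVEL, // today's default: each page is internally atomic, the scan is not
        SCAN_LEVEL  // pin the whole scan to a mutation-id high water mark; may throw
    }

    class PagingSnapshotExpiredException extends RuntimeException {
        PagingSnapshotExpiredException(String message) { super(message); }
    }

    interface PagingState {
        long highWaterMark();        // highest mutation id seen at the last page boundary
        boolean logCovers(long hwm); // can the mutation log still reconstruct that point?
    }

    final class PagedRead {
        private final PagingAtomicity mode;

        PagedRead(PagingAtomicity mode) { this.mode = mode; }

        // Server-side decision when the client asks for the next page.
        void checkSnapshot(PagingState state) {
            if (mode == PagingAtomicity.SCAN_LEVEL && !state.logCovers(state.highWaterMark())) {
                // The application paused long enough that the log was truncated:
                // fail loudly instead of silently tearing the multi-page result.
                throw new PagingSnapshotExpiredException(
                    "mutation log no longer covers this scan's high water mark");
            }
            // PAGE_LEVEL mode never throws here; it just reads whatever is current.
        }
    }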
Really interesting thought, using these logs as "partial MVCC" while they're available, specifically for what could/should be a very tight-timeline use case (paging).

On Thu, Jan 16, 2025, at 12:41 PM, Jake Luciani wrote:
> This is very cool!
>
> I have done a POC that was similar but more akin to the Aurora paper, whereby the commitlog would proactively repair itself from peers using the seekable commitlog.
>
> Can you explain the reason you prefer to reconcile on read? Having a consistent commitlog would solve so many problems like CDC, PITR, MVs, etc.
>
> Jake
>
> On Thu, Jan 16, 2025 at 12:13 PM Blake Eggleston <beggles...@apple.com> wrote:
> >
> > I’ve been thinking about the paging atomicity issue. I think it could be fixed with mutation tracking and without having to support full-on MVCC.
> >
> > When we reach a page boundary, we can send the highest mutation id we’ve seen for the partition we reached the paging boundary on. When we request another page, we send that high water mark back as part of the paging request.
> >
> > Each sstable and memtable contributing to the read responses will know which mutations it has in each partition, so if we encounter one that has a higher id than we saw in the last page, we reconstitute its data from mutations in the log, excluding the newer mutations, or exclude it entirely if it only has newer mutations.
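> > Sketching that in code, roughly (every name here is invented for the sketch; the real plumbing would live in the read path rather than in a standalone class like this):
> >
> >     import java.util.Optional;
> >
> >     interface Partition {} // stand-in for a partition's contents
> >
> >     // The state the client echoes back with the next page request.
> >     final class PageBoundary {
> >         final byte[] partitionKey;
> >         final long highestMutationId; // highest id seen for this partition so far
> >
> >         PageBoundary(byte[] partitionKey, long highestMutationId) {
> >             this.partitionKey = partitionKey;
> >             this.highestMutationId = highestMutationId;
> >         }
> >     }
> >
> >     interface PartitionSource { // stand-in for an sstable or memtable
> >         long highestMutationId(byte[] partitionKey);
> >         Partition read(byte[] partitionKey);
> >         // Rebuild the partition's contribution from the mutation log,
> >         // keeping only mutations with id <= maxId (the expensive path).
> >         Optional<Partition> reconstituteUpTo(byte[] partitionKey, long maxId);
> >     }
> >
> >     final class PagedPartitionReader {
> >         Optional<Partition> readForPage(PartitionSource source, PageBoundary boundary) {
> >             if (source.highestMutationId(boundary.partitionKey) <= boundary.highestMutationId) {
> >                 // Nothing newer than what the previous page saw: read directly.
> >                 return Optional.of(source.read(boundary.partitionKey));
> >             }
> >             // The source holds mutations newer than the page snapshot: rebuild
> >             // its contribution from the log, excluding the newer mutations.
> >             // Returns empty if the source only holds newer mutations.
> >             return source.reconstituteUpTo(boundary.partitionKey, boundary.highestMutationId);
> >         }
> >     }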
> > This isn’t free of course. When paging through large partitions, each page request becomes more likely to encounter mutations it needs to exclude, and it’s unclear how expensive that will be. Obviously it’s more expensive to reconstitute than to read, but on the other hand, only a single replica will be reading any data, so on balance it would still probably be less work for the cluster than running the normal read path.
> >
> > The other issue is that there isn’t a time bound on the paging payload, so if the application is taking long enough between pages that the log has been truncated, we’d have to throw an exception.
> >
> > This is mostly just me brainstorming though, and wouldn’t be something that would be in a v1.
> >
> > On Jan 9, 2025, at 2:07 PM, Blake Eggleston <beggles...@apple.com> wrote:
> >
> > So the ids themselves are in the memtable and are accessible as soon as they’re written, and need to be for the read path to work.
> >
> > We’re not able to reconcile the ids until we can guarantee that they won’t be merged with unreconciled data; that’s why they’re flushed before reconciliation.
> >
> > On Jan 9, 2025, at 10:53 AM, Josh McKenzie <jmcken...@apache.org> wrote:
> >
> > > We also can't remove mutation ids until they've been reconciled, so in the simplest implementation, we'd need to flush a memtable before reconciling, and there would never be a situation where you have purgeable mutation ids in the memtable.
> >
> > Got it. So effectively that data would be unreconcilable until such time as it was flushed and you had those ids to work with in the sstable metadata, and the process can force a flush to reconcile in those cases where you have mutations in the MT/CL combo that are transiently not subject to the reconciliation process due to that log being purged. Or you flush before purging the log, assuming we're not changing MT data structures to store ids (don't recall if that's specified in the CEP...)
> >
> > Am I grokking that?
> >
> > On Thu, Jan 9, 2025, at 1:49 PM, Blake Eggleston wrote:
> >
> > Hi Josh,
> >
> > You can think of reconciliation as analogous to incremental repair. Like incremental repair, you can't mix reconciled/unreconciled data without causing problems. We also can't remove mutation ids until they've been reconciled, so in the simplest implementation, we'd need to flush a memtable before reconciling, and there would never be a situation where you have purgeable mutation ids in the memtable.
> >
> > The production version of this will be more sophisticated about how it keeps this data separate so it can reliably support automatic reconciliation cadences that are higher than what you can do with incremental repair today, but that’s the short answer.
> >
> > It's also likely that the concept of log truncation will be removed in favor of going straight to cohort reconciliation in longer outages.
> >
> > Thanks,
> >
> > Blake
> >
> > On Jan 9, 2025, at 8:27 AM, Josh McKenzie <jmcken...@apache.org> wrote:
> >
> > Question re: Log Truncation (emphasis mine):
> >
> > > When the cluster is operating normally, log entries can be discarded once they are older than the last reconciliation time of their respective ranges. To prevent unbounded log growth during outages however, logs are still deleted once they reach some configurable amount of time (maybe 2 hours by default?). From here, all reconciliation processes behave the same as before, but they use mutation ids stored in sstable metadata for listing mutation ids and transmitting missing partitions.
> >
> > What happens when / if we have data living in a memtable past that time threshold that hasn't yet been flushed to an sstable? i.e. a low-velocity table, or a really tightly configured "purge my mutation reconciliation logs at time bound X".
> >
> > On Thu, Jan 9, 2025, at 10:07 AM, Chris Lohfink wrote:
> >
> > Is this something we can disable? I can see scenarios where this would be strictly and severely worse than existing scenarios where we don't need repairs, i.e. short-time-window data, millions of writes a second that get thrown out after a few hours. If that data is small partitions, we are nearly doubling the disk use for things we don't care about.
> >
> > Chris
> >
> > On Wed, Jan 8, 2025 at 9:01 PM guo Maxwell <cclive1...@gmail.com> wrote:
> >
> > After a brief look, I have 2 questions. If I ask something inappropriate, please feel free to correct me:
> >
> > 1. Does it support changing a table to use mutation tracking through ALTER TABLE if it did not use mutation tracking before?
> >
> > 2. > Available options for tables are keyspace, legacy, and logged, with the default being keyspace, which inherits the keyspace setting
> >
> > Do you think that keyspace_inherit (or another keyword that clearly explains the behavior) is better than the name keyspace? In addition, is legacy appropriate? Because this is a new feature, there is only the behavior of turning it on and off, and turning it off means not using this feature. If the keyword legacy is used, from the user's perspective, does that suggest they are using an old version of mutation tracking? Similar to the relationship between SAI and native 2i.
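> > To make the ALTER TABLE question concrete, I imagine the CQL would look something like this (the property name "replication_type" is my guess, not the CEP's final syntax; the three values are the ones quoted above):
> >
> >     import com.datastax.oss.driver.api.core.CqlSession;
> >
> >     public class MutationTrackingOptionExample {
> >         public static void main(String[] args) {
> >             try (CqlSession session = CqlSession.builder().build()) {
> >                 // NOTE: the property name below is hypothetical; only the
> >                 // values (keyspace / legacy / logged) come from the CEP text.
> >                 // Inherit the keyspace setting (the proposed default):
> >                 session.execute("ALTER TABLE ks.events WITH replication_type = 'keyspace'");
> >                 // Opt a single table into mutation tracking:
> >                 session.execute("ALTER TABLE ks.events WITH replication_type = 'logged'");
> >                 // Keep classic replication for this table:
> >                 session.execute("ALTER TABLE ks.events WITH replication_type = 'legacy'");
> >             }
> >         }
> >     }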
> > On Thu, Jan 9, 2025 at 06:14, Jon Haddad <j...@rustyrazorblade.com> wrote:
> >
> > JD, the fact that pagination is implemented as multiple queries is a design choice. A user performs a query with a fetch size of 1 or 100 and they will get different behavior.
> >
> > I'm not asking for anyone to implement MVCC. I'm asking for the docs around this to be correct. We should not use the term guarantee here; it's best effort.
> >
> > On Wed, Jan 8, 2025 at 2:06 PM J. D. Jordan <jeremiah.jor...@gmail.com> wrote:
> >
> > Your pagination case is not a violation of any guarantees Cassandra makes. It has never made guarantees across multiple queries. Trying to have MVCC/consistent data across multiple queries is a very different issue/problem from this CEP. If you want to have a discussion about MVCC I suggest creating a new thread.
> >
> > -Jeremiah
> >
> > On Jan 8, 2025, at 3:47 PM, Jon Haddad <j...@rustyrazorblade.com> wrote:
> >
> > > It's true that we can't offer multi-page write atomicity without some sort of MVCC. There are a lot of common query patterns that don't involve paging though, so it's not like the benefit of fixing write atomicity would only apply to a small subset of carefully crafted queries or something.
> >
> > Sure, it'll work a lot of the time, but we don't say "partition level write atomicity some of the time". We say guarantee. From the CEP:
> >
> > > In the case of read repair, since we are only reading and correcting the parts of a partition that we're reading and not the entire contents of a partition on each read, read repair can break our guarantee on partition level write atomicity. This approach also prevents meeting the monotonic read requirement for witness replicas, which has significantly limited its usefulness.
> >
> > I point this out because it's not well known, and we make a guarantee that isn't true, and while the CEP will reduce the number of cases in which we violate the guarantee, we will still have known edge cases where it doesn't hold up. So we should stop saying it.
> >
> > On Wed, Jan 8, 2025 at 1:30 PM Blake Eggleston <beggles...@apple.com> wrote:
> >
> > Thanks Dimitry and Jon, answers below
> >
> > > 1) Is a single separate commit log expected to be created for all tables with the new replication type?
> >
> > The plan is to still have a single commit log, but to only index mutations that have a mutation id.
> >
> > > 2) What is the granularity of storing mutation ids in a memtable, is it per cell?
> >
> > It would be per-partition.
> >
> > > 3) If we update the same row multiple times while it is in a memtable, are all mutation ids appended to a kind of collection?
> >
> > They would be, yes. We might be able to do something where we stop tracking mutations that have been superseded by newer mutations (same cells, higher timestamps), but I suspect that would be more trouble than it's worth and would be out of scope for v1.
> >
> > > 4) What is the expected size of a single id?
> >
> > It's currently 12 bytes: a 4-byte node id (from TCM) and an 8-byte HLC.
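> > In other words, something like this (field names are illustrative, not the actual implementation):
> >
> >     // Sketch of the 12-byte mutation id described above: a 4-byte node id
> >     // (from transactional cluster metadata) plus an 8-byte hybrid logical clock.
> >     final class MutationId implements Comparable<MutationId> {
> >         final int nodeId; // 4 bytes, assigned via TCM
> >         final long hlc;   // 8 bytes, hybrid logical clock
> >
> >         MutationId(int nodeId, long hlc) {
> >             this.nodeId = nodeId;
> >             this.hlc = hlc;
> >         }
> >
> >         // The HLC gives a cluster-wide order that tracks causality; the
> >         // node id breaks ties so ids from different replicas never collide.
> >         @Override
> >         public int compareTo(MutationId other) {
> >             int cmp = Long.compare(hlc, other.hlc);
> >             return cmp != 0 ? cmp : Integer.compare(nodeId, other.nodeId);
> >         }
> >
> >         // 12 bytes on the wire / in sstable metadata.
> >         byte[] serialize() {
> >             return java.nio.ByteBuffer.allocate(12).putInt(nodeId).putLong(hlc).array();
> >         }
> >     }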
> > > 5) Do we plan to support multi-table batches (single or multi-partition) for this replication type?
> >
> > This is intended to support all existing features; however, the tracking only happens at the mutation level, so the different mutations coming out of a multi-partition batch would all be tracked individually.
> >
> > > So even without repair mucking things up, we're unable to fulfill this promise except under the specific, ideal circumstance of querying a partition with only 1 page.
> >
> > It's true that we can't offer multi-page write atomicity without some sort of MVCC. There are a lot of common query patterns that don't involve paging though, so it's not like the benefit of fixing write atomicity would only apply to a small subset of carefully crafted queries or something.
> >
> > Thanks,
> >
> > Blake
> >
> > On Jan 8, 2025, at 12:23 PM, Jon Haddad <j...@rustyrazorblade.com> wrote:
> >
> > Very cool! I'll need to spend some time reading this over. One thing I did notice is this:
> >
> > > Cassandra promises partition level write atomicity. This means that, although writes are eventually consistent, a given write will either be visible or not visible. You're not supposed to see a partially applied write. However, read repair and short read protection can both "tear" mutations. In the case of read repair, this is because the data resolver only evaluates the data included in the client read. So if your read only covers a portion of a write that didn't reach a quorum, only that portion will be repaired, breaking write atomicity.
> >
> > Unfortunately there are more issues with this than just repair. Since we lack a consistency mechanism like MVCC while paginating, it's possible to do the following:
> >
> > thread A: reads a partition P with 10K rows, starting with the first page
> > thread B: another thread writes a batch to 2 rows in partition P, one on page 1, another on page 2
> > thread A: reads the second page of P, which has the mutation
> >
> > I've worked with users who have been surprised by this behavior, because pagination happens transparently.
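> > From the application side there's nothing that even marks the page boundary; the driver fetches the next page behind a plain iterator. Roughly, with the 4.x Java driver (keyspace/table/column names invented):
> >
> >     import com.datastax.oss.driver.api.core.CqlSession;
> >     import com.datastax.oss.driver.api.core.cql.ResultSet;
> >     import com.datastax.oss.driver.api.core.cql.Row;
> >     import com.datastax.oss.driver.api.core.cql.SimpleStatement;
> >
> >     public class PagingTearExample {
> >         public static void main(String[] args) {
> >             try (CqlSession session = CqlSession.builder().build()) {
> >                 SimpleStatement stmt = SimpleStatement
> >                     .newInstance("SELECT ck, v FROM ks.wide_partition WHERE pk = 1")
> >                     .setPageSize(100);
> >                 ResultSet rs = session.execute(stmt);
> >                 for (Row row : rs) {
> >                     // Every 100th iteration silently fetches the next page.
> >                     // A batch applied between those fetches can be half-visible:
> >                     // its page-2 rows show up even though its page-1 rows were
> >                     // read before the batch was written.
> >                     process(row);
> >                 }
> >             }
> >         }
> >
> >         static void process(Row row) { /* application logic */ }
> >     }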
> > So even without repair mucking things up, we're unable to fulfill this promise except under the specific, ideal circumstance of querying a partition with only 1 page.
> >
> > Jon
> >
> > On Wed, Jan 8, 2025 at 11:21 AM Blake Eggleston <beggles...@apple.com> wrote:
> >
> > Hello dev@,
> >
> > We'd like to propose CEP-45: Mutation Tracking for adoption by the community. CEP-45 proposes adding a replication mechanism to track and reconcile individual mutations, as well as processes to actively reconcile missing mutations.
> >
> > For keyspaces with mutation tracking enabled, the immediate benefits of this CEP are:
> > * reduced replication lag, via a continuous background reconciliation process
> > * elimination of the disk load caused by repair merkle tree calculation
> > * elimination of repair overstreaming
> > * reduction of the disk load of reads on the cluster to close to 1/CL
> > * fixes for longstanding mutation atomicity issues caused by read repair and short read protection
> >
> > Additionally, although it's outside the scope of this CEP, mutation tracking would enable:
> > * completion of witness replicas / transient replication, making the feature usable for all workloads
> > * lightweight witness-only datacenters
> >
> > The CEP is linked here: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking, but please keep the discussion on the dev list.
> >
> > Thanks!
> >
> > Blake Eggleston
>
> --
> http://twitter.com/tjake