Question re: Log Truncation <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=337676893#CEP45:MutationTracking-Logtruncation> (emphasis mine):
> When the cluster is operating normally, log entries can be discarded once
> they are older than the last reconciliation time of their respective ranges.
> To prevent unbounded log growth during outages, however, logs are still
> deleted once they reach some configurable amount of time (maybe 2 hours by
> default?). *From here, all reconciliation processes behave the same as
> before, but they use mutation ids stored in sstable metadata for listing
> mutation ids and transmit missing partitions.*

What happens when/if we have data living in a memtable past that time
threshold that hasn't yet been flushed to an sstable? i.e., a low-velocity
table, or a really tightly configured "purge my mutation reconciliation logs
at time bound X".

On Thu, Jan 9, 2025, at 10:07 AM, Chris Lohfink wrote:
> Is this something we can disable? I can see scenarios where this would be
> strictly and severely worse than existing scenarios where we don't need
> repairs, i.e. short-time-window data, millions of writes a second that get
> thrown out after a few hours. If that data is small partitions, we are
> nearly doubling the disk use for things we don't care about.
>
> Chris
>
> On Wed, Jan 8, 2025 at 9:01 PM guo Maxwell <cclive1...@gmail.com> wrote:
>> After a brief look I have two questions. If I ask something inappropriate,
>> please feel free to correct me:
>>
>> 1. Does it support changing a table to use mutation tracking through
>> ALTER TABLE if it did not use mutation tracking before?
>> 2.
>>> Available options for tables are `keyspace`, `legacy`, and `logged`, with
>>> the default being `keyspace`, which inherits the keyspace setting
>>
>> Do you think keyspace_inherit (or another keyword that clearly explains
>> the behavior) would be better than the name keyspace?
>> In addition, is legacy appropriate? Because this is a new feature, there
>> is only the behavior of turning it on and off, and turning it off means
>> not using this feature.
>> If the keyword legacy is used, won't users read it as an old version of
>> mutation tracking, similar to the relationship between SAI and native2i?
>>
>> Jon Haddad <j...@rustyrazorblade.com> wrote on Thu, Jan 9, 2025 at 06:14:
>>> JD, the fact that pagination is implemented as multiple queries is a
>>> design choice. A user performs a query with a fetch size of 1 or 100 and
>>> they will get different behavior.
>>>
>>> I'm not asking for anyone to implement MVCC. I'm asking for the docs
>>> around this to be correct. We should not use the term guarantee here;
>>> it's best effort.
>>>
>>> On Wed, Jan 8, 2025 at 2:06 PM J. D. Jordan <jeremiah.jor...@gmail.com> wrote:
>>>> Your pagination case is not a violation of any guarantees Cassandra
>>>> makes. It has never made guarantees across multiple queries.
>>>> Trying to have MVCC/consistent data across multiple queries is a very
>>>> different issue/problem from this CEP. If you want to have a discussion
>>>> about MVCC, I suggest creating a new thread.
>>>>
>>>> -Jeremiah
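Jon's "fetch size" point is visible directly from the client side. Below is a minimal sketch using the DataStax Java driver 4.x; the keyspace, table, and partition key values are hypothetical. Iterating the ResultSet transparently issues a separate query for each page, and nothing snapshots the partition across those queries:

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.cql.ResultSet;
    import com.datastax.oss.driver.api.core.cql.Row;
    import com.datastax.oss.driver.api.core.cql.SimpleStatement;

    public class PagingSketch {
        public static void main(String[] args) {
            try (CqlSession session = CqlSession.builder().build()) {
                // With a page size of 100, rows beyond the first 100 are fetched
                // by a separate query when iteration crosses the page boundary.
                // A batch applied between those fetches can be visible on page 2
                // but absent from page 1: a "torn" read of the partition.
                SimpleStatement stmt = SimpleStatement
                        .newInstance("SELECT * FROM ks.events WHERE pk = ?", "p1")
                        .setPageSize(100);
                ResultSet rs = session.execute(stmt);
                int count = 0;
                for (Row row : rs) {
                    // Crossing a page boundary here triggers a new query under
                    // the hood; there is no MVCC snapshot spanning both pages.
                    count++;
                }
                System.out.println("read " + count + " rows");
            }
        }
    }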
>>>>> On Jan 8, 2025, at 3:47 PM, Jon Haddad <j...@rustyrazorblade.com> wrote:
>>>>>
>>>>> > It's true that we can't offer multi-page write atomicity without some
>>>>> > sort of MVCC. There are a lot of common query patterns that don't
>>>>> > involve paging though, so it's not like the benefit of fixing write
>>>>> > atomicity would only apply to a small subset of carefully crafted
>>>>> > queries or something.
>>>>>
>>>>> Sure, it'll work a lot, but we don't say "partition level write atomicity
>>>>> some of the time". We say guarantee. From the CEP:
>>>>>
>>>>> > In the case of read repair, since we are only reading and correcting
>>>>> > the parts of a partition that we're reading and not the entire contents
>>>>> > of a partition on each read, read repair can break our *guarantee* on
>>>>> > partition level write atomicity. This approach also prevents meeting
>>>>> > the monotonic read requirement for witness replicas, which has
>>>>> > significantly limited its usefulness.
>>>>>
>>>>> I point this out because it's not well known, and we make a guarantee
>>>>> that isn't true. While the CEP will reduce the number of cases in which
>>>>> we violate the guarantee, we will still have known edge cases where it
>>>>> doesn't hold up. So we should stop saying it.
>>>>>
>>>>> On Wed, Jan 8, 2025 at 1:30 PM Blake Eggleston <beggles...@apple.com> wrote:
>>>>>> Thanks Dimitry and Jon, answers below.
>>>>>>
>>>>>>> 1) Is a single separate commit log expected to be created for all
>>>>>>> tables with the new replication type?
>>>>>>
>>>>>> The plan is to still have a single commit log, but only index mutations
>>>>>> with a mutation id.
>>>>>>
>>>>>>> 2) What is the granularity of storing mutation ids in a memtable; is
>>>>>>> it per cell?
>>>>>>
>>>>>> It would be per-partition.
>>>>>>
>>>>>>> 3) If we update the same row multiple times while it is in a memtable,
>>>>>>> are all mutation ids appended to a kind of collection?
>>>>>>
>>>>>> They would, yes. We might be able to do something where we stop tracking
>>>>>> mutations that have been superseded by newer mutations (same cells,
>>>>>> higher timestamps), but I suspect that would be more trouble than it's
>>>>>> worth and would be out of scope for v1.
>>>>>>
>>>>>>> 4) What is the expected size of a single id?
>>>>>>
>>>>>> It's currently 12 bytes: a 4-byte node id (from TCM) and an 8-byte HLC.
>>>>>>
>>>>>>> 5) Do we plan to support multi-table batches (single or
>>>>>>> multi-partition) for this replication type?
>>>>>>
>>>>>> This is intended to support all existing features; however, the tracking
>>>>>> only happens at the mutation level, so the different mutations coming
>>>>>> out of a multi-partition batch would all be tracked individually.
>>>>>>
>>>>>>> So even without repair mucking things up, we're unable to fulfill this
>>>>>>> promise except under the specific, ideal circumstance of querying a
>>>>>>> partition with only 1 page.
>>>>>>
>>>>>> It's true that we can't offer multi-page write atomicity without some
>>>>>> sort of MVCC. There are a lot of common query patterns that don't
>>>>>> involve paging though, so it's not like the benefit of fixing write
>>>>>> atomicity would only apply to a small subset of carefully crafted
>>>>>> queries or something.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Blake
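A rough illustration of the id Blake describes: the 12-byte layout (4-byte node id plus 8-byte HLC) is taken from his answer above, but the type and field names here are invented for the sketch and are not CEP-45 code:

    import java.nio.ByteBuffer;

    // Illustrative only: a 12-byte mutation id, a 4-byte node id (from TCM)
    // followed by an 8-byte hybrid logical clock value.
    record MutationId(int nodeId, long hlc) {
        static final int SERIALIZED_SIZE = Integer.BYTES + Long.BYTES; // 12 bytes

        ByteBuffer serialize() {
            ByteBuffer out = ByteBuffer.allocate(SERIALIZED_SIZE);
            out.putInt(nodeId).putLong(hlc).flip();
            return out;
        }

        static MutationId deserialize(ByteBuffer in) {
            return new MutationId(in.getInt(), in.getLong());
        }
    }

At this size, Chris's overhead concern is concrete: a partition carrying n still-tracked mutation ids pays roughly 12n bytes of metadata, which adds up for workloads with many tiny, short-lived partitions.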
>>>>>>> On Jan 8, 2025, at 12:23 PM, Jon Haddad <j...@rustyrazorblade.com> wrote:
>>>>>>>
>>>>>>> Very cool! I'll need to spend some time reading this over. One thing
>>>>>>> I did notice is this:
>>>>>>>
>>>>>>> > Cassandra promises partition level write atomicity. This means that,
>>>>>>> > although writes are eventually consistent, a given write will either
>>>>>>> > be visible or not visible. You're not supposed to see a partially
>>>>>>> > applied write. However, read repair and short read protection can
>>>>>>> > both "tear" mutations. In the case of read repair, this is because
>>>>>>> > the data resolver only evaluates the data included in the client
>>>>>>> > read. So if your read only covers a portion of a write that didn't
>>>>>>> > reach a quorum, only that portion will be repaired, breaking write
>>>>>>> > atomicity.
>>>>>>>
>>>>>>> Unfortunately, there are more issues with this than just repair. Since
>>>>>>> we lack a consistency mechanism like MVCC while paginating, the
>>>>>>> following is possible:
>>>>>>>
>>>>>>> thread A: reads a partition P with 10K rows, starting with the first page
>>>>>>> thread B: writes a batch to 2 rows in partition P, one on page 1, another on page 2
>>>>>>> thread A: reads the second page of P, which has the mutation
>>>>>>>
>>>>>>> I've worked with users who have been surprised by this behavior,
>>>>>>> because pagination happens transparently.
>>>>>>>
>>>>>>> So even without repair mucking things up, we're unable to fulfill this
>>>>>>> promise except under the specific, ideal circumstance of querying a
>>>>>>> partition with only 1 page.
>>>>>>>
>>>>>>> Jon
>>>>>>>
>>>>>>> On Wed, Jan 8, 2025 at 11:21 AM Blake Eggleston <beggles...@apple.com> wrote:
>>>>>>>> Hello dev@,
>>>>>>>>
>>>>>>>> We'd like to propose CEP-45: Mutation Tracking for adoption by the
>>>>>>>> community. CEP-45 proposes adding a replication mechanism to track and
>>>>>>>> reconcile individual mutations, as well as processes to actively
>>>>>>>> reconcile missing mutations.
>>>>>>>>
>>>>>>>> For keyspaces with mutation tracking enabled, the immediate benefits
>>>>>>>> of this CEP are:
>>>>>>>> * reduced replication lag with a continuous background reconciliation
>>>>>>>> process
>>>>>>>> * eliminating the disk load caused by repair merkle tree calculation
>>>>>>>> * eliminating repair overstreaming
>>>>>>>> * reducing the disk load of reads on the cluster to close to 1/CL
>>>>>>>> * fixing longstanding mutation atomicity issues caused by read repair
>>>>>>>> and short read protection
>>>>>>>>
>>>>>>>> Additionally, although it's outside the scope of this CEP, mutation
>>>>>>>> tracking would enable:
>>>>>>>> * completion of witness replicas / transient replication, making the
>>>>>>>> feature usable for all workloads
>>>>>>>> * lightweight witness-only datacenters
>>>>>>>>
>>>>>>>> The CEP is linked here:
>>>>>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking,
>>>>>>>> but please keep the discussion on the dev list.
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> Blake Eggleston
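As a footnote on the 8-byte HLC component of the mutation id discussed above: here is a minimal hybrid logical clock sketch in the common style, packing physical milliseconds into the high bits and a logical tiebreaker counter into the low bits so that stamps stay monotonic even when the wall clock stalls or steps backwards. The CEP's actual clock may be built differently:

    import java.util.concurrent.atomic.AtomicLong;

    // Minimal HLC sketch: high 48 bits hold physical time (millis), low 16
    // bits hold a logical counter. Not the CEP-45 implementation.
    final class HybridLogicalClock {
        private final AtomicLong last = new AtomicLong();

        long next() {
            return last.updateAndGet(prev -> {
                long physical = System.currentTimeMillis() << 16;
                // If wall time advanced past the previous stamp, restart the
                // counter; otherwise bump the logical bits to stay monotonic.
                return physical > prev ? physical : prev + 1;
            });
        }
    }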