Question re: Log Truncation <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=337676893#CEP45:MutationTracking-Logtruncation> (emphasis mine):
> When the cluster is operating normally, log entries can be discarded once
> they are older than the last reconciliation time of their respective ranges.
> To prevent unbounded log growth during outages, however, logs are still
> deleted once they reach some configurable amount of time (maybe 2 hours by
> default?). *From here, all reconciliation processes behave the same as
> before, but they use mutation ids stored in sstable metadata for listing
> mutation ids and transmit missing partitions.*

What happens when/if we have data living in a memtable past that time
threshold that hasn't yet been flushed to an sstable? i.e., a low-velocity
table, or a really tightly configured "purge my mutation reconciliation logs
at time bound X".

On Thu, Jan 9, 2025, at 10:07 AM, Chris Lohfink wrote:
> Is this something we can disable? I can see scenarios where this would be
> strictly and severely worse than existing scenarios where we don't need
> repairs, i.e. short-time-window data, millions of writes a second that get
> thrown out after a few hours. If that data is small partitions, we are
> nearly doubling the disk use for things we don't care about.
>
> Chris
>
> On Wed, Jan 8, 2025 at 9:01 PM guo Maxwell <cclive1...@gmail.com> wrote:
>> After a brief look I have two questions. If I ask something inappropriate,
>> please feel free to correct me:
>>
>> 1. Does it support changing a table to use mutation tracking through
>> ALTER TABLE if it did not use mutation tracking before?
>> 2.
>>> Available options for tables are `keyspace`, `legacy`, and `logged`, with
>>> the default being `keyspace`, which inherits the keyspace setting
>>
>> Do you think keyspace_inherit (or another keyword that clearly explains
>> the behavior) would be better than the name keyspace?
>> In addition, is legacy appropriate? Because this is a new feature, there
>> is only the behavior of turning it on and off, and turning it off means
>> not using this feature.
>> If the keyword legacy is used, won't users read it as an old version of
>> mutation tracking, similar to the relationship between SAI and native2i?
>>
>> Jon Haddad <j...@rustyrazorblade.com> wrote on Thu, Jan 9, 2025 at 06:14:
>>> JD, the fact that pagination is implemented as multiple queries is a
>>> design choice. A user performs a query with a fetch size of 1 or 100 and
>>> they will get different behavior.
>>>
>>> I'm not asking for anyone to implement MVCC. I'm asking for the docs
>>> around this to be correct. We should not use the term guarantee here;
>>> it's best effort.
>>>
>>> On Wed, Jan 8, 2025 at 2:06 PM J. D. Jordan <jeremiah.jor...@gmail.com> wrote:
>>>> Your pagination case is not a violation of any guarantees Cassandra
>>>> makes. It has never made guarantees across multiple queries.
>>>> Trying to have MVCC/consistent data across multiple queries is a very
>>>> different issue/problem from this CEP. If you want to have a discussion
>>>> about MVCC, I suggest creating a new thread.
>>>>
>>>> -Jeremiah
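Jon's "fetch size" point is visible directly from the client side. Below is a minimal sketch using the DataStax Java driver 4.x; the keyspace, table, and partition key values are hypothetical. Iterating the ResultSet transparently issues a separate query for each page, and nothing snapshots the partition across those queries:

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.cql.ResultSet;
    import com.datastax.oss.driver.api.core.cql.Row;
    import com.datastax.oss.driver.api.core.cql.SimpleStatement;

    public class PagingSketch {
        public static void main(String[] args) {
            try (CqlSession session = CqlSession.builder().build()) {
                // With a page size of 100, rows beyond the first 100 are fetched
                // by a separate query when iteration crosses the page boundary.
                // A batch applied between those fetches can be visible on page 2
                // but absent from page 1: a "torn" read of the partition.
                SimpleStatement stmt = SimpleStatement
                        .newInstance("SELECT * FROM ks.events WHERE pk = ?", "p1")
                        .setPageSize(100);
                ResultSet rs = session.execute(stmt);
                int count = 0;
                for (Row row : rs) {
                    // Crossing a page boundary here triggers a new query under
                    // the hood; there is no MVCC snapshot spanning both pages.
                    count++;
                }
                System.out.println("read " + count + " rows");
            }
        }
    }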
>>>>> On Jan 8, 2025, at 3:47 PM, Jon Haddad <j...@rustyrazorblade.com> wrote:
>>>>>
>>>>> > It's true that we can't offer multi-page write atomicity without some
>>>>> > sort of MVCC. There are a lot of common query patterns that don't
>>>>> > involve paging though, so it's not like the benefit of fixing write
>>>>> > atomicity would only apply to a small subset of carefully crafted
>>>>> > queries or something.
>>>>>
>>>>> Sure, it'll work a lot, but we don't say "partition level write atomicity
>>>>> some of the time". We say guarantee. From the CEP:
>>>>>
>>>>> > In the case of read repair, since we are only reading and correcting
>>>>> > the parts of a partition that we're reading and not the entire contents
>>>>> > of a partition on each read, read repair can break our *guarantee* on
>>>>> > partition level write atomicity. This approach also prevents meeting
>>>>> > the monotonic read requirement for witness replicas, which has
>>>>> > significantly limited its usefulness.
>>>>>
>>>>> I point this out because it's not well known, and we make a guarantee
>>>>> that isn't true. While the CEP will reduce the number of cases in which
>>>>> we violate the guarantee, we will still have known edge cases where it
>>>>> doesn't hold up. So we should stop saying it.
>>>>>
>>>>> On Wed, Jan 8, 2025 at 1:30 PM Blake Eggleston <beggles...@apple.com> wrote:
>>>>>> Thanks Dimitry and Jon, answers below.
>>>>>>
>>>>>>> 1) Is a single separate commit log expected to be created for all
>>>>>>> tables with the new replication type?
>>>>>>
>>>>>> The plan is to still have a single commit log, but only index mutations
>>>>>> with a mutation id.
>>>>>>
>>>>>>> 2) What is the granularity of storing mutation ids in a memtable; is
>>>>>>> it per cell?
>>>>>>
>>>>>> It would be per-partition.
>>>>>>
>>>>>>> 3) If we update the same row multiple times while it is in a memtable,
>>>>>>> are all mutation ids appended to a kind of collection?
>>>>>>
>>>>>> They would, yes. We might be able to do something where we stop tracking
>>>>>> mutations that have been superseded by newer mutations (same cells,
>>>>>> higher timestamps), but I suspect that would be more trouble than it's
>>>>>> worth and would be out of scope for v1.
>>>>>>
>>>>>>> 4) What is the expected size of a single id?
>>>>>>
>>>>>> It's currently 12 bytes: a 4-byte node id (from TCM) and an 8-byte HLC.
>>>>>>
>>>>>>> 5) Do we plan to support multi-table batches (single or
>>>>>>> multi-partition) for this replication type?
>>>>>>
>>>>>> This is intended to support all existing features; however, the tracking
>>>>>> only happens at the mutation level, so the different mutations coming
>>>>>> out of a multi-partition batch would all be tracked individually.
>>>>>>
>>>>>>> So even without repair mucking things up, we're unable to fulfill this
>>>>>>> promise except under the specific, ideal circumstance of querying a
>>>>>>> partition with only 1 page.
>>>>>>
>>>>>> It's true that we can't offer multi-page write atomicity without some
>>>>>> sort of MVCC. There are a lot of common query patterns that don't
>>>>>> involve paging though, so it's not like the benefit of fixing write
>>>>>> atomicity would only apply to a small subset of carefully crafted
>>>>>> queries or something.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Blake
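A rough illustration of the id Blake describes: the 12-byte layout (4-byte node id plus 8-byte HLC) is taken from his answer above, but the type and field names here are invented for the sketch and are not CEP-45 code:

    import java.nio.ByteBuffer;

    // Illustrative only: a 12-byte mutation id, a 4-byte node id (from TCM)
    // followed by an 8-byte hybrid logical clock value.
    record MutationId(int nodeId, long hlc) {
        static final int SERIALIZED_SIZE = Integer.BYTES + Long.BYTES; // 12 bytes

        ByteBuffer serialize() {
            ByteBuffer out = ByteBuffer.allocate(SERIALIZED_SIZE);
            out.putInt(nodeId).putLong(hlc).flip();
            return out;
        }

        static MutationId deserialize(ByteBuffer in) {
            return new MutationId(in.getInt(), in.getLong());
        }
    }

At this size, Chris's overhead concern is concrete: a partition carrying n still-tracked mutation ids pays roughly 12n bytes of metadata, which adds up for workloads with many tiny, short-lived partitions.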
>>>>>>> On Jan 8, 2025, at 12:23 PM, Jon Haddad <j...@rustyrazorblade.com> wrote:
>>>>>>>
>>>>>>> Very cool! I'll need to spend some time reading this over. One thing
>>>>>>> I did notice is this:
>>>>>>>
>>>>>>> > Cassandra promises partition level write atomicity. This means that,
>>>>>>> > although writes are eventually consistent, a given write will either
>>>>>>> > be visible or not visible. You're not supposed to see a partially
>>>>>>> > applied write. However, read repair and short read protection can
>>>>>>> > both "tear" mutations. In the case of read repair, this is because
>>>>>>> > the data resolver only evaluates the data included in the client
>>>>>>> > read. So if your read only covers a portion of a write that didn't
>>>>>>> > reach a quorum, only that portion will be repaired, breaking write
>>>>>>> > atomicity.
>>>>>>>
>>>>>>> Unfortunately, there are more issues with this than just repair. Since
>>>>>>> we lack a consistency mechanism like MVCC while paginating, the
>>>>>>> following is possible:
>>>>>>>
>>>>>>> thread A: reads a partition P with 10K rows, starting with the first page
>>>>>>> thread B: writes a batch to 2 rows in partition P, one on page 1, another on page 2
>>>>>>> thread A: reads the second page of P, which has the mutation
>>>>>>>
>>>>>>> I've worked with users who have been surprised by this behavior,
>>>>>>> because pagination happens transparently.
>>>>>>>
>>>>>>> So even without repair mucking things up, we're unable to fulfill this
>>>>>>> promise except under the specific, ideal circumstance of querying a
>>>>>>> partition with only 1 page.
>>>>>>>
>>>>>>> Jon
>>>>>>>
>>>>>>> On Wed, Jan 8, 2025 at 11:21 AM Blake Eggleston <beggles...@apple.com> wrote:
>>>>>>>> Hello dev@,
>>>>>>>>
>>>>>>>> We'd like to propose CEP-45: Mutation Tracking for adoption by the
>>>>>>>> community. CEP-45 proposes adding a replication mechanism to track and
>>>>>>>> reconcile individual mutations, as well as processes to actively
>>>>>>>> reconcile missing mutations.
>>>>>>>>
>>>>>>>> For keyspaces with mutation tracking enabled, the immediate benefits
>>>>>>>> of this CEP are:
>>>>>>>> * reduced replication lag with a continuous background reconciliation
>>>>>>>> process
>>>>>>>> * eliminating the disk load caused by repair merkle tree calculation
>>>>>>>> * eliminating repair overstreaming
>>>>>>>> * reducing the disk load of reads on the cluster to close to 1/CL
>>>>>>>> * fixing longstanding mutation atomicity issues caused by read repair
>>>>>>>> and short read protection
>>>>>>>>
>>>>>>>> Additionally, although it's outside the scope of this CEP, mutation
>>>>>>>> tracking would enable:
>>>>>>>> * completion of witness replicas / transient replication, making the
>>>>>>>> feature usable for all workloads
>>>>>>>> * lightweight witness-only datacenters
>>>>>>>>
>>>>>>>> The CEP is linked here:
>>>>>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking,
>>>>>>>> but please keep the discussion on the dev list.
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> Blake Eggleston
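As a footnote on the 8-byte HLC component of the mutation id discussed above: here is a minimal hybrid logical clock sketch in the common style, packing physical milliseconds into the high bits and a logical tiebreaker counter into the low bits so that stamps stay monotonic even when the wall clock stalls or steps backwards. The CEP's actual clock may be built differently:

    import java.util.concurrent.atomic.AtomicLong;

    // Minimal HLC sketch: high 48 bits hold physical time (millis), low 16
    // bits hold a logical counter. Not the CEP-45 implementation.
    final class HybridLogicalClock {
        private final AtomicLong last = new AtomicLong();

        long next() {
            return last.updateAndGet(prev -> {
                long physical = System.currentTimeMillis() << 16;
                // If wall time advanced past the previous stamp, restart the
                // counter; otherwise bump the logical bits to stay monotonic.
                return physical > prev ? physical : prev + 1;
            });
        }
    }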