Re: [DISCUSS] CEP-45: Mutation Tracking

Dmitry Konstantinov Wed, 08 Jan 2025 14:02:23 -0800

>> 2) What is the granularity of storing mutation ids in memtable, is it per
>> cell?
> It would be per-partition


I suppose there is a trade-off here: the granularity of this metadata vs.
the probability of read repair in some cases. An example: take a large
partition (like a time slot) that we append data to frequently, and from
which we sometimes read a single row or a small range of clustering keys.
By comparing ids at the partition level, we may get a higher chance of a
mismatch and read repair than with the current logic, where we check for a
mismatch only in the fetched data.
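
To make the trade-off concrete, here is a rough Python sketch. It is not
how CEP-45 is implemented; the Replica class, its methods, and the per-row
id sets are hypothetical and exist only for comparison. The per-row check
stands in for the current logic, which compares only the fetched data. The
id layout (4-byte node id plus 8-byte HLC) is taken from Blake's answer
quoted below.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MutationId:
    node_id: int  # 4-byte node id, per the thread
    hlc: int      # 8-byte hybrid logical clock value, per the thread

    def to_bytes(self) -> bytes:
        # 4 + 8 = 12 bytes; big-endian is an assumption made for this sketch
        return self.node_id.to_bytes(4, "big") + self.hlc.to_bytes(8, "big")


class Replica:
    """Hypothetical replica state for a single partition."""

    def __init__(self):
        # clustering key -> ids of the mutations that touched that row
        self.rows = {}

    def apply(self, mutation_id, clustering_keys):
        for ck in clustering_keys:
            self.rows.setdefault(ck, set()).add(mutation_id)

    def partition_ids(self):
        # Per-partition granularity: every id that touched the partition
        return set().union(*self.rows.values())

    def ids_for(self, clustering_keys):
        # Finer granularity: only ids that touched the rows being read
        ids = set()
        for ck in clustering_keys:
            ids |= self.rows.get(ck, set())
        return ids


m1 = MutationId(node_id=1, hlc=100)
m2 = MutationId(node_id=1, hlc=101)

a, b = Replica(), Replica()
a.apply(m1, ["t1"])
b.apply(m1, ["t1"])
a.apply(m2, ["t2"])  # a newer append that has not reached replica b yet

# The client reads only row t1, which is identical on both replicas
rows_match = a.ids_for(["t1"]) == b.ids_for(["t1"])
partition_match = a.partition_ids() == b.partition_ids()
print(rows_match, partition_match)
```

In this scenario the row that was actually read is in sync on both
replicas, but the partition-level id sets differ because of the unrelated
append to t2, so a partition-level comparison would report a mismatch.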

On Wed, 8 Jan 2025 at 21:47, Jon Haddad <j...@rustyrazorblade.com> wrote:

> > It's true that we can't offer multi-page write atomicity without some
> sort of MVCC. There are a lot of common query patterns that don't involve
> paging though, so it's not like the benefit of fixing write atomicity would
> only apply to a small subset of carefully crafted queries or something.
>
> Sure, it'll work a lot, but we don't say "partition level write atomicity
> some of the time".  We say guarantee.  From the CEP:
>
> > In the case of read repair, since we are only reading and correcting
> the parts of a partition that we're reading and not the entire contents of
> a partition on each read, read repair can break our *guarantee* on
> partition level write atomicity. This approach also prevents meeting the
> monotonic read requirement for witness replicas, which has significantly
> limited its usefulness.
>
> I point this out because it's not well known, and we make a guarantee that
> isn't true, and while the CEP will reduce the number of cases in which we
> violate the guarantee, we will still have known edge cases where it doesn't
> hold up.  So we should stop saying it.
>
>
>
>
> On Wed, Jan 8, 2025 at 1:30 PM Blake Eggleston <beggles...@apple.com>
> wrote:
>
>> Thanks Dmitry and Jon, answers below
>>
>> 1) Is a single separate commit log expected to be created for all tables
>> with the new replication type?
>>
>>
>> The plan is to still have a single commit log, but only index mutations
>> with a mutation id.
>>
>> 2) What is the granularity of storing mutation ids in memtable, is it per
>> cell?
>>
>>
>> It would be per-partition
>>
>> 3) If we update the same row multiple times while it is in a memtable -
>> are all mutation ids appended to a kind of collection?
>>
>>
>> They would, yes. We might be able to do something where we stop tracking
>> mutations that have been superseded by newer mutations (same cells, higher
>> timestamps), but I suspect that would be more trouble than it's worth and
>> would be out of scope for v1.
>>
>> 4) What is the expected size of a single id?
>>
>>
>> It's currently 12 bytes: a 4-byte node id (from TCM) and an 8-byte HLC
>>
>> 5) Do we plan to support multi-table batches (single or multi-partition)
>> for this replication type?
>>
>>
>> This is intended to support all existing features, however the tracking
>> only happens at the mutation level, so the different mutations coming out
>> of a multi-partition batch would all be tracked individually
>>
>> So even without repair mucking things up, we're unable to fulfill this
>> promise except under the specific, ideal circumstance of querying a
>> partition with only 1 page.
>>
>>
>> It's true that we can't offer multi-page write atomicity without some
>> sort of MVCC. There are a lot of common query patterns that don't involve
>> paging though, so it's not like the benefit of fixing write atomicity would
>> only apply to a small subset of carefully crafted queries or something.
>>
>> Thanks,
>>
>> Blake
>>
>> On Jan 8, 2025, at 12:23 PM, Jon Haddad <j...@rustyrazorblade.com> wrote:
>>
>> Very cool!  I'll need to spend some time reading this over.  One thing I
>> did notice is this:
>>
>> > Cassandra promises partition level write atomicity. This means that,
>> although writes are eventually consistent, a given write will either be
>> visible or not visible. You're not supposed to see a partially applied
>> write. However, read repair and short read protection can both "tear"
>> mutations. In the case of read repair, this is because the data resolver
>> only evaluates the data included in the client read. So if your read only
>> covers a portion of a write that didn't reach a quorum, only that portion
>> will be repaired, breaking write atomicity.
>>
>> Unfortunately there are more issues with this than just repair.  Since we
>> lack a consistency mechanism like MVCC while paginating, it's possible to
>> do the following:
>>
>> thread A: reads a partition P with 10K rows, starts by reading the first
>> page
>> thread B: another thread writes a batch to 2 rows in partition P, one on
>> page 1, another on page 2
>> thread A: reads the second page of P which has the mutation.
>>
>> I've worked with users who have been surprised by this behavior, because
>> pagination happens transparently.
>>
>> So even without repair mucking things up, we're unable to fulfill this
>> promise except under the specific, ideal circumstance of querying a
>> partition with only 1 page.
>>
>> Jon
>>
>>
>>
>>
>>
>> On Wed, Jan 8, 2025 at 11:21 AM Blake Eggleston <beggles...@apple.com>
>> wrote:
>>
>>> Hello dev@,
>>>
>>> We'd like to propose CEP-45: Mutation Tracking for adoption by the
>>> community. CEP-45 proposes adding a replication mechanism to track and
>>> reconcile individual mutations, as well as processes to actively reconcile
>>> missing mutations.
>>>
>>> For keyspaces with mutation tracking enabled, the immediate benefits of
>>> this CEP are:
>>> * reduced replication lag with a continuous background reconciliation
>>> process
>>> * eliminate the disk load caused by repair merkle tree calculation
>>> * eliminate repair overstreaming
>>> * reduce disk load of reads on cluster to close to 1/CL
>>> * fix longstanding mutation atomicity issues caused by read repair and
>>> short read protection
>>>
>>> Additionally, although it's outside the scope of this CEP, mutation
>>> tracking would enable:
>>> * completion of witness replicas / transient replication, making the
>>> feature usable for all workloads
>>> * lightweight witness only datacenters
>>>
>>> The CEP is linked here:
>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking,
>>> but please keep the discussion on the dev list.
>>>
>>> Thanks!
>>>
>>> Blake Eggleston
>>>
>>
>>

-- 
Dmitry Konstantinov
