>> 2) What is a granularity of storing mutation ids in memtable, is it per cell?
>>
>> It would be per-partition
I suppose we have a trade-off here: the granularity of this metadata vs. the probability of read repair in some cases. An example: suppose there is a big enough partition (like a time slot) to which we frequently append data, and we sometimes read a single row or a small range of clustering keys from it. By comparing ids at the partition level we may then get a higher chance of mismatch and read repair than with the current logic, where we check for a mismatch only in the fetched data.

On Wed, 8 Jan 2025 at 21:47, Jon Haddad <j...@rustyrazorblade.com> wrote:

> It's true that we can't offer multi-page write atomicity without some
> sort of MVCC. There are a lot of common query patterns that don't involve
> paging though, so it's not like the benefit of fixing write atomicity would
> only apply to a small subset of carefully crafted queries or something.
>
> Sure, it'll work a lot, but we don't say "partition level write atomicity
> some of the time". We say guarantee. From the CEP:
>
> > In the case of read repair, since we are only reading and correcting
> > the parts of a partition that we're reading and not the entire contents of
> > a partition on each read, read repair can break our *guarantee* on
> > partition level write atomicity. This approach also prevents meeting the
> > monotonic read requirement for witness replicas, which has significantly
> > limited its usefulness.
>
> I point this out because it's not well known, and we make a guarantee that
> isn't true, and while the CEP will reduce the number of cases in which we
> violate the guarantee, we will still have known edge cases that it doesn't
> hold up. So we should stop saying it.
>
> On Wed, Jan 8, 2025 at 1:30 PM Blake Eggleston <beggles...@apple.com>
> wrote:
>
>> Thanks Dimitry and Jon, answers below
>>
>> 1) Is a single separate commit log expected to be created for all tables
>> with the new replication type?
>>
>> The plan is to still have a single commit log, but only index mutations
>> with a mutation id.
>>
>> 2) What is a granularity of storing mutation ids in memtable, is it per
>> cell?
>>
>> It would be per-partition
>>
>> 3) If we update the same row multiple times while it is in a memtable -
>> are all mutation ids appended to a kind of collection?
>>
>> They would yes. We might be able to do something where we stop tracking
>> mutations that have been superseded by newer mutations (same cells, higher
>> timestamps), but I suspect that would be more trouble than it's worth and
>> would be out of scope for v1.
>>
>> 4) What is the expected size of a single id?
>>
>> It's currently 12bytes, a 4 byte node id (from tcm), and an 8 byte hlc
>>
>> 5) Do we plan to support multi-table batches (single or multi-partition)
>> for this replication type?
>>
>> This is intended to support all existing features, however the tracking
>> only happens at the mutation level, so the different mutations coming out
>> of a multi-partition batch would all be tracked individually
>>
>> So even without repair mucking things up, we're unable to fulfill this
>> promise except under the specific, ideal circumstance of querying a
>> partition with only 1 page.
>>
>> It's true that we can't offer multi-page write atomicity without some
>> sort of MVCC. There are a lot of common query patterns that don't involve
>> paging though, so it's not like the benefit of fixing write atomicity would
>> only apply to a small subset of carefully crafted queries or something.
>>
>> Thanks,
>>
>> Blake
>>
>> On Jan 8, 2025, at 12:23 PM, Jon Haddad <j...@rustyrazorblade.com> wrote:
>>
>> Very cool! I'll need to spent some time reading this over. One thing I
>> did notice is this:
>>
>> > Cassandra promises partition level write atomicity. This means that,
>> > although writes are eventually consistent, a given write will either be
>> > visible or not visible.
>> > You're not supposed to see a partially applied
>> > write. However, read repair and short read protection can both "tear"
>> > mutations. In the case of read repair, this is because the data resolver
>> > only evaluates the data included in the client read. So if your read only
>> > covers a portion of a write that didn't reach a quorum, only that portion
>> > will be repaired, breaking write atomicity.
>>
>> Unfortunately there's more issues with this than just repair. Since we
>> lack a consistency mechanism like MVCC while paginating, it's possible to
>> do the following:
>>
>> thread A: reads a partition P with 10K rows, starts by reading the first
>> page
>> thread B: another thread writes a batch to 2 rows in partition P, one on
>> page 1, another on page 2
>> thread A: reads the second page of P which has the mutation.
>>
>> I've worked with users who have been surprised by this behavior, because
>> pagination happens transparently.
>>
>> So even without repair mucking things up, we're unable to fulfill this
>> promise except under the specific, ideal circumstance of querying a
>> partition with only 1 page.
>>
>> Jon
>>
>> On Wed, Jan 8, 2025 at 11:21 AM Blake Eggleston <beggles...@apple.com>
>> wrote:
>>
>>> Hello dev@,
>>>
>>> We'd like to propose CEP-45: Mutation Tracking for adoption by the
>>> community. CEP-45 proposes adding a replication mechanism to track and
>>> reconcile individual mutations, as well as processes to actively reconcile
>>> missing mutations.
>>>
>>> For keyspaces with mutation tracking enabled, the immediate benefits of
>>> this CEP are:
>>> * reduced replication lag with a continuous background reconciliation
>>>   process
>>> * eliminate the disk load caused by repair merkle tree calculation
>>> * eliminate repair overstreaming
>>> * reduce disk load of reads on cluster to close to 1/CL
>>> * fix longstanding mutation atomicity issues caused by read repair and
>>>   short read protection
>>>
>>> Additionally, although it's outside the scope of this CEP, mutation
>>> tracking would enable:
>>> * completion of witness replicas / transient replication, making the
>>>   feature usable for all workloads
>>> * lightweight witness only datacenters
>>>
>>> The CEP is linked here:
>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking,
>>> but please keep the discussion on the dev list.
>>>
>>> Thanks!
>>>
>>> Blake Eggleston

--
Dmitry Konstantinov