Maybe, I’m not really familiar enough with how “classic” MV repair works to say. You can’t mix normal repair and mutation reconciliation in the current incarnation of mutation tracking though, so I wouldn’t assume it would work with MVs.
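To make the blocking-reconciliation idea discussed below concrete, here is a minimal Python sketch. Everything in it is hypothetical (the names and structure are illustrative, not Cassandra's actual mutation tracking API): a mutation id is only marked reconciled once the mutation has been fully replicated to the base table and its derived view updates have been fully replicated to the views.

```python
# Hypothetical sketch of the blocking rule described in the thread:
# a mutation id may only be marked reconciled (and dropped from the
# log) once BOTH the base replicas and the view replicas have acked it.
class MutationLog:
    def __init__(self):
        self.base_acked = set()   # ids fully replicated to base replicas
        self.view_acked = set()   # ids whose view updates are fully replicated
        self.reconciled = set()

    def ack_base(self, mutation_id):
        self.base_acked.add(mutation_id)
        self._try_reconcile(mutation_id)

    def ack_views(self, mutation_id):
        self.view_acked.add(mutation_id)
        self._try_reconcile(mutation_id)

    def _try_reconcile(self, mutation_id):
        # Blocking rule: both conditions must hold before the id is
        # considered reconciled by the log reconciliation process.
        if mutation_id in self.base_acked and mutation_id in self.view_acked:
            self.reconciled.add(mutation_id)

log = MutationLog()
log.ack_base("m1")
assert "m1" not in log.reconciled   # view updates not yet replicated
log.ack_views("m1")
assert "m1" in log.reconciled       # now safe to reconcile
```

Under this rule, an unreconciled id is exactly an update whose view propagation is still in doubt, which is why view repair could skip everything already reconciled.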
On Wed, May 14, 2025, at 11:29 AM, Jon Haddad wrote:
> In the case of bitrot / losing an SSTable, wouldn't a normal repair (just the MV against the other nodes) resolve the issue?
>
> On Wed, May 14, 2025 at 11:27 AM Blake Eggleston <bl...@ultrablake.com> wrote:
>> Mutation tracking is definitely an approach you could take for MVs. Mutation reconciliation could be extended to ensure all changes have been replicated to the views. When a base table receives a mutation with an id, it would generate a view update. If you block marking a given mutation id as reconciled until it has been fully replicated to the base table and its view updates have been fully replicated to the views, then all view updates will eventually be applied as part of the log reconciliation process.
>>
>> A mutation tracking implementation would also allow you to be more flexible with the types of consistency levels you can work with, allowing users to do things like use LOCAL_QUORUM without leaving themselves open to introducing view inconsistencies.
>>
>> That would more or less eliminate the need for any MV repair in normal usage, but it wouldn't address how to repair issues caused by bugs or data loss, though you may be able to do something with comparing the latest mutation ids for the base table and its view ranges.
>>
>> On Wed, May 14, 2025, at 10:19 AM, Paulo Motta wrote:
>>> I don't see mutation tracking [1] mentioned in this thread or in the CEP-48 description. Not sure this would fit into the scope of this initial CEP, but I have a feeling that mutation tracking could be potentially helpful to reconcile base tables and views?
>>>
>>> For example, when both base and view updates are acknowledged, could this be somehow persisted in the view sstables' mutation tracking summary [2] or similar metadata?
>>> Then these updates would be skipped during view repair, considerably reducing the amount of work needed, since only un-acknowledged view updates would need to be reconciled.
>>>
>>> [1] - https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking
>>> [2] - https://issues.apache.org/jira/browse/CASSANDRA-20336
>>>
>>> On Wed, May 14, 2025 at 12:59 PM Paulo Motta <pauloricard...@gmail.com> wrote:
>>>> > - The first thing I notice is that we're talking about repairing the entire table across the entire cluster all in one go. It's been a *long* time since I tried to do a full repair of an entire table without using sub-ranges. Is anyone here even doing that with clusters of non-trivial size? How long does a full repair of a 100 node cluster with 5TB / node take even in the best case scenario?
>>>>
>>>> I haven't checked the CEP yet so I may be missing something, but I think this effort doesn't need to be conflated with dense node support, to make this more approachable. I think prospective users would be OK with overprovisioning to make this feasible if needed. We could perhaps have size guardrails that limit the maximum table size per node when MVs are enabled. Ideally we should make it work for dense nodes if possible, but this shouldn't be a reason not to support the feature if it can be made to work reasonably with more resources.
>>>>
>>>> I think the main issue with the current MV is about correctness, and the ultimate goal of the CEP must be to provide correctness guarantees, even if it has an inevitable performance hit.
>>>> I think that the performance of the repair process is definitely an important consideration, and it would be helpful to have some benchmarks to get an idea of how long this repair process would take for lightweight and denser tables.
>>>>
>>>> On Wed, May 14, 2025 at 7:28 AM Jon Haddad <j...@rustyrazorblade.com> wrote:
>>>>> I've got several concerns around this repair process.
>>>>>
>>>>> - The first thing I notice is that we're talking about repairing the entire table across the entire cluster all in one go. It's been a *long* time since I tried to do a full repair of an entire table without using sub-ranges. Is anyone here even doing that with clusters of non-trivial size? How long does a full repair of a 100 node cluster with 5TB / node take even in the best case scenario?
>>>>>
>>>>> - Even in a scenario where sub-range repair is supported, you'd have to scan *every* sstable on the base table in order to construct the merkle tree, as we don't know in advance which SSTables contain the ranges that the MV will cover. That means a subrange repair would have to do a *ton* of IO. Anyone who's mis-configured a sub-range incremental repair to use too many ranges will probably be familiar with how long it can take to anti-compact a bunch of SSTables. With MV sub-range repair, we'd have even more overhead, because we'd have to read in every SSTable, every time. If we do 10 subranges, we'll do 10x the IO of a normal repair. I don't think this is practical.
>>>>>
>>>>> - Merkle trees make sense when you're comparing tables with the same partition key, but I don't think they do when you're transforming a base table to a view. When there's a mis-match, what's transferred? We have a range of data in the MV, but now we have to go find that from the base table.
>>>>> That means the merkle tree needs to track not just the hashes and ranges, but the original keys it was transformed from, in order to go find all of the matching partitions in that mis-matched range. Either that, or we end up rescanning the entire dataset in order to find the mismatches.
>>>>>
>>>>> Jon
>>>>>
>>>>> On Tue, May 13, 2025 at 10:29 AM Runtian Liu <curly...@gmail.com> wrote:
>>>>>> > Looking at the details of the CEP it seems to describe Paxos as PaxosV1, but PaxosV2 works slightly differently (it can read during the prepare phase). I assume that supporting Paxos means supporting both V1 and V2 for materialized views?
>>>>>>
>>>>>> We are going to support Paxos V2. The CEP is not clear on that; we will add a clarification.
>>>>>>
>>>>>> It looks like the online portion is now fairly well understood. For the offline repair part, I see two main concerns: one around the scalability of the proposed approach, and another regarding how it handles tombstones.
>>>>>>
>>>>>> *Scalability:* I have added a section to the CEP (https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-48%3A+First-Class+Materialized+View+Support#CEP48:FirstClassMaterializedViewSupport-MVRepairVSFullRepairwithanExample) with an example comparing full repair and the proposed MV repair; overall scalability should not be a problem.
>>>>>>
>>>>>> Consider a dataset with tokens from 1 to 4 and a cluster of 4 nodes, where each node owns one token. The base table uses (pk, ck) as its primary key, while the materialized view (MV) uses (ck, pk) as its primary key. Both tables include a value column v, which allows us to correlate rows between them.
>>>>>> The dataset consists of 16 records, distributed as follows:
>>>>>>
>>>>>> *Base table* (pk, ck, v)
>>>>>> (1, 1, 1), (1, 2, 2), (1, 3, 3), (1, 4, 4)     // N1
>>>>>> (2, 1, 5), (2, 2, 6), (2, 3, 7), (2, 4, 8)     // N2
>>>>>> (3, 1, 9), (3, 2, 10), (3, 3, 11), (3, 4, 12)  // N3
>>>>>> (4, 1, 13), (4, 2, 14), (4, 3, 15), (4, 4, 16) // N4
>>>>>>
>>>>>> *Materialized view* (ck, pk, v)
>>>>>> (1, 1, 1), (1, 2, 5), (1, 3, 9), (1, 4, 13)    // N1
>>>>>> (2, 1, 2), (2, 2, 6), (2, 3, 10), (2, 4, 14)   // N2
>>>>>> (3, 1, 3), (3, 2, 7), (3, 3, 11), (3, 4, 15)   // N3
>>>>>> (4, 1, 4), (4, 2, 8), (4, 3, 12), (4, 4, 16)   // N4
>>>>>>
>>>>>> The chart below compares one round of full repair with one round of MV repair. As shown, both scan the same total number of rows. However, MV repair has higher time complexity because its Merkle tree building does more work per row. To avoid all nodes scanning the entire table simultaneously, MV repair should use a snapshot-based approach, similar to normal repair with the `--sequential` option. The time complexity increase compared to full repair can be found in the "Complexity and Memory Management" section.
>>>>>>
>>>>>> n: number of rows
>>>>>> d: depth of one Merkle tree for MV repair
>>>>>> d': depth of one Merkle tree for full repair
>>>>>> r: number of split ranges
>>>>>>
>>>>>> Assuming each leaf node covers the same number of rows, 2^d' = (2^d) * r.
>>>>>>
>>>>>> We can see that the space complexity is the same, while MV repair has higher time complexity. However, this should not pose a significant issue in production, as the Merkle tree depth and the number of split ranges are typically not large.
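The worked example above can be sanity-checked with a short Python sketch (illustrative only; the value formula `v = (pk-1)*4 + ck` is just how the 16 example values were laid out): the MV rows are exactly the base rows re-keyed by (ck, pk), so each MV partition gathers one row from every base partition.

```python
# Rebuild the 16-row base table from the thread's example:
# pk 1..4, ck 1..4, v numbered row-major as in the listing.
base = [(pk, ck, (pk - 1) * 4 + ck) for pk in range(1, 5) for ck in range(1, 5)]

# The MV simply re-keys each base row (pk, ck, v) as (ck, pk, v).
mv = sorted((ck, pk, v) for pk, ck, v in base)

# The MV partition for ck=1 matches the N1 line of the MV listing:
# it pulls one row from each of the four base partitions.
assert [row for row in mv if row[0] == 1] == [(1, 1, 1), (1, 2, 5), (1, 3, 9), (1, 4, 13)]

# Both tables hold the same 16 rows, which is why one round of either
# repair scans the same total number of rows.
assert len(base) == len(mv) == 16
```

This is also the crux of the repair-cost discussion: because a single MV partition spans all base partitions, a mismatch in one MV range can implicate data owned by every node.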
>>>>>>
>>>>>> 1 Round Merkle Tree Building Complexity:
>>>>>>
>>>>>>                  | Full Repair | MV Repair
>>>>>> Time complexity  | O(n)        | O(n*d*log(r))
>>>>>> Space complexity | O((2^d')*r) | O((2^d)*r^2) = O((2^d')*r)
>>>>>>
>>>>>> *Tombstone:*
>>>>>>
>>>>>> The current proposal focuses on rebuilding the MV for a granular token range where a mismatch is detected, rather than rebuilding the entire MV token range. Since the MV is treated as a regular table, standard full or incremental repair processes should still apply to both the base and MV tables to keep their replicas in sync.
>>>>>>
>>>>>> Regarding tombstones, if we introduce special tombstone types or handling mechanisms for the MV table, we may be able to support tombstone synchronization between the base table and the MV. I plan to spend more time exploring whether we can introduce changes to the base table that enable this synchronization.
>>>>>>
>>>>>> On Mon, May 12, 2025 at 11:35 AM Jaydeep Chovatia <chovatia.jayd...@gmail.com> wrote:
>>>>>>> > Like something doesn't add up here because if it always includes the base table's primary key columns that means
>>>>>>>
>>>>>>> The requirement for materialized views (MVs) to include the base table's primary key appears to be primarily a syntactic constraint specific to Apache Cassandra. For instance, in DynamoDB, the DDL for defining a Global Secondary Index does not mandate inclusion of the base table's primary key. This suggests that the syntax requirement in Cassandra could potentially be relaxed in the future (outside the scope of this CEP). As Benedict noted, the base table's primary key is optional when querying a materialized view.
>>>>>>> Jaydeep
>>>>>>>
>>>>>>> On Mon, May 12, 2025 at 10:45 AM Jon Haddad <j...@rustyrazorblade.com> wrote:
>>>>>>>> > Or compaction hasn't made a mistake, or cell merge reconciliation hasn't made a mistake, or volume bitrot hasn't caused you to lose a file. Repair isn't just about "have all transaction commits landed". It's "is the data correct N days after it's written".
>>>>>>>>
>>>>>>>> Don't forget about restoring from a backup.
>>>>>>>>
>>>>>>>> Is there a way we could do some sort of hybrid compaction + incremental repair? Maybe have the MV verify its view while it's compacting, and when it's done, mark the view's SSTable as repaired? Then the repair process would only need to do an MV to MV repair.
>>>>>>>>
>>>>>>>> Jon
>>>>>>>>
>>>>>>>> On Mon, May 12, 2025 at 9:37 AM Benedict Elliott Smith <bened...@apache.org> wrote:
>>>>>>>>>> Like something doesn't add up here because if it always includes the base table's primary key columns that means they could be storage attached by just forbidding additional columns and there doesn't seem to be much utility in including additional columns in the primary key?
>>>>>>>>>
>>>>>>>>> You can re-order the keys, and they only need to be a part of the primary key, not the partition key. I think you can specify an arbitrary order to the keys also, so you can change the effective sort order. So, the basic idea is you stipulate something like PRIMARY KEY ((v1),(ck1,pk1)).
>>>>>>>>>
>>>>>>>>> This is basically a global index, with the restriction on single columns as keys only because we cannot cheaply read-before-write for eventually consistent operations. This restriction can easily be relaxed for Paxos and Accord based implementations, which can also safely include additional keys.
>>>>>>>>> That said, I am not at all sure why they are called materialised views if we don't support including any other data besides the lookup column and the primary key. We should really rename them once they work, both to make some sense and to break with the historical baggage.
>>>>>>>>>
>>>>>>>>>> I think this can be represented as a tombstone which can always be fetched from the base table on read or maybe some other arrangement? I agree it can't feasibly be represented as an enumeration of the deletions, at least not synchronously, and doing it async has its own problems.
>>>>>>>>>
>>>>>>>>> If the base table must be read on read of an index/view, then I think this proposal is approximately linearizable for the view as well (though I do not at all warrant this statement). You still need to propagate this eventually so that the views can clean up. This also makes reads 2RT on read, which is rather costly.
>>>>>>>>>
>>>>>>>>>> On 12 May 2025, at 16:10, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I think it's worth taking a step back and looking at the current MV restrictions, which are pretty onerous.
>>>>>>>>>>
>>>>>>>>>> A view must have a primary key, and that primary key must conform to the following restrictions:
>>>>>>>>>> • it must contain all the primary key columns of the base table. This ensures that every row of the view corresponds to exactly one row of the base table.
>>>>>>>>>> • it can only contain a single column that is not a primary key column in the base table.
>>>>>>>>>> At that point, what exactly is the value in including anything except the original primary key in the MV's primary key columns, unless you are using an ordered partitioner so you can iterate based on the leading primary key columns?
>>>>>>>>>>
>>>>>>>>>> Like something doesn't add up here, because if it always includes the base table's primary key columns, that means they could be storage attached by just forbidding additional columns, and there doesn't seem to be much utility in including additional columns in the primary key?
>>>>>>>>>>
>>>>>>>>>> I'm not that clear on how much better it is to look something up in the MV vs just looking at the base table or some non-materialized view of it. How exactly are these MVs supposed to be used, and what value do they provide?
>>>>>>>>>>
>>>>>>>>>> Jeff Jirsa wrote:
>>>>>>>>>>> There's 2 things in this proposal that give me a lot of pause.
>>>>>>>>>>
>>>>>>>>>> Runtian Liu pointed out that the CEP is sort of divided into two parts. The first is the online part, which is making reads/writes to MVs safer and more reliable using a transaction system. The second is offline, which is repair.
>>>>>>>>>>
>>>>>>>>>> The story for the online portion I think is quite strong and worth considering on its own merits.
>>>>>>>>>>
>>>>>>>>>> The offline portion (repair) sounds a little less feasible to run in production, but I also think that MVs without any mechanism for checking their consistency are not viable to run in production. So it's kind of pay for what you use in terms of the feature?
>>>>>>>>>>
>>>>>>>>>> It's definitely worth thinking through whether there is a way to fix one side of this equation so it works better.
>>>>>>>>>>
>>>>>>>>>> David Capwell wrote:
>>>>>>>>>>> As far as I can tell, being based off Accord means you don't need to care about repair, as Accord will manage the consistency for you; you can't get out of sync.
>>>>>>>>>>
>>>>>>>>>> I think a baseline requirement in C* for something to be in production is to be able to run preview repair and validate that the transaction system or any other part of Cassandra hasn't made a mistake. Divergence can have many sources, including Accord.
>>>>>>>>>>
>>>>>>>>>> Runtian Liu wrote:
>>>>>>>>>>> For the example David mentioned, LWT cannot support it. Since LWTs operate on a single token, we'll need to restrict base-table updates to one partition—and ideally one row—at a time. A current MV base-table command can delete an entire partition, but doing so might touch hundreds of MV partitions, making consistency guarantees impossible.
>>>>>>>>>>
>>>>>>>>>> I think this can be represented as a tombstone which can always be fetched from the base table on read or maybe some other arrangement? I agree it can't feasibly be represented as an enumeration of the deletions, at least not synchronously, and doing it async has its own problems.
>>>>>>>>>>
>>>>>>>>>> Ariel
>>>>>>>>>>
>>>>>>>>>> On Fri, May 9, 2025, at 4:03 PM, Jeff Jirsa wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On May 9, 2025, at 12:59 PM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> I am a *big* fan of getting repair really working with MVs. It does seem problematic that the number of merkle trees will be equal to the number of ranges in the cluster, and repair of MVs would become an all-node operation.
>>>>>>>>>>>> How would down nodes be handled, and how many nodes would simultaneously be working to validate a given base table range at once? How many base table ranges could simultaneously be repairing MVs?
>>>>>>>>>>>>
>>>>>>>>>>>> If a row containing a column that creates an MV partition is deleted, and the MV isn't updated, then how does the merkle tree approach propagate the deletion to the MV? The CEP says that anti-compaction would remove extra rows, but I am not clear on how that works. When is anti-compaction performed in the repair process, and what is/isn't included in the outputs?
>>>>>>>>>>>
>>>>>>>>>>> I thought about these two points last night after I sent my email.
>>>>>>>>>>>
>>>>>>>>>>> There are 2 things in this proposal that give me a lot of pause.
>>>>>>>>>>>
>>>>>>>>>>> One is the lack of tombstones / deletions in the merkle trees, which makes properly dealing with writes/deletes/inconsistency very hard (afaict).
>>>>>>>>>>>
>>>>>>>>>>> The second is the reality that repairing a single partition in the base table may repair all hosts/ranges in the MV table, and vice versa. Basically, scanning either base or MV is effectively scanning the whole cluster (modulo what you can avoid in the clean/dirty repaired sets). This makes me really, really concerned with how it scales, and how likely it is to be schedulable automatically without blowing up.
>>>>>>>>>>>
>>>>>>>>>>> The paxos vs accord comments so far are interesting in that I think both could be made to work, but I am very concerned about how the merkle tree comparisons are likely to work with wide partitions leading to massive fanout in ranges.
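[Editor's note] The concern raised earlier in the thread, that an MV merkle tree must carry the originating base keys to avoid rescanning the whole base table on a mismatch, can be sketched in a few lines of Python. This is a hypothetical structure for illustration, not the CEP's proposed format.

```python
# Sketch: a merkle leaf for one MV token range that stores, besides the
# hash, the set of base partition keys its rows were transformed from.
# On a mismatch, only those base partitions need re-reading, rather than
# rescanning the entire base dataset.
import hashlib

def leaf_for_mv_range(mv_rows):
    """mv_rows: iterable of (mv_key, base_key, value) tuples in one MV range."""
    digest = hashlib.sha256()
    base_keys = set()
    for mv_key, base_key, value in sorted(mv_rows):
        digest.update(repr((mv_key, base_key, value)).encode())
        base_keys.add(base_key)
    return digest.hexdigest(), base_keys

replica_a = [(1, 1, 1), (1, 2, 5), (1, 3, 9)]
replica_b = [(1, 1, 1), (1, 2, 5)]            # one view row missing

hash_a, keys_a = leaf_for_mv_range(replica_a)
hash_b, keys_b = leaf_for_mv_range(replica_b)
assert hash_a != hash_b                        # mismatch detected
assert keys_a - keys_b == {3}                  # only base partition 3 needs re-reading
```

The trade-off the thread identifies is visible here: the per-leaf key payload is what makes targeted resolution possible, but it also grows with the number of contributing base partitions, which is exactly the wide-partition fanout worry.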