Maybe, I’m not really familiar enough with how “classic” MV repair works to 
say. You can’t mix normal repair and mutation reconciliation in the current 
incarnation of mutation tracking though, so I wouldn’t assume it would work 
with MVs.

On Wed, May 14, 2025, at 11:29 AM, Jon Haddad wrote:
> In the case of bitrot / losing an SSTable, wouldn't a normal repair (just the 
> MV against the other nodes) resolve the issue?
> 
> On Wed, May 14, 2025 at 11:27 AM Blake Eggleston <bl...@ultrablake.com> wrote:
>> Mutation tracking is definitely an approach you could take for MVs. Mutation 
>> reconciliation could be extended to ensure all changes have been replicated 
>> to the views. When a base table received a mutation w/ an id it would 
>> generate a view update. If you block marking a given mutation id as 
>> reconciled until it’s been fully replicated to the base table and its view 
>> updates have been fully replicated to the views, then all view updates will 
>> eventually be applied as part of the log reconciliation process.
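To make the gating concrete, here is a minimal illustrative sketch (plain Python, not Cassandra code; all names are invented) of blocking a mutation id's "reconciled" status until both the base table and its views are fully replicated:

```python
# Illustrative sketch only: a mutation id is marked reconciled solely when
# the base mutation AND its derived view updates are fully replicated.
class MutationTracker:
    def __init__(self, base_replicas, view_replicas):
        self.base_replicas = set(base_replicas)
        self.view_replicas = set(view_replicas)
        self.base_acks = {}  # mutation id -> replicas that applied the base write
        self.view_acks = {}  # mutation id -> replicas that applied the view update

    def ack_base(self, mutation_id, replica):
        self.base_acks.setdefault(mutation_id, set()).add(replica)

    def ack_view(self, mutation_id, replica):
        self.view_acks.setdefault(mutation_id, set()).add(replica)

    def is_reconciled(self, mutation_id):
        # Blocked until fully replicated to the base table *and* its views.
        return (self.base_acks.get(mutation_id, set()) >= self.base_replicas
                and self.view_acks.get(mutation_id, set()) >= self.view_replicas)
```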
>> 
>> A mutation tracking implementation would also allow you to be more flexible 
>> with the types of consistency levels you can work with, allowing users to do 
>> things like use LOCAL_QUORUM without leaving themselves open to introducing 
>> view inconsistencies.
>> 
>> That would more or less eliminate the need for any MV repair in normal 
>> usage, but wouldn't address how to repair issues caused by bugs or data 
>> loss, though you may be able to do something with comparing the latest 
>> mutation ids for the base table and its view ranges.
>> 
>> On Wed, May 14, 2025, at 10:19 AM, Paulo Motta wrote:
>>> I don't see mutation tracking [1] mentioned in this thread or in the CEP-48 
>>> description. Not sure this would fit into the scope of this initial CEP, 
>>> but I have a feeling that mutation tracking could be potentially helpful to 
>>> reconcile base tables and views ?
>>> 
>>> For example, when both base and view updates are acknowledged then this 
>>> could be somehow persisted in the view sstables mutation tracking 
>>> summary[2] or similar metadata ? Then these updates would be skipped during 
>>> view repair, considerably reducing the amount of work needed, since only 
>>> un-acknowledged view updates would need to be reconciled.
>>> 
>>> [1] - https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking
>>> [2] - https://issues.apache.org/jira/browse/CASSANDRA-20336
>>> 
>>> On Wed, May 14, 2025 at 12:59 PM Paulo Motta <pauloricard...@gmail.com> 
>>> wrote:
>>>> > - The first thing I notice is that we're talking about repairing the 
>>>> > entire table across the entire cluster all in one go.  It's been a 
>>>> > *long* time since I tried to do a full repair of an entire table without 
>>>> > using sub-ranges.  Is anyone here even doing that with clusters of 
>>>> > non-trivial size?  How long does a full repair of a 100 node cluster 
>>>> > with 5TB / node take even in the best case scenario?
>>>> 
>>>> I haven't checked the CEP yet so I may be missing something, but I 
>>>> think this effort doesn't need to be conflated with dense node support, to 
>>>> make this more approachable. I think prospective users would be OK with 
>>>> overprovisioning to make this feasible if needed. We could perhaps have 
>>>> size guardrails that limit the maximum table size per node when MVs are 
>>>> enabled. Ideally we should make it work for dense nodes if possible, but 
>>>> this shouldn't be a reason not to support the feature if it can be made to 
>>>> work reasonably with more resources.
>>>> 
>>>> I think the main issue with the current MV is about correctness, and the 
>>>> ultimate goal of the CEP must be to provide correctness guarantees, even 
>>>> if it has an inevitable performance hit. I think that the performance of 
>>>> the repair process is definitely an important consideration and it would 
>>>> be helpful to have some benchmarks to have an idea of how long this repair 
>>>> process would take for lightweight and denser tables.
>>>> 
>>>> On Wed, May 14, 2025 at 7:28 AM Jon Haddad <j...@rustyrazorblade.com> 
>>>> wrote:
>>>>> I've got several concerns around this repair process.
>>>>> 
>>>>> - The first thing I notice is that we're talking about repairing the 
>>>>> entire table across the entire cluster all in one go.  It's been a *long* 
>>>>> time since I tried to do a full repair of an entire table without using 
>>>>> sub-ranges.  Is anyone here even doing that with clusters of non trivial 
>>>>> size?  How long does a full repair of a 100 node cluster with 5TB / node 
>>>>> take even in the best case scenario?
>>>>> 
>>>>> - Even in a scenario where sub-range repair is supported, you'd have to 
>>>>> scan *every* sstable on the base table in order to construct a merkle 
>>>>> tree, as we don't know in advance which SSTables contain the ranges that 
>>>>> the MV will need.  That means a subrange repair would have to do a *ton* of 
>>>>> IO.  Anyone who's mis-configured a sub-range incremental repair to use 
>>>>> too many ranges will probably be familiar with how long it can take to 
>>>>> anti-compact a bunch of SSTables.  With MV sub-range repair, we'd have 
>>>>> even more overhead, because we'd have to read in every SSTable, every 
>>>>> time.  If we do 10 subranges, we'll do 10x the IO of a normal repair.  I 
>>>>> don't think this is practical.
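The 10x figure follows directly from the rescan requirement; a back-of-envelope sketch (illustrative Python, invented names, not a claim about actual repair internals):

```python
# Back-of-envelope model of the concern above: if MV-relevant rows can live
# in any base SSTable, every subrange pass must rescan all of them, so total
# IO grows linearly with the number of subranges.
def mv_subrange_repair_io(total_sstable_bytes, num_subranges):
    # Each subrange pass reads every base SSTable in full.
    return total_sstable_bytes * num_subranges

def normal_repair_io(total_sstable_bytes):
    # Ordinary full repair reads each SSTable roughly once overall.
    return total_sstable_bytes
```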
>>>>> 
>>>>> - Merkle trees make sense when you're comparing tables with the same 
>>>>> partition key, but I don't think they do when you're transforming a base 
>>>>> table to a view.  When there's a mis-match, what's transferred?  We have 
>>>>> a range of data in the MV, but now we have to go find that from the base 
>>>>> table.  That means the merkle tree needs to not just track the hashes and 
>>>>> ranges, but the original keys it was transformed from, in order to go 
>>>>> find all of the matching partitions in that mis-matched range.  Either 
>>>>> that or we end up rescanning the entire dataset in order to find the 
>>>>> mismatches. 
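A toy sketch of that extra bookkeeping (illustrative Python, invented names; the real merkle tree implementation works differently): each MV leaf would have to carry the base partition keys that fed it, so a mismatched range can be re-read from the base table:

```python
# Illustrative only: MV merkle leaves that record, alongside the digest,
# which base-table partition keys contributed rows to that leaf. On a
# mismatch, those keys tell us what to re-read from the base table.
import hashlib

def build_mv_leaves(mv_rows, num_leaves):
    """mv_rows: list of ((ck, pk), v) view rows. Returns, per leaf, the
    digest plus the set of base partition keys (pk) that fed it."""
    leaves = [{"hash": hashlib.sha256(), "base_keys": set()} for _ in range(num_leaves)]
    for (ck, pk), v in sorted(mv_rows):
        leaf = leaves[hash((ck,)) % num_leaves]  # toy token -> leaf mapping
        leaf["hash"].update(repr(((ck, pk), v)).encode())
        leaf["base_keys"].add(pk)  # bookkeeping a normal merkle tree doesn't need
    return [(l["hash"].hexdigest(), l["base_keys"]) for l in leaves]
```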
>>>>> 
>>>>> Jon
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Tue, May 13, 2025 at 10:29 AM Runtian Liu <curly...@gmail.com> wrote:
>>>>>> > Looking at the details of the CEP it seems to describe Paxos as 
>>>>>> > PaxosV1, but PaxosV2 works slightly differently (it can read during 
>>>>>> > the prepare phase). I assume that supporting Paxos means supporting 
>>>>>> > both V1 and V2 for materialized views?
>>>>>> We are going to support Paxos V2. The CEP is not clear on that; we 
>>>>>> will update it to clarify.
>>>>>> 
>>>>>> It looks like the online portion is now fairly well understood.  For the 
>>>>>> offline repair part, I see two main concerns: one around the scalability 
>>>>>> of the proposed approach, and another regarding how it handles 
>>>>>> tombstones.
>>>>>> 
>>>>>> *Scalability:*
>>>>>> I have added a section 
>>>>>> <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-48%3A+First-Class+Materialized+View+Support#CEP48:FirstClassMaterializedViewSupport-MVRepairVSFullRepairwithanExample> 
>>>>>> in the CEP with an example to compare full repair and the proposed MV 
>>>>>> repair; the overall scalability should not be a problem.
>>>>>> 
>>>>>> Consider a dataset with tokens from 1 to 4 and a cluster of 4 nodes, 
>>>>>> where each node owns one token. The base table uses (pk, ck) as its 
>>>>>> primary key, while the materialized view (MV) uses (ck, pk) as its 
>>>>>> primary key. Both tables include a value column v, which allows us to 
>>>>>> correlate rows between them. The dataset consists of 16 records, 
>>>>>> distributed as follows:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> *Base table*
>>>>>> (pk, ck, v)
>>>>>> (1, 1, 1), (1, 2, 2), (1, 3, 3), (1, 4, 4)  // N1
>>>>>> (2, 1, 5), (2, 2, 6), (2, 3, 7), (2, 4, 8)  // N2
>>>>>> (3, 1, 9), (3, 2, 10), (3, 3, 11), (3, 4, 12)  // N3
>>>>>> (4, 1, 13), (4, 2, 14), (4, 3, 15), (4, 4, 16)  // N4
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> *Materialized view*
>>>>>> (ck, pk, v)
>>>>>> (1, 1, 1), (1, 2, 5), (1, 3, 9), (1, 4, 13)  // N1
>>>>>> (2, 1, 2), (2, 2, 6), (2, 3, 10), (2, 4, 14)  // N2
>>>>>> (3, 1, 3), (3, 2, 7), (3, 3, 11), (3, 4, 15)  // N3
>>>>>> (4, 1, 4), (4, 2, 8), (4, 3, 12), (4, 4, 16)  // N4
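The 16-row example above can be reconstructed mechanically (illustrative Python; the token -> node mapping "node Ni owns token i" is assumed from the example). It also shows why repairing one base partition touches every MV node: pk=1's four rows land on four different view tokens:

```python
# Toy reconstruction of the 16-row dataset above. Base rows are keyed by
# (pk, ck) with v = 4*(pk-1) + ck; the view re-keys each row by (ck, pk),
# so a row's owning view node is determined by ck.
base = {(pk, ck): 4 * (pk - 1) + ck for pk in range(1, 5) for ck in range(1, 5)}

view = {(ck, pk): v for (pk, ck), v in base.items()}

# Every ck value (hence every view node) receives a row from partition pk=1.
nodes_touched_by_pk1 = {ck for (ck, pk) in view if pk == 1}
```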
>>>>>> 
>>>>>> 
>>>>>> The chart below compares one round of full repair with one round of MV 
>>>>>> repair. As shown, both scan the same total number of rows. However, MV 
>>>>>> repair has higher time complexity because its Merkle tree processes each 
>>>>>> row more intensively. To avoid all nodes scanning the entire table 
>>>>>> simultaneously, MV repair should use a snapshot-based approach, similar 
>>>>>> to normal repair with the `--sequential` option. The time complexity 
>>>>>> increase compared to full repair can be found in the "Complexity and 
>>>>>> Memory Management" section.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> n: number of rows
>>>>>> 
>>>>>> d: depth of one Merkle tree for MV repair
>>>>>> 
>>>>>> d': depth of one Merkle tree for full repair
>>>>>> 
>>>>>> r: number of split ranges
>>>>>> 
>>>>>> Assuming each leaf node covers the same number of rows, 2^d' = (2^d) * r. 
>>>>>> 
>>>>>> We can see that the space complexity is the same, while MV repair has 
>>>>>> higher time complexity. However, this should not pose a significant 
>>>>>> issue in production, as the Merkle tree depth and the number of split 
>>>>>> ranges are typically not large.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 1 Round Merkle Tree Building Complexity:
>>>>>> 
>>>>>>                   Full Repair    MV Repair
>>>>>> Time complexity   O(n)           O(n*d*log(r))
>>>>>> Space complexity  O((2^d')*r)    O((2^d)*r^2) = O((2^d')*r)
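The space-complexity equality rests on the leaf-count identity stated above; a quick numeric sanity check (illustrative Python, arbitrary example values):

```python
# Check of the identity behind the table: with each leaf covering the same
# number of rows, 2^d' = (2^d) * r, so the MV-repair space term (2^d)*r^2
# equals the full-repair term (2^d')*r.
d, r = 10, 8                         # illustrative depth and split-range count
full_repair_leaves = (2 ** d) * r    # this is 2^d'
mv_space = (2 ** d) * r ** 2
full_space = full_repair_leaves * r
```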
>>>>>> *Tombstone:*
>>>>>> 
>>>>>> The current proposal focuses on rebuilding the MV for a granular token 
>>>>>> range where a mismatch is detected, rather than rebuilding the entire MV 
>>>>>> token range. Since the MV is treated as a regular table, standard full 
>>>>>> or incremental repair processes should still apply to both the base and 
>>>>>> MV tables to keep their replicas in sync.
>>>>>> 
>>>>>> Regarding tombstones, if we introduce special tombstone types or 
>>>>>> handling mechanisms for the MV table, we may be able to support 
>>>>>> tombstone synchronization between the base table and the MV. I plan to 
>>>>>> spend more time exploring whether we can introduce changes to the base 
>>>>>> table that enable this synchronization.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Mon, May 12, 2025 at 11:35 AM Jaydeep Chovatia 
>>>>>> <chovatia.jayd...@gmail.com> wrote:
>>>>>>> >Like something doesn't add up here because if it always includes the 
>>>>>>> >base table's primary key columns that means
>>>>>>> The requirement for materialized views (MVs) to include the base 
>>>>>>> table's primary key appears to be primarily a syntactic constraint 
>>>>>>> specific to Apache Cassandra. For instance, in DynamoDB, the DDL for 
>>>>>>> defining a Global Secondary Index does not mandate inclusion of the 
>>>>>>> base table's primary key. This suggests that the syntax requirement in 
>>>>>>> Cassandra could potentially be relaxed in the future (outside the scope 
>>>>>>> of this CEP). As Benedict noted, the base table's primary key is 
>>>>>>> optional when querying a materialized view.
>>>>>>> 
>>>>>>> Jaydeep
>>>>>>> 
>>>>>>> 
>>>>>>> On Mon, May 12, 2025 at 10:45 AM Jon Haddad <j...@rustyrazorblade.com> 
>>>>>>> wrote:
>>>>>>>> 
>>>>>>>> > Or compaction hasn’t made a mistake, or cell merge reconciliation 
>>>>>>>> > hasn’t made a mistake, or volume bitrot hasn’t caused you to lose a 
>>>>>>>> > file.
>>>>>>>> > Repair isnt’ just about “have all transaction commits landed”. It’s 
>>>>>>>> > “is the data correct N days after it’s written”.
>>>>>>>> 
>>>>>>>> Don't forget about restoring from a backup.
>>>>>>>> 
>>>>>>>> Is there a way we could do some sort of hybrid compaction + 
>>>>>>>> incremental repair?  Maybe have the MV verify its view while it's 
>>>>>>>> compacting, and when it's done, mark the view's SSTable as repaired?  
>>>>>>>> Then the repair process would only need to do a MV to MV repair.
>>>>>>>> 
>>>>>>>> Jon
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Mon, May 12, 2025 at 9:37 AM Benedict Elliott Smith 
>>>>>>>> <bened...@apache.org> wrote:
>>>>>>>>>> Like something doesn't add up here because if it always includes the 
>>>>>>>>>> base table's primary key columns that means they could be storage 
>>>>>>>>>> attached by just forbidding additional columns and there doesn't 
>>>>>>>>>> seem to be much utility in including additional columns in the 
>>>>>>>>>> primary key?
>>>>>>>>> 
>>>>>>>>> You can re-order the keys, and they only need to be a part of the 
>>>>>>>>> primary key not the partition key. I think you can specify an 
>>>>>>>>> arbitrary order to the keys also, so you can change the effective 
>>>>>>>>> sort order. So, the basic idea is you stipulate something like 
>>>>>>>>> PRIMARY KEY ((v1),(ck1,pk1)).
>>>>>>>>> 
>>>>>>>>> This is basically a global index, with the restriction on single 
>>>>>>>>> columns as keys only because we cannot cheaply read-before-write for 
>>>>>>>>> eventually consistent operations. This restriction can easily be 
>>>>>>>>> relaxed for Paxos and Accord based implementations, which can also 
>>>>>>>>> safely include additional keys.
>>>>>>>>> 
>>>>>>>>> That said, I am not at all sure why they are called materialised 
>>>>>>>>> views if we don’t support including any other data besides the lookup 
>>>>>>>>> column and the primary key. We should really rename them once they 
>>>>>>>>> work, both to make some sense and to break with the historical 
>>>>>>>>> baggage.
>>>>>>>>> 
>>>>>>>>>> I think this can be represented as a tombstone which can always be 
>>>>>>>>>> fetched from the base table on read or maybe some other arrangement? 
>>>>>>>>>> I agree it can't feasibly be represented as an enumeration of the 
>>>>>>>>>> deletions at least not synchronously and doing it async has its own 
>>>>>>>>>> problems.
>>>>>>>>> If the base table must be read on read of an index/view, then I think 
>>>>>>>>> this proposal is approximately linearizable for the view as well 
>>>>>>>>> (though, I do not at all warrant this statement). You still need to 
>>>>>>>>> propagate this eventually so that the views can cleanup. This also 
>>>>>>>>> makes reads 2RT on read, which is rather costly.
>>>>>>>>> 
>>>>>>>>>> On 12 May 2025, at 16:10, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi,
>>>>>>>>>> 
>>>>>>>>>> I think it's worth taking a step back and looking at the current MV 
>>>>>>>>>> restrictions which are pretty onerous.
>>>>>>>>>> 
>>>>>>>>>> A view must have a primary key and that primary key must conform to 
>>>>>>>>>> the following restrictions:
>>>>>>>>>>  • it must contain all the primary key columns of the base table. 
>>>>>>>>>> This ensures that every row of the view corresponds to exactly one 
>>>>>>>>>> row of the base table.
>>>>>>>>>>  • it can only contain a single column that is not a primary key 
>>>>>>>>>> column in the base table.
>>>>>>>>>> At that point what exactly is the value in including anything except 
>>>>>>>>>> the original primary key in the MV's primary key columns unless you 
>>>>>>>>>> are using an ordered partitioner so you can iterate based on the 
>>>>>>>>>> leading primary key columns?
>>>>>>>>>> 
>>>>>>>>>> Like something doesn't add up here because if it always includes the 
>>>>>>>>>> base table's primary key columns that means they could be storage 
>>>>>>>>>> attached by just forbidding additional columns and there doesn't 
>>>>>>>>>> seem to be much utility in including additional columns in the 
>>>>>>>>>> primary key?
>>>>>>>>>> 
>>>>>>>>>> I'm not that clear on how much better it is to look something up in 
>>>>>>>>>> the MV vs just looking at the base table or some non-materialized 
>>>>>>>>>> view of it. How exactly are these MVs supposed to be used and what 
>>>>>>>>>> value do they provide?
>>>>>>>>>> 
>>>>>>>>>> Jeff Jirsa wrote:
>>>>>>>>>>> There’s 2 things in this proposal that give me a lot of pause.
>>>>>>>>>> 
>>>>>>>>>> Runtian Liu pointed out that the CEP is sort of divided into two 
>>>>>>>>>> parts. The first is the online part which is making reads/writes to 
>>>>>>>>>> MVs safer and more reliable using a transaction system. The second 
>>>>>>>>>> is offline which is repair.
>>>>>>>>>> 
>>>>>>>>>> The story for the online portion I think is quite strong and worth 
>>>>>>>>>> considering on its own merits.
>>>>>>>>>> 
>>>>>>>>>> The offline portion (repair) sounds a little less feasible to run in 
>>>>>>>>>> production, but I also think that MVs without any mechanism for 
>>>>>>>>>> checking their consistency are not viable to run in production. So 
>>>>>>>>>> it's kind of pay for what you use in terms of the feature?
>>>>>>>>>> 
>>>>>>>>>> It's definitely worth thinking through if there is a way to fix one 
>>>>>>>>>> side of this equation so it works better.
>>>>>>>>>> 
>>>>>>>>>> David Capwell wrote:
>>>>>>>>>> 
>>>>>>>>>>> As far as I can tell, being based off Accord means you don’t need 
>>>>>>>>>>> to care about repair, as Accord will manage the consistency for 
>>>>>>>>>>> you; you can’t get out of sync.
>>>>>>>>>> I think a baseline requirement in C* for something to be in 
>>>>>>>>>> production is to be able to run preview repair and validate that the 
>>>>>>>>>> transaction system or any other part of Cassandra hasn't made a 
>>>>>>>>>> mistake. Divergence can have many sources including Accord.
>>>>>>>>>> 
>>>>>>>>>> Runtian Liu wrote:
>>>>>>>>>>> For the example David mentioned, LWT cannot support it. Since LWTs 
>>>>>>>>>>> operate on a single token, we’ll need to restrict base-table 
>>>>>>>>>>> updates to one partition—and ideally one row—at a time. A current 
>>>>>>>>>>> MV base-table command can delete an entire partition, but doing so 
>>>>>>>>>>> might touch hundreds of MV partitions, making consistency 
>>>>>>>>>>> guarantees impossible. 
>>>>>>>>>> I think this can be represented as a tombstone which can always be 
>>>>>>>>>> fetched from the base table on read or maybe some other arrangement? 
>>>>>>>>>> I agree it can't feasibly be represented as an enumeration of the 
>>>>>>>>>> deletions at least not synchronously and doing it async has its own 
>>>>>>>>>> problems.
>>>>>>>>>> 
>>>>>>>>>> Ariel
>>>>>>>>>> 
>>>>>>>>>> On Fri, May 9, 2025, at 4:03 PM, Jeff Jirsa wrote:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On May 9, 2025, at 12:59 PM, Ariel Weisberg <ar...@weisberg.ws> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> I am *big* fan of getting repair really working with MVs. It does 
>>>>>>>>>>>> seem problematic that the number of merkle trees will be equal to 
>>>>>>>>>>>> the number of ranges in the cluster and repair of MVs would become 
>>>>>>>>>>>> an all node operation.  How would down nodes be handled and how 
>>>>>>>>>>>> many nodes would simultaneously working to validate a given base 
>>>>>>>>>>>> table range at once? How many base table ranges could 
>>>>>>>>>>>> simultaneously be repairing MVs?
>>>>>>>>>>>> 
>>>>>>>>>>>> If a row containing a column that creates an MV partition is 
>>>>>>>>>>>> deleted, and the MV isn't updated, then how does the merkle tree 
>>>>>>>>>>>> approach propagate the deletion to the MV? The CEP says that 
>>>>>>>>>>>> anti-compaction would remove extra rows, but I am not clear on how 
>>>>>>>>>>>> that works. When is anti-compaction performed in the repair 
>>>>>>>>>>>> process and what is/isn't included in the outputs?
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> I thought about these two points last night after I sent my email.
>>>>>>>>>>> 
>>>>>>>>>>> There’s 2 things in this proposal that give me a lot of pause.
>>>>>>>>>>> 
>>>>>>>>>>> One is the lack of tombstones / deletions in the merkle trees, which 
>>>>>>>>>>> makes properly dealing with writes/deletes/inconsistency very hard 
>>>>>>>>>>> (afaict)
>>>>>>>>>>> 
>>>>>>>>>>> The second is the reality that repairing a single partition in the 
>>>>>>>>>>> base table may repair all hosts/ranges in the MV table, and vice 
>>>>>>>>>>> versa. Basically scanning either base or MV is effectively scanning 
>>>>>>>>>>> the whole cluster (modulo what you can avoid in the clean/dirty 
>>>>>>>>>>> repaired sets). This makes me really, really concerned with how it 
>>>>>>>>>>> scales, and how likely it is to be able to schedule automatically 
>>>>>>>>>>> without blowing up. 
>>>>>>>>>>> 
>>>>>>>>>>> The paxos vs accord comments so far are interesting in that I think 
>>>>>>>>>>> both could be made to work, but I am very concerned about how the 
>>>>>>>>>>> merkle tree comparisons are likely to work with wide partitions 
>>>>>>>>>>> leading to massive fanout in ranges. 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>> 
