Definitely want to avoid scope creep, *however*... ;) What if the CEP includes an interface for MV repair that calls out to a user-pluggable solution, and the Spark-based solution you've developed is the first/only reference implementation available at the outset? That way we could integrate it into the control plane (nodetool, JMX, a post-CEP-37 world), have a solution available for those comfortable taking on the Spark dependency, and have a clear, paved path for future anti-entropy work showing where it fits into the ecosystem.
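To make that concrete, here is a rough sketch of what such a pluggable hook could look like. Everything here is hypothetical (the names MVRepairProvider and RepairHandle, the string-encoded ranges) and none of it is part of the CEP; the only point is that the control plane would depend on a small interface, with the Spark job registered as the first implementation.

```java
import java.util.Collection;
import java.util.UUID;

/**
 * Hypothetical plugin point for MV repair -- a sketch, not a CEP artifact.
 * nodetool/JMX would talk only to this interface; how the work is actually done
 * (external Spark job, vendor tooling, a future in-tree mechanism) is up to the
 * registered implementation.
 */
public interface MVRepairProvider
{
    /** Short name surfaced by the control plane, e.g. "spark-mv-repair". */
    String name();

    /**
     * Start detection/repair for one base-table token range and the view token
     * ranges it maps to (the same shape as the proposed `nodetool mv_rebuild`
     * arguments), returning a task id the operator can poll.
     */
    UUID submit(String keyspace, String baseTable, String view,
                String baseRange, Collection<String> viewRanges);

    /** Report progress for a previously submitted task. */
    RepairHandle status(UUID taskId);

    /** Minimal task handle; a real API would carry richer progress and error detail. */
    record RepairHandle(UUID id, String state, long repairedRows) {}
}
```

With something like this in place, the paved path is just "implement the interface": the Spark-based tool stays external, but operators drive it through the same nodetool/JMX surface a future native repair would use.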
On Thu, Jul 31, 2025, at 5:20 PM, Runtian Liu wrote: > Based on our discussion, it seems we’ve reached broad agreement on the > hot‑path optimization for strict MV consistency mode. We still have concerns > about the different approaches for the repair path. I’ve updated the CEP’s > scope and retitled it *‘Cassandra Materialized View Consistency, Reliability > & Backfill Enhancement’.* I removed the repair section and added a backfill > section that proposes an optimized strategy to backfill MVs from the base > table. Please take a look and share your thoughts. I believe this backfill > strategy will also benefit future MV repair work. > > Regarding repair, > > While repair is essential to maintain Materialized View (MV) consistency in > the face of one-off bugs or bit rot, implementing a robust, native repair > mechanism within Apache Cassandra remains a highly complex and still largely > theoretical challenge, irrespective of the strategies considered in prior > discussions. > > To ensure steady progress, we propose addressing the problem incrementally. > This CEP focuses on making the hot-path & backfill reliable, laying a solid > foundation. A future CEP can then address repair mechanisms within the Apache > Cassandra ecosystem in a more focused and well-scoped manner. > > For now, the proposed CEP clarifies that users may choose to either evaluate > the external Spark-based repair tool > <https://github.com/jaydeepkumar1984/cassandra-mv-repair-spark-job> (which is > not part of this proposal) or develop their own repair solution tailored to > their operational needs. > > On Thu, Jul 3, 2025 at 4:32 PM Runtian Liu <curly...@gmail.com> wrote: >> I believe that, with careful design, we could make the row-level repair work >> using the index-based approach. >> However, regarding the CEP-proposed solution, I want to highlight that the >> entire repair job can be divided into two parts: >> >> 1. Inconsistency detection >> >> 2. Rebuilding the inconsistent ranges >> >> The second step could look like this: >> >> nodetool mv_rebuild --base_table_range <b_start,b_end> --mv_ranges >> [<mv_start1, mv_end1>, <mv_start2, mv_end2>, ...] >> >> >> >> The base table range and MV ranges would come from the first step. But it’s >> also possible to run this rebuild directly—without detecting >> inconsistencies—if we already know that some nodes need repair. This means >> we don’t have to wait for the snapshot processing to start the repair. >> Snapshot-based detection works well when the cluster is generally healthy. >> >> Compared to the current rebuild, the proposed approach doesn’t require >> dropping the MV and rebuilding it from scratch across the entire >> cluster—which is a major blocker for production environments. >> >> That said, for the index-based approach, I still think it introduces >> additional load on both the write path and compaction. This increases the >> resource cost of using MVs. While it’s a nice feature to enable row-level MV >> repair, its implementation is complex. On the other hand, the original >> proposal’s rebuild strategy is more versatile and likely to perform >> better—especially when only a few nodes need repair. In contrast, >> large-scale row-level comparisons via the index-based approach could be >> prohibitively expensive. Just to clarify my intention here: this is not a >> complete 'no' to the index-based approach. However, for the initial version, >> I believe it's more prudent to avoid impacting the hot path. 
We can start >> with a simpler design that keeps the critical path untouched, and then >> iterate to make it more robust over time—much like how other features in >> Apache Cassandra have evolved. For example, 'full repair' came first, >> followed by 'incremental repair'; similarly, STCS was later complemented by >> LCS. This phased evolution allows us to balance safety, stability, and >> long-term capability. >> >> Given all this, I still prefer to move forward with the original proposal. >> It allows us to avoid paying the repair overhead most of the time. >> >> >> >> On Tue, Jun 24, 2025 at 3:25 PM Blake Eggleston <bl...@ultrablake.com> wrote: >>> __ >>> Those are both fair points. Once you start dealing with data loss though, >>> maintaining guarantees is often impossible, so I’m not sure that torn >>> writes or updated timestamps are dealbreakers, but I’m fine with tabling >>> option 2 for now and seeing if we can figure out something better. >>> >>> Regarding the assassin cells, if you wanted to avoid explicitly agreeing on >>> a value, you might be able to only issue them for repaired base data, which >>> has been implicitly agreed upon. >>> >>> I think that or something like it is worth exploring. The idea would be to >>> solve this issue as completely as anti-compaction would - but without >>> having to rewrite sstables. I’d be interested to hear any ideas you have >>> about how that might work. >>> >>> You basically need a mechanism to erase some piece of data that was written >>> before a given wall clock time - regardless of cell timestamp, and without >>> precluding future updates (in wall clock time) with earlier timestamps. >>> >>> On Mon, Jun 23, 2025, at 4:28 PM, Runtian Liu wrote: >>>> In the second option, we use the repair timestamp to re-update any cell or >>>> row we want to fix in the base table. This approach is problematic because >>>> it alters the write time of user-supplied data. Although Cassandra does >>>> not allow users to set timestamps for LWT writes, users may still rely on >>>> the update time. A key limitation of this approach is that it cannot fix >>>> cases where a view cell ends up in a future state while the base table >>>> remains correct. I now understand your point that Cassandra cannot handle >>>> this scenario today. However, as I mentioned earlier, the important >>>> distinction is that when this issue occurs in the base table, we accept >>>> the "incorrect" data as valid—but this is not acceptable for materialized >>>> views, since the source of truth (the base table) still holds the correct >>>> data. >>>> >>>> On Mon, Jun 23, 2025 at 12:05 PM Blake Eggleston <bl...@ultrablake.com> >>>> wrote: >>>>> __ >>>>> > Sorry, Blake—I was traveling last week and couldn’t reply to your email >>>>> > sooner. >>>>> >>>>> No worries, I’ll be taking some time off soon as well. >>>>> >>>>> > I don’t think the first or second option is ideal. We should treat the >>>>> > base table as the source of truth. Modifying it—or forcing an update on >>>>> > it, even if it’s just a timestamp change—is not a good approach and >>>>> > won’t solve all problems. >>>>> >>>>> I agree the first option probably isn’t the right way to go. Could you >>>>> say a bit more about why the second option is not a good approach and >>>>> which problems it won’t solve? >>>>> >>>>> On Sun, Jun 22, 2025, at 6:09 PM, Runtian Liu wrote: >>>>>> Sorry, Blake—I was traveling last week and couldn’t reply to your email >>>>>> sooner. 
>>>>>> >>>>>> > First - we interpret view data with higher timestamps than the base >>>>>> > table as data that’s missing from the base and replicate it into the >>>>>> > base table. The timestamp of the missing data may be below the paxos >>>>>> > timestamp low bound so we’d have to adjust the paxos coordination >>>>>> > logic to allow that in this case. Depending on how the view got this >>>>>> > way it may also tear writes to the base table, breaking the write >>>>>> > atomicity promise. >>>>>> >>>>>> As discussed earlier, we want this MV repair mechanism to handle all >>>>>> edge cases. However, it would be difficult to design it in a way that >>>>>> detects the root cause of each mismatch and repairs it accordingly. >>>>>> Additionally, as you mentioned, this approach could introduce other >>>>>> issues, such as violating the write atomicity guarantee. >>>>>> >>>>>> > Second - If this happens it means that we’ve either lost base table >>>>>> > data or paxos metadata. If that happened, we could force a base table >>>>>> > update that rewrites the current base state with new timestamps making >>>>>> > the extra view data removable. However this wouldn’t fix the case >>>>>> > where the view cell has a timestamp from the future - although that’s >>>>>> > not a case that C* can fix today either. >>>>>> >>>>>> I don’t think the first or second option is ideal. We should treat the >>>>>> base table as the source of truth. Modifying it—or forcing an update on >>>>>> it, even if it’s just a timestamp change—is not a good approach and >>>>>> won’t solve all problems. >>>>>> >>>>>> > the idea to use anti-compaction makes a lot more sense now (in >>>>>> > principle - I don’t think it’s workable in practice) >>>>>> >>>>>> I have one question regarding anti-compaction. Is the main concern that >>>>>> processing too much data during anti-compaction could cause issues for >>>>>> the cluster? >>>>>> >>>>>> > I guess you could add some sort of assassin cell that is meant to >>>>>> > remove a cell with a specific timestamp and value, but is otherwise >>>>>> > invisible. >>>>>> >>>>>> The idea of the assassination cell is interesting. To prevent data from >>>>>> being incorrectly removed during the repair process, we need to ensure a >>>>>> quorum of nodes is available and agrees on the same value before >>>>>> repairing a materialized view (MV) row or cell. However, this could be >>>>>> very expensive, as it requires coordination to repair even a single row. >>>>>> >>>>>> I think there are a few key differences between MV repair and normal >>>>>> anti-entropy repair: >>>>>> >>>>>> 1. For normal anti-entropy repair, there is no single source of truth. >>>>>> All replicas act as sources of truth, and eventual consistency is >>>>>> achieved by merging data across them. Even if some data is lost or bugs >>>>>> result in future timestamps, the replicas will eventually converge on >>>>>> the same (possibly incorrect) value, and that value becomes the accepted >>>>>> truth. In contrast, MV repair relies on the base table as the source of >>>>>> truth and modifies the MV if inconsistencies are detected. This means >>>>>> that in some cases, simply merging data won’t resolve the issue, since >>>>>> Cassandra resolves conflicts using timestamps and there’s no guarantee >>>>>> the base table will always "win" unless we change the MV merging >>>>>> logic—which I’ve been trying to avoid. >>>>>> >>>>>> 2. 
I see MV repair as more of an on-demand operation, whereas normal >>>>>> anti-entropy repair needs to run regularly. This means we shouldn’t >>>>>> treat MV repair the same way as existing repairs. When an operator >>>>>> initiates MV repair, they need to ensure that sufficient resources are >>>>>> available to support it. >>>>>> >>>>>> >>>>>> On Thu, Jun 12, 2025 at 8:53 AM Blake Eggleston <bl...@ultrablake.com> >>>>>> wrote: >>>>>>> __ >>>>>>> Got it, thanks for clearing that up. I’d seen the strict liveness code >>>>>>> around but didn’t realize it was MV related and hadn’t dug into what it >>>>>>> did or how it worked. >>>>>>> >>>>>>> I think you’re right about the row liveness update working for extra >>>>>>> data with timestamps lower than the most recent base table update. >>>>>>> >>>>>>> I see what you mean about the timestamp from the future case. I thought >>>>>>> of 3 options: >>>>>>> >>>>>>> First - we interpret view data with higher timestamps than the base >>>>>>> table as data that’s missing from the base and replicate it into the >>>>>>> base table. The timestamp of the missing data may be below the paxos >>>>>>> timestamp low bound so we’d have to adjust the paxos coordination logic >>>>>>> to allow that in this case. Depending on how the view got this way it >>>>>>> may also tear writes to the base table, breaking the write atomicity >>>>>>> promise. >>>>>>> >>>>>>> Second - If this happens it means that we’ve either lost base table >>>>>>> data or paxos metadata. If that happened, we could force a base table >>>>>>> update that rewrites the current base state with new timestamps making >>>>>>> the extra view data removable. However this wouldn’t fix the case where >>>>>>> the view cell has a timestamp from the future - although that’s not a >>>>>>> case that C* can fix today either. >>>>>>> >>>>>>> Third - we add a new tombstone type or some mechanism to delete >>>>>>> specific cells that doesn’t preclude correct writes with lower >>>>>>> timestamps from being visible. I’m not sure how this would work, and >>>>>>> the idea to use anti-compaction makes a lot more sense now (in >>>>>>> principle - I don’t think it’s workable in practice). I guess you could >>>>>>> add some sort of assassin cell that is meant to remove a cell with a >>>>>>> specific timestamp and value, but is otherwise invisible. This seems >>>>>>> dangerous though, since it’s likely there’s a replication problem that >>>>>>> may resolve itself and the repair process would actually be removing >>>>>>> data that the user intended to write. >>>>>>> >>>>>>> Paulo - I don’t think storage changes are off the table, but they do >>>>>>> expand the scope and risk of the proposal, so we should be careful. >>>>>>> >>>>>>> On Wed, Jun 11, 2025, at 4:44 PM, Paulo Motta wrote: >>>>>>>> > I’m not sure if this is the only edge case—there may be other >>>>>>>> issues as well. I’m also unsure whether we should redesign the >>>>>>>> tombstone handling for MVs, since that would involve changes to the >>>>>>>> storage engine. To minimize impact there, the original proposal was to >>>>>>>> rebuild the affected ranges using anti-compaction, just to be safe. >>>>>>>> >>>>>>>> I haven't been following the discussion but I think one of the issues >>>>>>>> with the materialized view "strict liveness" fix[1] is that we avoided >>>>>>>> making invasive changes to the storage engine at the time, but this >>>>>>>> was considered by Zhao on [1]. 
I think we shouldn't be trying to avoid >>>>>>>> updates to the storage format as part of the MV implementation, if >>>>>>>> this is what it takes to make MVs V2 reliable. >>>>>>>> >>>>>>>> [1] - >>>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-11500?focusedCommentId=16101603&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16101603 >>>>>>>> >>>>>>>> On Wed, Jun 11, 2025 at 7:02 PM Runtian Liu <curly...@gmail.com> wrote: >>>>>>>>> The current design leverages strict liveness to shadow the old view >>>>>>>>> row. When the view-indexed value changes from 'a' to 'b', no >>>>>>>>> tombstone is written; instead, the old row is marked as expired by >>>>>>>>> updating its liveness info with the timestamp of the change. If the >>>>>>>>> column is later set back to 'a', the view row is re-inserted with a >>>>>>>>> new, non-expired liveness info reflecting the latest timestamp. >>>>>>>>> >>>>>>>>> To delete an extra row in the materialized view (MV), we can likely >>>>>>>>> use the same approach—marking it as shadowed by updating the liveness >>>>>>>>> info. However, in the case of inconsistencies where a column in the >>>>>>>>> MV has a higher timestamp than the corresponding column in the base >>>>>>>>> table, this row-level liveness mechanism is insufficient. >>>>>>>>> >>>>>>>>> Even for the case where we delete the row by marking its liveness >>>>>>>>> info as expired during repair, there are concerns. Since this >>>>>>>>> introduces a data mutation as part of the repair process, it’s >>>>>>>>> unclear whether there could be edge cases we’re missing. This >>>>>>>>> approach may risk unexpected side effects if the repair logic is not >>>>>>>>> carefully aligned with write path semantics. >>>>>>>>> >>>>>>>>> >>>>>>>>> On Thu, Jun 12, 2025 at 3:59 AM Blake Eggleston >>>>>>>>> <bl...@ultrablake.com> wrote: >>>>>>>>>> __ >>>>>>>>>> That’s a good point, although as described I don’t think that could >>>>>>>>>> ever work properly, even in normal operation. Either we’re >>>>>>>>>> misunderstanding something, or this is a flaw in the current MV >>>>>>>>>> design. >>>>>>>>>> >>>>>>>>>> Assuming changing the view indexed column results in a tombstone >>>>>>>>>> being applied to the view row for the previous value, if we wrote >>>>>>>>>> the other base columns (the non view indexed ones) to the view with >>>>>>>>>> the same timestamps they have on the base, then changing the view >>>>>>>>>> indexed value from ‘a’ to ‘b’, then back to ‘a’ would always cause >>>>>>>>>> this problem. I think you’d need to always update the column >>>>>>>>>> timestamps on the view to be >= the view column timestamp on the base >>>>>>>>>> >>>>>>>>>> On Tue, Jun 10, 2025, at 11:38 PM, Runtian Liu wrote: >>>>>>>>>>> > In the case of a missed update, we'll have a new value and we can >>>>>>>>>>> > send a tombstone to the view with the timestamp of the most >>>>>>>>>>> > recent update. >>>>>>>>>>> >>>>>>>>>>> > then something has gone wrong and we should issue a tombstone >>>>>>>>>>> > using the paxos repair timestamp as the tombstone timestamp. >>>>>>>>>>> >>>>>>>>>>> The current MV implementation uses “strict liveness” to determine >>>>>>>>>>> whether a row is live. I believe that using regular tombstones >>>>>>>>>>> during repair could cause problems. For example, consider a base >>>>>>>>>>> table with schema (pk, ck, v1, v2) and a materialized view with >>>>>>>>>>> schema (v1, pk, ck) -> v2. 
If, for some reason, we detect an extra >>>>>>>>>>> row in the MV and delete it using a tombstone with the latest >>>>>>>>>>> update timestamp, we may run into issues. Suppose we later update >>>>>>>>>>> the base table’s v1 field to match the MV row we previously >>>>>>>>>>> deleted, and the v2 value now has an older timestamp. In that case, >>>>>>>>>>> the previously issued tombstone could still shadow the v2 column, >>>>>>>>>>> which is unintended. That is why I was asking if we are going to >>>>>>>>>>> introduce a new kind of tombstones. I’m not sure if this is the >>>>>>>>>>> only edge case—there may be other issues as well. I’m also unsure >>>>>>>>>>> whether we should redesign the tombstone handling for MVs, since >>>>>>>>>>> that would involve changes to the storage engine. To minimize >>>>>>>>>>> impact there, the original proposal was to rebuild the affected >>>>>>>>>>> ranges using anti-compaction, just to be safe. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Wed, Jun 11, 2025 at 1:20 AM Blake Eggleston >>>>>>>>>>> <bl...@ultrablake.com> wrote: >>>>>>>>>>>> __ >>>>>>>>>>>>> Extra row in MV (assuming the tombstone is gone in the base >>>>>>>>>>>>> table) — how should we fix this? >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> This would mean that the base table had either updated or deleted >>>>>>>>>>>> a row and the view didn't receive the corresponding delete. >>>>>>>>>>>> >>>>>>>>>>>> In the case of a missed update, we'll have a new value and we can >>>>>>>>>>>> send a tombstone to the view with the timestamp of the most recent >>>>>>>>>>>> update. Since timestamps issued by paxos and accord writes are >>>>>>>>>>>> always increasing monotonically and don't have collisions, this is >>>>>>>>>>>> safe. >>>>>>>>>>>> >>>>>>>>>>>> In the case of a row deletion, we'd also want to send a tombstone >>>>>>>>>>>> with the same timestamp, however since tombstones can be purged, >>>>>>>>>>>> we may not have that information and would have to treat it like >>>>>>>>>>>> the view has a higher timestamp than the base table. >>>>>>>>>>>> >>>>>>>>>>>>> Inconsistency (timestamps don’t match) — it’s easy to fix when >>>>>>>>>>>>> the base table has higher timestamps, but how do we resolve it >>>>>>>>>>>>> when the MV columns have higher timestamps? >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> There are 2 ways this could happen. First is that a write failed >>>>>>>>>>>> and paxos repair hasn't completed it, which is expected, and the >>>>>>>>>>>> second is a replication bug or base table data loss. You'd need to >>>>>>>>>>>> compare the view timestamp to the paxos repair history to tell >>>>>>>>>>>> which it is. If the view timestamp is higher than the most recent >>>>>>>>>>>> paxos repair timestamp for the key, then it may just be a failed >>>>>>>>>>>> write and we should do nothing. If the view timestamp is less than >>>>>>>>>>>> the most recent paxos repair timestamp for that key and higher >>>>>>>>>>>> than the base timestamp, then something has gone wrong and we >>>>>>>>>>>> should issue a tombstone using the paxos repair timestamp as the >>>>>>>>>>>> tombstone timestamp. This is safe to do because the paxos repair >>>>>>>>>>>> timestamps act as a low bound for ballots paxos will process, so >>>>>>>>>>>> it wouldn't be possible for a legitimate write to be shadowed by >>>>>>>>>>>> this tombstone. >>>>>>>>>>>> >>>>>>>>>>>>> Do we need to introduce a new kind of tombstone to shadow the >>>>>>>>>>>>> rows in the MV for cases 2 and 3? If yes, how will this tombstone >>>>>>>>>>>>> work? 
If no, how should we fix the MV data? >>>>>>>>>>>>> >>>>>>>>>>>> No, a normal tombstone would work. >>>>>>>>>>>> On Tue, Jun 10, 2025, at 2:42 AM, Runtian Liu wrote: >>>>>>>>>>>>> Okay, let’s put the efficiency discussion on hold for now. I want >>>>>>>>>>>>> to make sure the actual repair process after detecting >>>>>>>>>>>>> inconsistencies will work with the index-based solution. >>>>>>>>>>>>> >>>>>>>>>>>>> When a mismatch is detected, the MV replica will need to stream >>>>>>>>>>>>> its index file to the base table replica. The base table will >>>>>>>>>>>>> then perform a comparison between the two files. >>>>>>>>>>>>> >>>>>>>>>>>>> There are three cases we need to handle: >>>>>>>>>>>>> >>>>>>>>>>>>> 1. Missing row in MV — this is straightforward; we can propagate >>>>>>>>>>>>> the data to the MV. >>>>>>>>>>>>> >>>>>>>>>>>>> 2. Extra row in MV (assuming the tombstone is gone in the base >>>>>>>>>>>>> table) — how should we fix this? >>>>>>>>>>>>> >>>>>>>>>>>>> 3. Inconsistency (timestamps don’t match) — it’s easy to fix >>>>>>>>>>>>> when the base table has higher timestamps, but how do we resolve >>>>>>>>>>>>> it when the MV columns have higher timestamps? >>>>>>>>>>>>> >>>>>>>>>>>>> Do we need to introduce a new kind of tombstone to shadow the >>>>>>>>>>>>> rows in the MV for cases 2 and 3? If yes, how will this tombstone >>>>>>>>>>>>> work? If no, how should we fix the MV data? >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Mon, Jun 9, 2025 at 11:00 AM Blake Eggleston >>>>>>>>>>>>> <bl...@ultrablake.com> wrote: >>>>>>>>>>>>>> __ >>>>>>>>>>>>>> > hopefully we can come up with a solution that everyone agrees >>>>>>>>>>>>>> > on. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I’m sure we can, I think we’ve been making good progress >>>>>>>>>>>>>> >>>>>>>>>>>>>> > My main concern with the index-based solution is the overhead >>>>>>>>>>>>>> > it adds to the hot path, as well as having to build indexes >>>>>>>>>>>>>> > periodically. >>>>>>>>>>>>>> >>>>>>>>>>>>>> So the additional overhead of maintaining a storage attached >>>>>>>>>>>>>> index on the client write path is pretty minimal - it’s >>>>>>>>>>>>>> basically adding data to an in memory trie. It’s a little extra >>>>>>>>>>>>>> work and memory usage, but there isn’t any extra io or other >>>>>>>>>>>>>> blocking associated with it. I’d expect the latency impact to be >>>>>>>>>>>>>> negligible. >>>>>>>>>>>>>> >>>>>>>>>>>>>> > As mentioned earlier, this MV repair should be an infrequent >>>>>>>>>>>>>> > operation >>>>>>>>>>>>>> >>>>>>>>>>>>>> I don’t think that’s a safe assumption. There are a lot of >>>>>>>>>>>>>> situations outside of data loss bugs where repair would need to >>>>>>>>>>>>>> be run. >>>>>>>>>>>>>> >>>>>>>>>>>>>> These use cases could probably be handled by repairing the view >>>>>>>>>>>>>> with other view replicas: >>>>>>>>>>>>>> >>>>>>>>>>>>>> Scrubbing corrupt sstables >>>>>>>>>>>>>> Node replacement via backup >>>>>>>>>>>>>> >>>>>>>>>>>>>> These use cases would need an actual MV repair to check >>>>>>>>>>>>>> consistency with the base table: >>>>>>>>>>>>>> >>>>>>>>>>>>>> Restoring a cluster from a backup >>>>>>>>>>>>>> Imported sstables via nodetool import >>>>>>>>>>>>>> Data loss from operator error >>>>>>>>>>>>>> Proactive consistency checks - i.e. preview repairs >>>>>>>>>>>>>> >>>>>>>>>>>>>> Even if it is an infrequent operation, when operators need it, >>>>>>>>>>>>>> it needs to be available and reliable. 
>>>>>>>>>>>>>> >>>>>>>>>>>>>> It’s a fact that there are clusters where non-incremental >>>>>>>>>>>>>> repairs are run on a cadence of a week or more to manage the >>>>>>>>>>>>>> overhead of validation compactions. Assuming the cluster doesn’t >>>>>>>>>>>>>> have any additional headroom, that would mean that any one of >>>>>>>>>>>>>> the above events could cause views to remain out of sync for up >>>>>>>>>>>>>> to a week while the full set of merkle trees is being built. >>>>>>>>>>>>>> >>>>>>>>>>>>>> This delay eliminates a lot of the value of repair as a risk >>>>>>>>>>>>>> mitigation tool. If I had to make a recommendation where a bad >>>>>>>>>>>>>> call could cost me my job, the prospect of a 7 day delay on >>>>>>>>>>>>>> repair would mean a strong no. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Some users also run preview repair continuously to detect data >>>>>>>>>>>>>> consistency errors, so at least a subset of users will probably >>>>>>>>>>>>>> be running MV repairs continuously - at least in preview mode. >>>>>>>>>>>>>> >>>>>>>>>>>>>> That’s why I say that the replication path should be designed to >>>>>>>>>>>>>> never need repair, and MV repair should be designed to be >>>>>>>>>>>>>> prepared for the worst. >>>>>>>>>>>>>> >>>>>>>>>>>>>> > I’m wondering if it’s possible to enable or disable index >>>>>>>>>>>>>> > building dynamically so that we don’t always incur the cost >>>>>>>>>>>>>> > for something that’s rarely needed. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I think this would be a really reasonable compromise as long as >>>>>>>>>>>>>> the default is on. That way it’s as safe as possible by default, >>>>>>>>>>>>>> but users who don’t care or have a separate system for repairing >>>>>>>>>>>>>> MVs can opt out. >>>>>>>>>>>>>> >>>>>>>>>>>>>> > I’m not sure what you mean by “data problems” here. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I mean out of sync views - either due to bugs, operator error, >>>>>>>>>>>>>> corruption, etc. >>>>>>>>>>>>>> >>>>>>>>>>>>>> > Also, this does scale with cluster size—I’ve compared it to >>>>>>>>>>>>>> > full repair, and this MV repair should behave similarly. That >>>>>>>>>>>>>> > means as long as full repair works, this repair should work as >>>>>>>>>>>>>> > well. >>>>>>>>>>>>>> >>>>>>>>>>>>>> You could build the merkle trees at about the same cost as a >>>>>>>>>>>>>> full repair, but the actual data repair path is completely >>>>>>>>>>>>>> different for MVs, and that’s the part that doesn’t scale well. >>>>>>>>>>>>>> As you know, with normal repair, we just stream data for ranges >>>>>>>>>>>>>> detected as out of sync. For MVs, since the data isn’t in base >>>>>>>>>>>>>> partition order, the view data for an out of sync view range >>>>>>>>>>>>>> needs to be read out and streamed to every base replica that >>>>>>>>>>>>>> it’s detected a mismatch against. So in the example I gave with >>>>>>>>>>>>>> the 300 node cluster, you’re looking at reading and transmitting >>>>>>>>>>>>>> the same partition at least 100 times in the best case, and the >>>>>>>>>>>>>> cost of this keeps going up as the cluster increases in size. >>>>>>>>>>>>>> That's the part that doesn't scale well. >>>>>>>>>>>>>> >>>>>>>>>>>>>> This is also one of the benefits of the index design. Since it >>>>>>>>>>>>>> stores data in segments that roughly correspond to points on the >>>>>>>>>>>>>> grid, you’re not rereading the same data over and over. 
A repair >>>>>>>>>>>>>> for a given grid point only reads an amount of data proportional >>>>>>>>>>>>>> to the data in common for the base/view grid point, and it’s >>>>>>>>>>>>>> stored in a small enough granularity that the base can calculate >>>>>>>>>>>>>> what data needs to be sent to the view without having to read >>>>>>>>>>>>>> the entire view partition. >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Sat, Jun 7, 2025, at 7:42 PM, Runtian Liu wrote: >>>>>>>>>>>>>>> Thanks, Blake. I’m open to iterating on the design, and >>>>>>>>>>>>>>> hopefully we can come up with a solution that everyone agrees >>>>>>>>>>>>>>> on. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> My main concern with the index-based solution is the overhead >>>>>>>>>>>>>>> it adds to the hot path, as well as having to build indexes >>>>>>>>>>>>>>> periodically. As mentioned earlier, this MV repair should be an >>>>>>>>>>>>>>> infrequent operation, but the index-based approach shifts some >>>>>>>>>>>>>>> of the work to the hot path in order to allow repairs that >>>>>>>>>>>>>>> touch only a few nodes. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I’m wondering if it’s possible to enable or disable index >>>>>>>>>>>>>>> building dynamically so that we don’t always incur the cost for >>>>>>>>>>>>>>> something that’s rarely needed. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> > it degrades operators ability to react to data problems by >>>>>>>>>>>>>>> > imposing a significant upfront processing burden on repair, >>>>>>>>>>>>>>> > and that it doesn’t scale well with cluster size >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I’m not sure what you mean by “data problems” here. Also, this >>>>>>>>>>>>>>> does scale with cluster size—I’ve compared it to full repair, >>>>>>>>>>>>>>> and this MV repair should behave similarly. That means as long >>>>>>>>>>>>>>> as full repair works, this repair should work as well. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> For example, regardless of how large the cluster is, you can >>>>>>>>>>>>>>> always enable Merkle tree building on 10% of the nodes at a >>>>>>>>>>>>>>> time until all the trees are ready. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I understand that coordinating this type of repair is harder >>>>>>>>>>>>>>> than what we currently support, but with CEP-37, we should be >>>>>>>>>>>>>>> able to handle this coordination without adding too much burden >>>>>>>>>>>>>>> on the operator side. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Sat, Jun 7, 2025 at 8:28 AM Blake Eggleston >>>>>>>>>>>>>>> <bl...@ultrablake.com> wrote: >>>>>>>>>>>>>>>> __ >>>>>>>>>>>>>>>>> I don't see any outcome here that is good for the community >>>>>>>>>>>>>>>>> though. Either Runtian caves and adopts your design that he >>>>>>>>>>>>>>>>> (and I) consider inferior, or he is prevented from >>>>>>>>>>>>>>>>> contributing this work. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hey Runtian, fwiw, these aren't the only 2 options. This isn’t >>>>>>>>>>>>>>>> a competition. We can collaborate and figure out the best >>>>>>>>>>>>>>>> approach to the problem. I’d like to keep discussing it if >>>>>>>>>>>>>>>> you’re open to iterating on the design. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I’m not married to our proposal, it’s just the cleanest way we >>>>>>>>>>>>>>>> could think of to address what Jon and I both see as blockers >>>>>>>>>>>>>>>> in the current proposal. It’s not set in stone though. 
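As a side note on the base-range/view-range "grid" discussed above, here is a minimal, illustrative sketch of how a single row maps to a cell of that grid; the modulo bucketing is only a stand-in for real token ranges, and none of these names come from the CEP or the index implementation.

```java
import java.util.Objects;

/**
 * Illustrative sketch only. The "grid" pairs one base-table token range with one
 * view token range; if the per-cell index segments or digests disagree for cell
 * (b, v), only rows whose base key falls in range b AND whose view key falls in
 * range v need to be re-read, instead of rereading whole view partitions.
 */
public final class GridCell
{
    final int baseBucket;
    final int viewBucket;

    GridCell(int baseBucket, int viewBucket)
    {
        this.baseBucket = baseBucket;
        this.viewBucket = viewBucket;
    }

    /** Map a row to its cell, given how many ranges the base and view rings are split into. */
    static GridCell of(Object baseKey, Object viewKey, int baseRanges, int viewRanges)
    {
        return new GridCell(Math.floorMod(Objects.hashCode(baseKey), baseRanges),
                            Math.floorMod(Objects.hashCode(viewKey), viewRanges));
    }

    @Override
    public String toString()
    {
        return "cell(base=" + baseBucket + ", view=" + viewBucket + ")";
    }
}
```

Losing a single view sstable then dirties an entire row of cells for that view range, which is the 300-node scenario discussed further down the thread.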
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Fri, Jun 6, 2025, at 1:32 PM, Benedict Elliott Smith wrote: >>>>>>>>>>>>>>>>> Hmm, I am very surprised as I helped write that and I >>>>>>>>>>>>>>>>> distinctly recall a specific goal was avoiding binding vetoes >>>>>>>>>>>>>>>>> as they're so toxic. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Ok, I guess you can block this work if you like. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I don't see any outcome here that is good for the community >>>>>>>>>>>>>>>>> though. Either Runtian caves and adopts your design that he >>>>>>>>>>>>>>>>> (and I) consider inferior, or he is prevented from >>>>>>>>>>>>>>>>> contributing this work. That isn't a functioning community in >>>>>>>>>>>>>>>>> my mind, so I'll be noping out for a while, as I don't see >>>>>>>>>>>>>>>>> much value here right now. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On 2025/06/06 18:31:08 Blake Eggleston wrote: >>>>>>>>>>>>>>>>> > Hi Benedict, that’s actually not true. >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > Here’s a link to the project governance page: >>>>>>>>>>>>>>>>> > _https://cwiki.apache.org/confluence/display/CASSANDRA/Cassandra+Project+Governance_ >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > The CEP section says: >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > “*Once the proposal is finalized and any major committer >>>>>>>>>>>>>>>>> > dissent reconciled, call a [VOTE] on the ML to have the >>>>>>>>>>>>>>>>> > proposal adopted. The criteria for acceptance is consensus >>>>>>>>>>>>>>>>> > (3 binding +1 votes and no binding vetoes). The vote should >>>>>>>>>>>>>>>>> > remain open for 72 hours.*” >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > So they’re definitely vetoable. >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > Also note the part about “*Once the proposal is finalized >>>>>>>>>>>>>>>>> > and any major committer dissent reconciled,*” being a >>>>>>>>>>>>>>>>> > prerequisite for moving a CEP to [VOTE]. Given the as yet >>>>>>>>>>>>>>>>> > unreconciled committer dissent, it wouldn’t even be >>>>>>>>>>>>>>>>> > appropriate to move to a VOTE until we get to the bottom of >>>>>>>>>>>>>>>>> > this repair discussion. >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > On Fri, Jun 6, 2025, at 12:31 AM, Benedict Elliott Smith >>>>>>>>>>>>>>>>> > wrote: >>>>>>>>>>>>>>>>> > > > but the snapshot repair design is not a viable path >>>>>>>>>>>>>>>>> > > > forward. It’s the first iteration of a repair design. >>>>>>>>>>>>>>>>> > > > We’ve proposed a second iteration, and we’re open to a >>>>>>>>>>>>>>>>> > > > third iteration. >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> > > I shan't be participating further in discussion, but I >>>>>>>>>>>>>>>>> > > want to make a point of order. The CEP process has no >>>>>>>>>>>>>>>>> > > vetoes, so you are not empowered to declare that a design >>>>>>>>>>>>>>>>> > > is not viable without the input of the wider community. >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> > > On 2025/06/05 03:58:59 Blake Eggleston wrote: >>>>>>>>>>>>>>>>> > > > You can detect and fix the mismatch in a single round >>>>>>>>>>>>>>>>> > > > of repair, but the amount of work needed to do it is >>>>>>>>>>>>>>>>> > > > _significantly_ higher with snapshot repair. Consider a >>>>>>>>>>>>>>>>> > > > case where we have a 300 node cluster w/ RF 3, where >>>>>>>>>>>>>>>>> > > > each view partition contains entries mapping to every >>>>>>>>>>>>>>>>> > > > token range in the cluster - so 100 ranges. If we lose >>>>>>>>>>>>>>>>> > > > a view sstable, it will affect an entire row/column of >>>>>>>>>>>>>>>>> > > > the grid. 
Repair is going to scan all data in the >>>>>>>>>>>>>>>>> > > > mismatching view token ranges 100 times, and each base >>>>>>>>>>>>>>>>> > > > range once. So you’re looking at 200 range scans. >>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>> > > > Now, you may argue that you can merge the duplicate >>>>>>>>>>>>>>>>> > > > view scans into a single scan while you repair all >>>>>>>>>>>>>>>>> > > > token ranges in parallel. I’m skeptical that’s going to >>>>>>>>>>>>>>>>> > > > be achievable in practice, but even if it is, we’re now >>>>>>>>>>>>>>>>> > > > talking about the view replica hypothetically doing a >>>>>>>>>>>>>>>>> > > > pairwise repair with every other replica in the cluster >>>>>>>>>>>>>>>>> > > > at the same time. Neither of these options is workable. >>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>> > > > Let’s take a step back though, because I think we’re >>>>>>>>>>>>>>>>> > > > getting lost in the weeds. >>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>> > > > The repair design in the CEP has some high level >>>>>>>>>>>>>>>>> > > > concepts that make a lot of sense, the idea of >>>>>>>>>>>>>>>>> > > > repairing a grid is really smart. However, it has some >>>>>>>>>>>>>>>>> > > > significant drawbacks that remain unaddressed. I want >>>>>>>>>>>>>>>>> > > > this CEP to succeed, and I know Jon does too, but the >>>>>>>>>>>>>>>>> > > > snapshot repair design is not a viable path forward. >>>>>>>>>>>>>>>>> > > > It’s the first iteration of a repair design. We’ve >>>>>>>>>>>>>>>>> > > > proposed a second iteration, and we’re open to a third >>>>>>>>>>>>>>>>> > > > iteration. This part of the CEP process is meant to >>>>>>>>>>>>>>>>> > > > identify and address shortcomings, I don’t think that >>>>>>>>>>>>>>>>> > > > continuing to dissect the snapshot repair design is >>>>>>>>>>>>>>>>> > > > making progress in that direction. >>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>> > > > On Wed, Jun 4, 2025, at 2:04 PM, Runtian Liu wrote: >>>>>>>>>>>>>>>>> > > > > > We potentially have to do it several times on each >>>>>>>>>>>>>>>>> > > > > > node, depending on the size of the range. Smaller >>>>>>>>>>>>>>>>> > > > > > ranges increase the size of the board >>>>>>>>>>>>>>>>> > > > > > exponentially, larger ranges increase the number of >>>>>>>>>>>>>>>>> > > > > > SSTables that would be involved in each compaction. >>>>>>>>>>>>>>>>> > > > > As described in the CEP example, this can be handled >>>>>>>>>>>>>>>>> > > > > in a single round of repair. We first identify all >>>>>>>>>>>>>>>>> > > > > the points in the grid that require repair, then >>>>>>>>>>>>>>>>> > > > > perform anti-compaction and stream data based on a >>>>>>>>>>>>>>>>> > > > > second scan over those identified points. This >>>>>>>>>>>>>>>>> > > > > applies to the snapshot-based solution—without an >>>>>>>>>>>>>>>>> > > > > index, repairing a single point in that grid requires >>>>>>>>>>>>>>>>> > > > > scanning the entire base table partition (token >>>>>>>>>>>>>>>>> > > > > range). In contrast, with the index-based solution—as >>>>>>>>>>>>>>>>> > > > > in the example you referenced—if a large block of >>>>>>>>>>>>>>>>> > > > > data is corrupted, even though the index is used for >>>>>>>>>>>>>>>>> > > > > comparison, many key mismatches may occur. This can >>>>>>>>>>>>>>>>> > > > > lead to random disk access to the original data >>>>>>>>>>>>>>>>> > > > > files, which could cause performance issues. 
For the >>>>>>>>>>>>>>>>> > > > > case you mentioned for snapshot based solution, it >>>>>>>>>>>>>>>>> > > > > should not take months to repair all the data, >>>>>>>>>>>>>>>>> > > > > instead one round of repair should be enough. The >>>>>>>>>>>>>>>>> > > > > actual repair phase is split from the detection phase. >>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>> > > > > On Thu, Jun 5, 2025 at 12:12 AM Jon Haddad >>>>>>>>>>>>>>>>> > > > > <j...@rustyrazorblade.com> wrote: >>>>>>>>>>>>>>>>> > > > >> > This isn’t really the whole story. The amount of >>>>>>>>>>>>>>>>> > > > >> > wasted scans on index repairs is negligible. If a >>>>>>>>>>>>>>>>> > > > >> > difference is detected with snapshot repairs >>>>>>>>>>>>>>>>> > > > >> > though, you have to read the entire partition from >>>>>>>>>>>>>>>>> > > > >> > both the view and base table to calculate what >>>>>>>>>>>>>>>>> > > > >> > needs to be fixed. >>>>>>>>>>>>>>>>> > > > >> >>>>>>>>>>>>>>>>> > > > >> You nailed it. >>>>>>>>>>>>>>>>> > > > >> >>>>>>>>>>>>>>>>> > > > >> When the base table is converted to a view, and sent >>>>>>>>>>>>>>>>> > > > >> to the view, the information we have is that one of >>>>>>>>>>>>>>>>> > > > >> the view's partition keys needs a repair. That's >>>>>>>>>>>>>>>>> > > > >> going to be different from the partition key of the >>>>>>>>>>>>>>>>> > > > >> base table. As a result, on the base table, for >>>>>>>>>>>>>>>>> > > > >> each affected range, we'd have to issue another >>>>>>>>>>>>>>>>> > > > >> compaction across the entire set of sstables that >>>>>>>>>>>>>>>>> > > > >> could have the data the view needs (potentially many >>>>>>>>>>>>>>>>> > > > >> GB), in order to send over the corrected version of >>>>>>>>>>>>>>>>> > > > >> the partition, then send it over to the view. >>>>>>>>>>>>>>>>> > > > >> Without an index in place, we have to do yet another >>>>>>>>>>>>>>>>> > > > >> scan, per-affected range. >>>>>>>>>>>>>>>>> > > > >> >>>>>>>>>>>>>>>>> > > > >> Consider the case of a single corrupted SSTable on >>>>>>>>>>>>>>>>> > > > >> the view that's removed from the filesystem, or the >>>>>>>>>>>>>>>>> > > > >> data is simply missing after being restored from an >>>>>>>>>>>>>>>>> > > > >> inconsistent backup. It presumably contains lots of >>>>>>>>>>>>>>>>> > > > >> partitions, which maps to base partitions all over >>>>>>>>>>>>>>>>> > > > >> the cluster, in a lot of different token ranges. >>>>>>>>>>>>>>>>> > > > >> For every one of those ranges (hundreds, to tens of >>>>>>>>>>>>>>>>> > > > >> thousands of them given the checkerboard design), >>>>>>>>>>>>>>>>> > > > >> when finding the missing data in the base, you'll >>>>>>>>>>>>>>>>> > > > >> have to perform a compaction across all the SSTables >>>>>>>>>>>>>>>>> > > > >> that potentially contain the missing data just to >>>>>>>>>>>>>>>>> > > > >> rebuild the view-oriented partitions that need to be >>>>>>>>>>>>>>>>> > > > >> sent to the view. The complexity of this operation >>>>>>>>>>>>>>>>> > > > >> can be looked at as O(N*M) where N and M are the >>>>>>>>>>>>>>>>> > > > >> number of ranges in the base table and the view >>>>>>>>>>>>>>>>> > > > >> affected by the corruption, respectively. Without >>>>>>>>>>>>>>>>> > > > >> an index in place, finding the missing data is very >>>>>>>>>>>>>>>>> > > > >> expensive. We potentially have to do it several >>>>>>>>>>>>>>>>> > > > >> times on each node, depending on the size of the >>>>>>>>>>>>>>>>> > > > >> range. 
Smaller ranges increase the size of the >>>>>>>>>>>>>>>>> > > > >> board exponentially, larger ranges increase the >>>>>>>>>>>>>>>>> > > > >> number of SSTables that would be involved in each >>>>>>>>>>>>>>>>> > > > >> compaction. >>>>>>>>>>>>>>>>> > > > >> >>>>>>>>>>>>>>>>> > > > >> Then you send that data over to the view, the view >>>>>>>>>>>>>>>>> > > > >> does it's anti-compaction thing, again, once per >>>>>>>>>>>>>>>>> > > > >> affected range. So now the view has to do an >>>>>>>>>>>>>>>>> > > > >> anti-compaction once per block on the board that's >>>>>>>>>>>>>>>>> > > > >> affected by the missing data. >>>>>>>>>>>>>>>>> > > > >> >>>>>>>>>>>>>>>>> > > > >> Doing hundreds or thousands of these will add up >>>>>>>>>>>>>>>>> > > > >> pretty quickly. >>>>>>>>>>>>>>>>> > > > >> >>>>>>>>>>>>>>>>> > > > >> When I said that a repair could take months, this is >>>>>>>>>>>>>>>>> > > > >> what I had in mind. >>>>>>>>>>>>>>>>> > > > >> >>>>>>>>>>>>>>>>> > > > >> >>>>>>>>>>>>>>>>> > > > >> >>>>>>>>>>>>>>>>> > > > >> >>>>>>>>>>>>>>>>> > > > >> On Tue, Jun 3, 2025 at 11:10 AM Blake Eggleston >>>>>>>>>>>>>>>>> > > > >> <bl...@ultrablake.com> wrote: >>>>>>>>>>>>>>>>> > > > >>> __ >>>>>>>>>>>>>>>>> > > > >>> > Adds overhead in the hot path due to maintaining >>>>>>>>>>>>>>>>> > > > >>> > indexes. Extra memory needed during write path >>>>>>>>>>>>>>>>> > > > >>> > and compaction. >>>>>>>>>>>>>>>>> > > > >>> >>>>>>>>>>>>>>>>> > > > >>> I’d make the same argument about the overhead of >>>>>>>>>>>>>>>>> > > > >>> maintaining the index that Jon just made about the >>>>>>>>>>>>>>>>> > > > >>> disk space required. The relatively predictable >>>>>>>>>>>>>>>>> > > > >>> overhead of maintaining the index as part of the >>>>>>>>>>>>>>>>> > > > >>> write and compaction paths is a pro, not a con. >>>>>>>>>>>>>>>>> > > > >>> Although you’re not always paying the cost of >>>>>>>>>>>>>>>>> > > > >>> building a merkle tree with snapshot repair, it can >>>>>>>>>>>>>>>>> > > > >>> impact the hot path and you do have to plan for it. >>>>>>>>>>>>>>>>> > > > >>> >>>>>>>>>>>>>>>>> > > > >>> > Verifies index content, not actual data—may miss >>>>>>>>>>>>>>>>> > > > >>> > low-probability errors like bit flips >>>>>>>>>>>>>>>>> > > > >>> >>>>>>>>>>>>>>>>> > > > >>> Presumably this could be handled by the views >>>>>>>>>>>>>>>>> > > > >>> performing repair against each other? You could >>>>>>>>>>>>>>>>> > > > >>> also periodically rebuild the index or perform >>>>>>>>>>>>>>>>> > > > >>> checksums against the sstable content. >>>>>>>>>>>>>>>>> > > > >>> >>>>>>>>>>>>>>>>> > > > >>> > Extra data scan during inconsistency detection >>>>>>>>>>>>>>>>> > > > >>> > Index: Since the data covered by certain indexes >>>>>>>>>>>>>>>>> > > > >>> > is not guaranteed to be fully contained within a >>>>>>>>>>>>>>>>> > > > >>> > single node as the topology changes, some data >>>>>>>>>>>>>>>>> > > > >>> > scans may be wasted. >>>>>>>>>>>>>>>>> > > > >>> > Snapshots: No extra data scan >>>>>>>>>>>>>>>>> > > > >>> >>>>>>>>>>>>>>>>> > > > >>> This isn’t really the whole story. The amount of >>>>>>>>>>>>>>>>> > > > >>> wasted scans on index repairs is negligible. If a >>>>>>>>>>>>>>>>> > > > >>> difference is detected with snapshot repairs >>>>>>>>>>>>>>>>> > > > >>> though, you have to read the entire partition from >>>>>>>>>>>>>>>>> > > > >>> both the view and base table to calculate what >>>>>>>>>>>>>>>>> > > > >>> needs to be fixed. 
>>>>>>>>>>>>>>>>> > > > >>> >>>>>>>>>>>>>>>>> > > > >>> On Tue, Jun 3, 2025, at 10:27 AM, Jon Haddad wrote: >>>>>>>>>>>>>>>>> > > > >>>> One practical aspect that isn't immediately >>>>>>>>>>>>>>>>> > > > >>>> obvious is the disk space consideration for >>>>>>>>>>>>>>>>> > > > >>>> snapshots. >>>>>>>>>>>>>>>>> > > > >>>> >>>>>>>>>>>>>>>>> > > > >>>> When you have a table with a mixed workload using >>>>>>>>>>>>>>>>> > > > >>>> LCS or UCS with scaling parameters like L10 and >>>>>>>>>>>>>>>>> > > > >>>> initiate a repair, the disk usage will increase as >>>>>>>>>>>>>>>>> > > > >>>> long as the snapshot persists and the table >>>>>>>>>>>>>>>>> > > > >>>> continues to receive writes. This aspect is >>>>>>>>>>>>>>>>> > > > >>>> understood and factored into the design. >>>>>>>>>>>>>>>>> > > > >>>> >>>>>>>>>>>>>>>>> > > > >>>> However, a more nuanced point is the necessity to >>>>>>>>>>>>>>>>> > > > >>>> maintain sufficient disk headroom specifically for >>>>>>>>>>>>>>>>> > > > >>>> running repairs. This echoes the challenge with >>>>>>>>>>>>>>>>> > > > >>>> STCS compaction, where enough space must be >>>>>>>>>>>>>>>>> > > > >>>> available to accommodate the largest SSTables, >>>>>>>>>>>>>>>>> > > > >>>> even when they are not being actively compacted. >>>>>>>>>>>>>>>>> > > > >>>> >>>>>>>>>>>>>>>>> > > > >>>> For example, if a repair involves rewriting 100GB >>>>>>>>>>>>>>>>> > > > >>>> of SSTable data, you'll consistently need to >>>>>>>>>>>>>>>>> > > > >>>> reserve 100GB of free space to facilitate this. >>>>>>>>>>>>>>>>> > > > >>>> >>>>>>>>>>>>>>>>> > > > >>>> Therefore, while the snapshot-based approach leads >>>>>>>>>>>>>>>>> > > > >>>> to variable disk space utilization, operators must >>>>>>>>>>>>>>>>> > > > >>>> provision storage as if the maximum potential >>>>>>>>>>>>>>>>> > > > >>>> space will be used at all times to ensure repairs >>>>>>>>>>>>>>>>> > > > >>>> can be executed. >>>>>>>>>>>>>>>>> > > > >>>> >>>>>>>>>>>>>>>>> > > > >>>> This introduces a rate of churn dynamic, where the >>>>>>>>>>>>>>>>> > > > >>>> write throughput dictates the required extra disk >>>>>>>>>>>>>>>>> > > > >>>> space, rather than the existing on-disk data >>>>>>>>>>>>>>>>> > > > >>>> volume. >>>>>>>>>>>>>>>>> > > > >>>> >>>>>>>>>>>>>>>>> > > > >>>> If 50% of your SSTables are rewritten during a >>>>>>>>>>>>>>>>> > > > >>>> snapshot, you would need 50% free disk space. >>>>>>>>>>>>>>>>> > > > >>>> Depending on the workload, the snapshot method >>>>>>>>>>>>>>>>> > > > >>>> could consume significantly more disk space than >>>>>>>>>>>>>>>>> > > > >>>> an index-based approach. Conversely, for >>>>>>>>>>>>>>>>> > > > >>>> relatively static workloads, the index method >>>>>>>>>>>>>>>>> > > > >>>> might require more space. It's not as >>>>>>>>>>>>>>>>> > > > >>>> straightforward as stating "No extra disk space >>>>>>>>>>>>>>>>> > > > >>>> needed". >>>>>>>>>>>>>>>>> > > > >>>> >>>>>>>>>>>>>>>>> > > > >>>> Jon >>>>>>>>>>>>>>>>> > > > >>>> >>>>>>>>>>>>>>>>> > > > >>>> On Mon, Jun 2, 2025 at 2:49 PM Runtian Liu >>>>>>>>>>>>>>>>> > > > >>>> <curly...@gmail.com> wrote: >>>>>>>>>>>>>>>>> > > > >>>>> > Regarding your comparison between approaches, I >>>>>>>>>>>>>>>>> > > > >>>>> > think you also need to take into account the >>>>>>>>>>>>>>>>> > > > >>>>> > other dimensions that have been brought up in >>>>>>>>>>>>>>>>> > > > >>>>> > this thread. Things like minimum repair times >>>>>>>>>>>>>>>>> > > > >>>>> > and vulnerability to outages and topology >>>>>>>>>>>>>>>>> > > > >>>>> > changes are the first that come to mind. 
>>>>>>>>>>>>>>>>> > > > >>>>> Sure, I added a few more points.
>>>>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>>>>> > > > >>>>> *Perspective* | *Index-Based Solution* | *Snapshot-Based Solution*
>>>>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>>>>> > > > >>>>> 1. Hot path overhead | Index: Adds overhead in the hot path due to maintaining indexes. Extra memory needed during write path and compaction. | Snapshot: No impact on the hot path
>>>>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>>>>> > > > >>>>> 2. Extra disk usage when repair is not running | Index: Requires additional disk space to store persistent indexes | Snapshot: No extra disk space needed
>>>>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>>>>> > > > >>>>> 3. Extra disk usage during repair | Index: Minimal or no additional disk usage | Snapshot: Requires additional disk space for snapshots
>>>>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>>>>> > > > >>>>> 4. Fine-grained repair to deal with emergency situations / topology changes | Index: Supports fine-grained repairs by targeting specific index ranges. This allows repair to be retried on smaller data sets, enabling incremental progress when repairing the entire table. This is especially helpful when there are down nodes or topology changes during repair, which are common in day-to-day operations. | Snapshot: Coordination across all nodes is required over a long period of time. For each round of repair, if all replica nodes are down or if there is a topology change, the data ranges that were not covered will need to be repaired in the next round.
>>>>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>>>>> > > > >>>>> 5. Validating data used in reads directly | Index: Verifies index content, not actual data—may miss low-probability errors like bit flips | Snapshot: Verifies actual data content, providing stronger correctness guarantees
>>>>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>>>>> > > > >>>>> 6. Extra data scan during inconsistency detection | Index: Since the data covered by certain indexes is not guaranteed to be fully contained within a single node as the topology changes, some data scans may be wasted. | Snapshot: No extra data scan
>>>>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>>>>> > > > >>>>> 7. The overhead of actual data repair after an inconsistency is detected | Index: Only indexes are streamed to the base table node, and the actual data being fixed can be as accurate as the row level. | Snapshot: Anti-compaction is needed on the MV table, and potential over-streaming may occur due to the lack of row-level insight into data quality.
>>>>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>>>>> > > > >>>>> > one of my biggest concerns I haven't seen discussed much is LOCAL_SERIAL/SERIAL on read
>>>>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>>>>> > > > >>>>> Paxos v2 introduces an optimization where serial reads can be completed in just one round trip, reducing latency compared to traditional Paxos which may require multiple phases.
>>>>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>>>>> > > > >>>>> > I think a refresh would be low-cost and give users the flexibility to run them however they want.
>>>>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>>>>> > > > >>>>> I think this is an interesting idea. Does it suggest that the MV should be rebuilt on a regular schedule? It sounds like an extension of the snapshot-based approach—rather than detecting mismatches, we would periodically reconstruct a clean version of the MV based on the snapshot. This seems to diverge from the current MV model in Cassandra, where consistency between the MV and base table must be maintained continuously. This could be an extension of the CEP-48 work, where the MV is periodically rebuilt from a snapshot of the base table, assuming the user can tolerate some level of staleness in the MV data.
>>>>>>>>>>>>>>>>> > > > >>> >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>> >>>>>>> >>>>> >>>