> What if the CEP includes an interface for MV repair that calls out to some user pluggable solution and the spark-based solution you've developed is the first / only reference solution available at the outset? That way we could integrate it into the control plane (nodetool, JMX, a post CEP-37 world), have a solution available for those comfortable taking on the spark dependency, and have a clear paved path for future anti-entropy work on where to fit into the ecosystem.
It is acceptable to include an orchestration mechanism as a stretch goal of this CEP, which would provide the following capabilities:

1. Trigger MV Repair: Invoke a pluggable JAR to start the MV repair process via a method such as:
   String invokeMVRepair(String keyspace, String baseTable, String mvTable, List<Object> options)

2. Track Job Status: Expose an API to query the status of the repair job:
   JobStatus jobStatus(String ID)

This orchestration framework could be built on top of the existing AutoRepair infrastructure from CEP-37, and extended within Cassandra to manage MV-specific repair workflows. It would process one or more MVs at a time by invoking the repair via API #1 and tracking the status via API #2. The default implementation can be a noop, allowing users to plug in their own logic.

This design leaves room for integration with external repair solutions, such as the cassandra-mv-repair-spark-job <https://github.com/jaydeepkumar1984/cassandra-mv-repair-spark-job>, or for users to build custom repair mechanisms. However, it's important to note that the cassandra-mv-repair-spark-job <https://github.com/jaydeepkumar1984/cassandra-mv-repair-spark-job> itself is explicitly out of scope for this CEP, as it has already been addressed in a separate discussion thread <https://lists.apache.org/thread/d3qo3vjxn4116htf175yzcg94s6jq07d>.

Jaydeep

On Fri, Aug 1, 2025 at 6:42 AM Josh McKenzie <jmcken...@apache.org> wrote: > Definitely want to avoid scope creep, *however*... ;) > > What if the CEP includes an interface for MV repair that calls out to some > user pluggable solution and the spark-based solution you've developed is > the first / only reference solution available at the outset? That way we > could integrate it into the control plane (nodetool, JMX, a post CEP-37 > world), have a solution available for those comfortable taking on the spark > dependency, and have a clear paved path for future anti-entropy work on > where to fit into the ecosystem. > > On Thu, Jul 31, 2025, at 5:20 PM, Runtian Liu wrote: > > Based on our discussion, it seems we’ve reached broad agreement on the > hot-path optimization for strict MV consistency mode. We still have > concerns about the different approaches for the repair path. I’ve updated > the CEP’s scope and retitled it *‘Cassandra Materialized View > Consistency, Reliability & Backfill Enhancement’.* I removed the repair > section and added a backfill section that proposes an optimized strategy to > backfill MVs from the base table. Please take a look and share your > thoughts. I believe this backfill strategy will also benefit future MV > repair work. > > Regarding repair, > > While repair is essential to maintain Materialized View (MV) consistency > in the face of one-off bugs or bit rot, implementing a robust, native > repair mechanism within Apache Cassandra remains a highly complex and still > largely theoretical challenge, irrespective of the strategies considered in > prior discussions. > > To ensure steady progress, we propose addressing the problem > incrementally. This CEP focuses on making the hot-path & backfill reliable, > laying a solid foundation. A future CEP can then address repair mechanisms > within the Apache Cassandra ecosystem in a more focused and well-scoped > manner.
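For illustration, a minimal Java sketch of the pluggable orchestration hook described at the top of this message. Only the two method signatures come from the proposal; the interface name, the JobStatus values, and the noop constant are hypothetical placeholders, not an agreed-upon Cassandra API:

    import java.util.List;

    // Sketch only: a pluggable MV repair orchestration hook. invokeMVRepair and
    // jobStatus mirror the signatures quoted above; everything else here is a
    // hypothetical illustration.
    interface MaterializedViewRepairProvider
    {
        enum JobStatus { NOT_STARTED, RUNNING, SUCCEEDED, FAILED, UNKNOWN }

        // API #1: start a repair for one MV and return an opaque job ID.
        String invokeMVRepair(String keyspace, String baseTable, String mvTable, List<Object> options);

        // API #2: poll the status of a previously started job.
        JobStatus jobStatus(String ID);

        // Possible default: do nothing, so the feature stays inert unless an
        // operator plugs in a real provider (for example, a Spark-based job).
        MaterializedViewRepairProvider NOOP = new MaterializedViewRepairProvider()
        {
            @Override
            public String invokeMVRepair(String keyspace, String baseTable, String mvTable, List<Object> options)
            {
                return "noop";
            }

            @Override
            public JobStatus jobStatus(String ID)
            {
                return JobStatus.UNKNOWN;
            }
        };
    }

Under such a sketch, an external tool like cassandra-mv-repair-spark-job could be wired in as one implementation behind API #1 and #2, while the noop default keeps the hook inactive unless explicitly configured.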
> For now, the proposed CEP clarifies that users may choose to either > evaluate the external Spark-based repair tool > <https://github.com/jaydeepkumar1984/cassandra-mv-repair-spark-job> > (which is not part of this proposal) or develop their own repair solution > tailored to their operational needs. > > On Thu, Jul 3, 2025 at 4:32 PM Runtian Liu <curly...@gmail.com> wrote: > > I believe that, with careful design, we could make the row-level repair > work using the index-based approach. > However, regarding the CEP-proposed solution, I want to highlight that the > entire repair job can be divided into two parts:
>
> 1. Inconsistency detection
> 2. Rebuilding the inconsistent ranges
>
> The second step could look like this:
>
> nodetool mv_rebuild --base_table_range <b_start,b_end> --mv_ranges [<mv_start1, mv_end1>, <mv_start2, mv_end2>, ...]
>
> The base table range and MV ranges would come from the first step. But > it’s also possible to run this rebuild directly—without detecting > inconsistencies—if we already know that some nodes need repair. This means > we don’t have to wait for the snapshot processing to start the repair. > Snapshot-based detection works well when the cluster is generally healthy. > > Compared to the current rebuild, the proposed approach doesn’t require > dropping the MV and rebuilding it from scratch across the entire > cluster—which is a major blocker for production environments. > > That said, for the index-based approach, I still think it introduces > additional load on both the write path and compaction. This increases the > resource cost of using MVs. While it’s a nice feature to enable row-level > MV repair, its implementation is complex. On the other hand, the original > proposal’s rebuild strategy is more versatile and likely to perform > better—especially when only a few nodes need repair. In contrast, > large-scale row-level comparisons via the index-based approach could be > prohibitively expensive. Just to clarify my intention here: this is not a > complete 'no' to the index-based approach. However, for the initial > version, I believe it's more prudent to avoid impacting the hot path. We > can start with a simpler design that keeps the critical path untouched, and > then iterate to make it more robust over time—much like how other features > in Apache Cassandra have evolved. For example, 'full repair' came first, > followed by 'incremental repair'; similarly, STCS was later complemented by > LCS. This phased evolution allows us to balance safety, stability, and > long-term capability. > > Given all this, I still prefer to move forward with the original proposal. > It allows us to avoid paying the repair overhead most of the time. > > > On Tue, Jun 24, 2025 at 3:25 PM Blake Eggleston <bl...@ultrablake.com> > wrote: > > > Those are both fair points. Once you start dealing with data loss though, > maintaining guarantees is often impossible, so I’m not sure that torn > writes or updated timestamps are dealbreakers, but I’m fine with tabling > option 2 for now and seeing if we can figure out something better. > > Regarding the assassin cells, if you wanted to avoid explicitly agreeing > on a value, you might be able to only issue them for repaired base data, > which has been implicitly agreed upon. > > I think that or something like it is worth exploring. The idea would be to > solve this issue as completely as anti-compaction would - but without > having to rewrite sstables.
I’d be interested to hear any ideas you have > about how that might work. > > You basically need a mechanism to erase some piece of data that was > written before a given wall clock time - regardless of cell timestamp, and > without precluding future updates (in wall clock time) with earlier > timestamps. > > On Mon, Jun 23, 2025, at 4:28 PM, Runtian Liu wrote: > > In the second option, we use the repair timestamp to re-update any cell or > row we want to fix in the base table. This approach is problematic because > it alters the write time of user-supplied data. Although Cassandra does not > allow users to set timestamps for LWT writes, users may still rely on the > update time. A key limitation of this approach is that it cannot fix cases > where a view cell ends up in a future state while the base table remains > correct. I now understand your point that Cassandra cannot handle this > scenario today. However, as I mentioned earlier, the important distinction > is that when this issue occurs in the base table, we accept the "incorrect" > data as valid—but this is not acceptable for materialized views, since the > source of truth (the base table) still holds the correct data. > > On Mon, Jun 23, 2025 at 12:05 PM Blake Eggleston <bl...@ultrablake.com> > wrote: > > > > Sorry, Blake—I was traveling last week and couldn’t reply to your email > sooner. > > No worries, I’ll be taking some time off soon as well. > > > I don’t think the first or second option is ideal. We should treat the > base table as the source of truth. Modifying it—or forcing an update on it, > even if it’s just a timestamp change—is not a good approach and won’t solve > all problems. > > I agree the first option probably isn’t the right way to go. Could you say > a bit more about why the second option is not a good approach and which > problems it won’t solve? > > On Sun, Jun 22, 2025, at 6:09 PM, Runtian Liu wrote: > > Sorry, Blake—I was traveling last week and couldn’t reply to your email > sooner. > > > First - we interpret view data with higher timestamps than the base > table as data that’s missing from the base and replicate it into the base > table. The timestamp of the missing data may be below the paxos timestamp > low bound so we’d have to adjust the paxos coordination logic to allow that > in this case. Depending on how the view got this way it may also tear > writes to the base table, breaking the write atomicity promise. > > As discussed earlier, we want this MV repair mechanism to handle all edge > cases. However, it would be difficult to design it in a way that detects > the root cause of each mismatch and repairs it accordingly. Additionally, > as you mentioned, this approach could introduce other issues, such as > violating the write atomicity guarantee. > > > Second - If this happens it means that we’ve either lost base table data > or paxos metadata. If that happened, we could force a base table update > that rewrites the current base state with new timestamps making the extra > view data removable. However this wouldn’t fix the case where the view cell > has a timestamp from the future - although that’s not a case that C* can > fix today either. > > I don’t think the first or second option is ideal. We should treat the > base table as the source of truth. Modifying it—or forcing an update on it, > even if it’s just a timestamp change—is not a good approach and won’t solve > all problems. 
> > > the idea to use anti-compaction makes a lot more sense now (in principle > - I don’t think it’s workable in practice) > > I have one question regarding anti-compaction. Is the main concern that > processing too much data during anti-compaction could cause issues for the > cluster? > > > I guess you could add some sort of assassin cell that is meant to remove > a cell with a specific timestamp and value, but is otherwise invisible. > > The idea of the assassination cell is interesting. To prevent data from > being incorrectly removed during the repair process, we need to ensure a > quorum of nodes is available and agrees on the same value before repairing > a materialized view (MV) row or cell. However, this could be very > expensive, as it requires coordination to repair even a single row. > > I think there are a few key differences between MV repair and normal > anti-entropy repair:
>
> 1. For normal anti-entropy repair, there is no single source of truth. All replicas act as sources of truth, and eventual consistency is achieved by merging data across them. Even if some data is lost or bugs result in future timestamps, the replicas will eventually converge on the same (possibly incorrect) value, and that value becomes the accepted truth. In contrast, MV repair relies on the base table as the source of truth and modifies the MV if inconsistencies are detected. This means that in some cases, simply merging data won’t resolve the issue, since Cassandra resolves conflicts using timestamps and there’s no guarantee the base table will always "win" unless we change the MV merging logic—which I’ve been trying to avoid.
> 2. I see MV repair as more of an on-demand operation, whereas normal anti-entropy repair needs to run regularly. This means we shouldn’t treat MV repair the same way as existing repairs. When an operator initiates MV repair, they need to ensure that sufficient resources are available to support it.
>
> > On Thu, Jun 12, 2025 at 8:53 AM Blake Eggleston <bl...@ultrablake.com> > wrote: > > > Got it, thanks for clearing that up. I’d seen the strict liveness code > around but didn’t realize it was MV related and hadn’t dug into what it did > or how it worked. > > I think you’re right about the row liveness update working for extra data > with timestamps lower than the most recent base table update. > > I see what you mean about the timestamp from the future case. I thought of > 3 options: > > First - we interpret view data with higher timestamps than the base table > as data that’s missing from the base and replicate it into the base table. > The timestamp of the missing data may be below the paxos timestamp low > bound so we’d have to adjust the paxos coordination logic to allow that in > this case. Depending on how the view got this way it may also tear writes > to the base table, breaking the write atomicity promise. > > Second - If this happens it means that we’ve either lost base table data > or paxos metadata. If that happened, we could force a base table update > that rewrites the current base state with new timestamps making the extra > view data removable. However this wouldn’t fix the case where the view cell > has a timestamp from the future - although that’s not a case that C* can > fix today either. > > Third - we add a new tombstone type or some mechanism to delete specific > cells that doesn’t preclude correct writes with lower timestamps from being > visible.
I’m not sure how this would work, and the idea to use > anti-compaction makes a lot more sense now (in principle - I don’t think > it’s workable in practice). I guess you could add some sort of assassin > cell that is meant to remove a cell with a specific timestamp and value, > but is otherwise invisible. This seems dangerous though, since it’s likely > there’s a replication problem that may resolve itself and the repair > process would actually be removing data that the user intended to write. > > Paulo - I don’t think storage changes are off the table, but they do > expand the scope and risk of the proposal, so we should be careful. > > On Wed, Jun 11, 2025, at 4:44 PM, Paulo Motta wrote: > > > I’m not sure if this is the only edge case—there may be other issues as > well. I’m also unsure whether we should redesign the tombstone handling for > MVs, since that would involve changes to the storage engine. To minimize > impact there, the original proposal was to rebuild the affected ranges > using anti-compaction, just to be safe. > > I haven't been following the discussion but I think one of the issues with > the materialized view "strict liveness" fix[1] is that we avoided making > invasive changes to the storage engine at the time, but this was considered > by Zhao on [1]. I think we shouldn't be trying to avoid updates to the > storage format as part of the MV implementation, if this is what it takes > to make MVs V2 reliable. > > [1] - > https://issues.apache.org/jira/browse/CASSANDRA-11500?focusedCommentId=16101603&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16101603 > > On Wed, Jun 11, 2025 at 7:02 PM Runtian Liu <curly...@gmail.com> wrote: > > The current design leverages strict liveness to shadow the old view row. > When the view-indexed value changes from 'a' to 'b', no tombstone is > written; instead, the old row is marked as expired by updating its liveness > info with the timestamp of the change. If the column is later set back to > 'a', the view row is re-inserted with a new, non-expired liveness info > reflecting the latest timestamp. > > To delete an extra row in the materialized view (MV), we can likely use > the same approach—marking it as shadowed by updating the liveness info. > However, in the case of inconsistencies where a column in the MV has a > higher timestamp than the corresponding column in the base table, this > row-level liveness mechanism is insufficient. > > Even for the case where we delete the row by marking its liveness info as > expired during repair, there are concerns. Since this introduces a data > mutation as part of the repair process, it’s unclear whether there could be > edge cases we’re missing. This approach may risk unexpected side effects if > the repair logic is not carefully aligned with write path semantics. > > On Thu, Jun 12, 2025 at 3:59 AM Blake Eggleston <bl...@ultrablake.com> > wrote: > > > That’s a good point, although as described I don’t think that could ever > work properly, even in normal operation. Either we’re misunderstanding > something, or this is a flaw in the current MV design. > > Assuming changing the view indexed column results in a tombstone being > applied to the view row for the previous value, if we wrote the other base > columns (the non view indexed ones) to the view with the same timestamps > they have on the base, then changing the view indexed value from ‘a’ to > ‘b’, then back to ‘a’ would always cause this problem. 
I think you’d need > to always update the column timestamps on the view to be >= the view column > timestamp on the base > > On Tue, Jun 10, 2025, at 11:38 PM, Runtian Liu wrote: > > > In the case of a missed update, we'll have a new value and we can send > a tombstone to the view with the timestamp of the most recent update. > > > then something has gone wrong and we should issue a tombstone using the > paxos repair timestamp as the tombstone timestamp. > > The current MV implementation uses “strict liveness” to determine whether > a row is live. I believe that using regular tombstones during repair could > cause problems. For example, consider a base table with schema (pk, ck, v1, > v2) and a materialized view with schema (v1, pk, ck) -> v2. If, for some > reason, we detect an extra row in the MV and delete it using a tombstone > with the latest update timestamp, we may run into issues. Suppose we later > update the base table’s v1 field to match the MV row we previously deleted, > and the v2 value now has an older timestamp. In that case, the previously > issued tombstone could still shadow the v2 column, which is unintended. > That is why I was asking if we are going to introduce a new kind of > tombstones. I’m not sure if this is the only edge case—there may be other > issues as well. I’m also unsure whether we should redesign the tombstone > handling for MVs, since that would involve changes to the storage engine. > To minimize impact there, the original proposal was to rebuild the affected > ranges using anti-compaction, just to be safe. > > On Wed, Jun 11, 2025 at 1:20 AM Blake Eggleston <bl...@ultrablake.com> > wrote: > > > Extra row in MV (assuming the tombstone is gone in the base table) — how > should we fix this? > > > > This would mean that the base table had either updated or deleted a row > and the view didn't receive the corresponding delete. > > In the case of a missed update, we'll have a new value and we can send a > tombstone to the view with the timestamp of the most recent update. Since > timestamps issued by paxos and accord writes are always increasing > monotonically and don't have collisions, this is safe. > > In the case of a row deletion, we'd also want to send a tombstone with the > same timestamp, however since tombstones can be purged, we may not have > that information and would have to treat it like the view has a higher > timestamp than the base table. > > Inconsistency (timestamps don’t match) — it’s easy to fix when the base > table has higher timestamps, but how do we resolve it when the MV columns > have higher timestamps? > > > There are 2 ways this could happen. First is that a write failed and paxos > repair hasn't completed it, which is expected, and the second is a > replication bug or base table data loss. You'd need to compare the view > timestamp to the paxos repair history to tell which it is. If the view > timestamp is higher than the most recent paxos repair timestamp for the > key, then it may just be a failed write and we should do nothing. If the > view timestamp is less than the most recent paxos repair timestamp for that > key and higher than the base timestamp, then something has gone wrong and > we should issue a tombstone using the paxos repair timestamp as the > tombstone timestamp. This is safe to do because the paxos repair timestamps > act as a low bound for ballots paxos will process, so it wouldn't be > possible for a legitimate write to be shadowed by this tombstone. 
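To make the rule above concrete, here is a rough Java sketch of the comparison being described. All type and method names are hypothetical illustrations, not Cassandra repair code, and it assumes the relevant timestamps have already been looked up:

    // Sketch of the decision rule for a view cell whose timestamp does not
    // match the base table. Hypothetical names; not actual Cassandra code.
    final class ViewTimestampReconciler
    {
        enum Action
        {
            PROPAGATE_FROM_BASE,          // easy case: base has the higher timestamp
            NONE,                         // leave the view cell alone for now
            TOMBSTONE_AT_PAXOS_REPAIR_TS  // shadow the view cell at the paxos repair timestamp
        }

        /**
         * @param viewTs        timestamp of the extra/conflicting view cell
         * @param baseTs        timestamp of the corresponding base cell (Long.MIN_VALUE if absent or purged)
         * @param paxosRepairTs most recent paxos repair timestamp for the base partition key
         */
        static Action reconcile(long viewTs, long baseTs, long paxosRepairTs)
        {
            if (viewTs <= baseTs)
                return Action.PROPAGATE_FROM_BASE; // base wins on timestamp; propagate base data to the view

            if (viewTs > paxosRepairTs)
                return Action.NONE; // may just be a failed write that paxos repair has not yet completed

            // The view timestamp is above the base but below the paxos repair low
            // bound: something has gone wrong. A tombstone at paxosRepairTs is safe
            // because paxos will not process ballots below that bound, so it cannot
            // shadow a legitimate future write.
            return Action.TOMBSTONE_AT_PAXOS_REPAIR_TS;
        }
    }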
> > Do we need to introduce a new kind of tombstone to shadow the rows in the > MV for cases 2 and 3? If yes, how will this tombstone work? If no, how > should we fix the MV data? > > > No, a normal tombstone would work. > > On Tue, Jun 10, 2025, at 2:42 AM, Runtian Liu wrote: > > Okay, let’s put the efficiency discussion on hold for now. I want to make > sure the actual repair process after detecting inconsistencies will work > with the index-based solution. > > When a mismatch is detected, the MV replica will need to stream its index > file to the base table replica. The base table will then perform a > comparison between the two files. > > There are three cases we need to handle:
>
> 1. Missing row in MV — this is straightforward; we can propagate the data to the MV.
> 2. Extra row in MV (assuming the tombstone is gone in the base table) — how should we fix this?
> 3. Inconsistency (timestamps don’t match) — it’s easy to fix when the base table has higher timestamps, but how do we resolve it when the MV columns have higher timestamps?
>
> Do we need to introduce a new kind of tombstone to shadow the rows in the > MV for cases 2 and 3? If yes, how will this tombstone work? If no, how > should we fix the MV data? > > On Mon, Jun 9, 2025 at 11:00 AM Blake Eggleston <bl...@ultrablake.com> > wrote: > > > > hopefully we can come up with a solution that everyone agrees on. > > I’m sure we can, I think we’ve been making good progress > > > My main concern with the index-based solution is the overhead it adds to > the hot path, as well as having to build indexes periodically. > > So the additional overhead of maintaining a storage attached index on the > client write path is pretty minimal - it’s basically adding data to an in > memory trie. It’s a little extra work and memory usage, but there isn’t any > extra io or other blocking associated with it. I’d expect the latency > impact to be negligible. > > > As mentioned earlier, this MV repair should be an infrequent operation > > I don’t think that’s a safe assumption. There are a lot of situations > outside of data loss bugs where repair would need to be run. > > These use cases could probably be handled by repairing the view with other view replicas:
>
> Scrubbing corrupt sstables
> Node replacement via backup
>
> These use cases would need an actual MV repair to check consistency with the base table:
>
> Restoring a cluster from a backup
> Imported sstables via nodetool import
> Data loss from operator error
> Proactive consistency checks - ie preview repairs
>
> Even if it is an infrequent operation, when operators need it, it needs to > be available and reliable. > > It’s a fact that there are clusters where non-incremental repairs are run > on a cadence of a week or more to manage the overhead of validation > compactions. Assuming the cluster doesn’t have any additional headroom, > that would mean that any one of the above events could cause views to > remain out of sync for up to a week while the full set of merkle trees is > being built. > > This delay eliminates a lot of the value of repair as a risk mitigation > tool. If I had to make a recommendation where a bad call could cost me my > job, the prospect of a 7 day delay on repair would mean a strong no. > > Some users also run preview repair continuously to detect data consistency > errors, so at least a subset of users will probably be running MV repairs > continuously - at least in preview mode.
> > That’s why I say that the replication path should be designed to never > need repair, and MV repair should be designed to be prepared for the worst. > > > I’m wondering if it’s possible to enable or disable index building > dynamically so that we don’t always incur the cost for something that’s > rarely needed. > > I think this would be a really reasonable compromise as long as the > default is on. That way it’s as safe as possible by default, but users who > don’t care or have a separate system for repairing MVs can opt out. > > > I’m not sure what you mean by “data problems” here. > > I mean out of sync views - either due to bugs, operator error, corruption, > etc > > > Also, this does scale with cluster size—I’ve compared it to full repair, > and this MV repair should behave similarly. That means as long as full > repair works, this repair should work as well. > > You could build the merkle trees at about the same cost as a full repair, > but the actual data repair path is completely different for MV, and that’s > the part that doesn’t scale well. As you know, with normal repair, we just > stream data for ranges detected as out of sync. For MVs, since the data > isn’t in base partition order, the view data for an out of sync view range > needs to be read out and streamed to every base replica that it’s detected > a mismatch against. So in the example I gave with the 300 node cluster, > you’re looking at reading and transmitting the same partition at least 100 > times in the best case, and the cost of this keeps going up as the cluster > increases in size. That's the part that doesn't scale well. > > This is also one of the benefits of the index design. Since it stores data in > segments that roughly correspond to points on the grid, you’re not > rereading the same data over and over. A repair for a given grid point only > reads an amount of data proportional to the data in common for the > base/view grid point, and it’s stored in a small enough granularity that > the base can calculate what data needs to be sent to the view without > having to read the entire view partition. > > On Sat, Jun 7, 2025, at 7:42 PM, Runtian Liu wrote: > > Thanks, Blake. I’m open to iterating on the design, and hopefully we can > come up with a solution that everyone agrees on. > > My main concern with the index-based solution is the overhead it adds to > the hot path, as well as having to build indexes periodically. As mentioned > earlier, this MV repair should be an infrequent operation, but the > index-based approach shifts some of the work to the hot path in order to > allow repairs that touch only a few nodes. > > I’m wondering if it’s possible to enable or disable index building > dynamically so that we don’t always incur the cost for something that’s > rarely needed. > > > it degrades operators’ ability to react to data problems by imposing a > significant upfront processing burden on repair, and that it doesn’t scale > well with cluster size > > I’m not sure what you mean by “data problems” here. Also, this does scale > with cluster size—I’ve compared it to full repair, and this MV repair > should behave similarly. That means as long as full repair works, this > repair should work as well. > > For example, regardless of how large the cluster is, you can always enable > Merkle tree building on 10% of the nodes at a time until all the trees are > ready.
> > I understand that coordinating this type of repair is harder than what we > currently support, but with CEP-37, we should be able to handle this > coordination without adding too much burden on the operator side. > > On Sat, Jun 7, 2025 at 8:28 AM Blake Eggleston <bl...@ultrablake.com> > wrote: > > > I don't see any outcome here that is good for the community though. Either > Runtian caves and adopts your design that he (and I) consider inferior, or > he is prevented from contributing this work. > > > Hey Runtian, fwiw, these aren't the only 2 options. This isn’t a > competition. We can collaborate and figure out the best approach to the > problem. I’d like to keep discussing it if you’re open to iterating on the > design. > > I’m not married to our proposal, it’s just the cleanest way we could think > of to address what Jon and I both see as blockers in the current proposal. > It’s not set in stone though. > > On Fri, Jun 6, 2025, at 1:32 PM, Benedict Elliott Smith wrote: > > Hmm, I am very surprised as I helped write that and I distinctly recall a > specific goal was avoiding binding vetoes as they're so toxic. > > Ok, I guess you can block this work if you like. > > I don't see any outcome here that is good for the community though. Either > Runtian caves and adopts your design that he (and I) consider inferior, or > he is prevented from contributing this work. That isn't a functioning > community in my mind, so I'll be noping out for a while, as I don't see > much value here right now. > > > On 2025/06/06 18:31:08 Blake Eggleston wrote: > > Hi Benedict, that’s actually not true. > > > > Here’s a link to the project governance page: _https:// > cwiki.apache.org/confluence/display/CASSANDRA/Cassandra+Project+Governance_ > > > > The CEP section says: > > > > “*Once the proposal is finalized and any major committer dissent > reconciled, call a [VOTE] on the ML to have the proposal adopted. The > criteria for acceptance is consensus (3 binding +1 votes and no binding > vetoes). The vote should remain open for 72 hours.*” > > > > So they’re definitely vetoable. > > > > Also note the part about “*Once the proposal is finalized and any major > committer dissent reconciled,*” being a prerequisite for moving a CEP to > [VOTE]. Given the as yet unreconciled committer dissent, it wouldn’t even > be appropriate to move to a VOTE until we get to the bottom of this repair > discussion. > > > > On Fri, Jun 6, 2025, at 12:31 AM, Benedict Elliott Smith wrote: > > > > but the snapshot repair design is not a viable path forward. It’s > the first iteration of a repair design. We’ve proposed a second iteration, > and we’re open to a third iteration. > > > > > > I shan't be participating further in discussion, but I want to make a > point of order. The CEP process has no vetoes, so you are not empowered to > declare that a design is not viable without the input of the wider > community. > > > > > > > > > On 2025/06/05 03:58:59 Blake Eggleston wrote: > > > > You can detect and fix the mismatch in a single round of repair, but > the amount of work needed to do it is _significantly_ higher with snapshot > repair. Consider a case where we have a 300 node cluster w/ RF 3, where > each view partition contains entries mapping to every token range in the > cluster - so 100 ranges. If we lose a view sstable, it will affect an > entire row/column of the grid. Repair is going to scan all data in the > mismatching view token ranges 100 times, and each base range once. So > you’re looking at 200 range scans. 
> > > > > > > > Now, you may argue that you can merge the duplicate view scans into > a single scan while you repair all token ranges in parallel. I’m skeptical > that’s going to be achievable in practice, but even if it is, we’re now > talking about the view replica hypothetically doing a pairwise repair with > every other replica in the cluster at the same time. Neither of these > options is workable. > > > > > > > > Let’s take a step back though, because I think we’re getting lost in > the weeds. > > > > > > > > The repair design in the CEP has some high level concepts that make > a lot of sense, the idea of repairing a grid is really smart. However, it > has some significant drawbacks that remain unaddressed. I want this CEP to > succeed, and I know Jon does too, but the snapshot repair design is not a > viable path forward. It’s the first iteration of a repair design. We’ve > proposed a second iteration, and we’re open to a third iteration. This part > of the CEP process is meant to identify and address shortcomings, I don’t > think that continuing to dissect the snapshot repair design is making > progress in that direction. > > > > > > > > On Wed, Jun 4, 2025, at 2:04 PM, Runtian Liu wrote: > > > > > > We potentially have to do it several times on each node, > depending on the size of the range. Smaller ranges increase the size of the > board exponentially, larger ranges increase the number of SSTables that > would be involved in each compaction. > > > > > As described in the CEP example, this can be handled in a single > round of repair. We first identify all the points in the grid that require > repair, then perform anti-compaction and stream data based on a second scan > over those identified points. This applies to the snapshot-based > solution—without an index, repairing a single point in that grid requires > scanning the entire base table partition (token range). In contrast, with > the index-based solution—as in the example you referenced—if a large block > of data is corrupted, even though the index is used for comparison, many > key mismatches may occur. This can lead to random disk access to the > original data files, which could cause performance issues. For the case you > mentioned for snapshot based solution, it should not take months to repair > all the data, instead one round of repair should be enough. The actual > repair phase is split from the detection phase. > > > > > > > > > > > > > > > On Thu, Jun 5, 2025 at 12:12 AM Jon Haddad < > j...@rustyrazorblade.com> wrote: > > > > >> > This isn’t really the whole story. The amount of wasted scans > on index repairs is negligible. If a difference is detected with snapshot > repairs though, you have to read the entire partition from both the view > and base table to calculate what needs to be fixed. > > > > >> > > > > >> You nailed it. > > > > >> > > > > >> When the base table is converted to a view, and sent to the view, > the information we have is that one of the view's partition keys needs a > repair. That's going to be different from the partition key of the base > table. As a result, on the base table, for each affected range, we'd have > to issue another compaction across the entire set of sstables that could > have the data the view needs (potentially many GB), in order to send over > the corrected version of the partition, then send it over to the view. > Without an index in place, we have to do yet another scan, per-affected > range. 
> > > > >> Consider the case of a single corrupted SSTable on the view > that's removed from the filesystem, or the data is simply missing after > being restored from an inconsistent backup. It presumably contains lots of > partitions, which map to base partitions all over the cluster, in a lot of > different token ranges. For every one of those ranges (hundreds, to tens > of thousands of them given the checkerboard design), when finding the > missing data in the base, you'll have to perform a compaction across all > the SSTables that potentially contain the missing data just to rebuild the > view-oriented partitions that need to be sent to the view. The complexity > of this operation can be looked at as O(N*M) where N and M are the number > of ranges in the base table and the view affected by the corruption, > respectively. Without an index in place, finding the missing data is very > expensive. We potentially have to do it several times on each node, > depending on the size of the range. Smaller ranges increase the size of > the board exponentially, larger ranges increase the number of SSTables that > would be involved in each compaction. > > > > >> > > > > >> Then you send that data over to the view, the view does its > anti-compaction thing, again, once per affected range. So now the view has > to do an anti-compaction once per block on the board that's affected by the > missing data. > > > > >> > > > > >> Doing hundreds or thousands of these will add up pretty quickly. > > > > >> > > > > >> When I said that a repair could take months, this is what I had > in mind. > > > > >> > > > > >> > > > > >> > > > > >> > > > > >> On Tue, Jun 3, 2025 at 11:10 AM Blake Eggleston < > bl...@ultrablake.com> wrote: > > > > >>> > Adds overhead in the hot path due to maintaining indexes. > Extra memory needed during write path and compaction. > > > > >>> > > > > >>> I’d make the same argument about the overhead of maintaining the > index that Jon just made about the disk space required. The relatively > predictable overhead of maintaining the index as part of the write and > compaction paths is a pro, not a con. Although you’re not always paying the > cost of building a merkle tree with snapshot repair, it can impact the hot > path and you do have to plan for it. > > > > >>> > > > > >>> > Verifies index content, not actual data—may miss > low-probability errors like bit flips > > > > >>> > > > > >>> Presumably this could be handled by the views performing repair > against each other? You could also periodically rebuild the index or > perform checksums against the sstable content. > > > > >>> > > > > >>> > Extra data scan during inconsistency detection > > > > >>> > Index: Since the data covered by certain indexes is not > guaranteed to be fully contained within a single node as the topology > changes, some data scans may be wasted. > > > > >>> > Snapshots: No extra data scan > > > > >>> > > > > >>> This isn’t really the whole story. The amount of wasted scans on > index repairs is negligible. If a difference is detected with snapshot > repairs though, you have to read the entire partition from both the view > and base table to calculate what needs to be fixed. > > > > >>> > > > > >>> On Tue, Jun 3, 2025, at 10:27 AM, Jon Haddad wrote: > > > > >>>> One practical aspect that isn't immediately obvious is the disk > space consideration for snapshots.
> > > > >>>> When you have a table with a mixed workload using LCS or UCS > with scaling parameters like L10 and initiate a repair, the disk usage will > increase as long as the snapshot persists and the table continues to > receive writes. This aspect is understood and factored into the design. > > > > >>>> > > > > >>>> However, a more nuanced point is the necessity to maintain > sufficient disk headroom specifically for running repairs. This echoes the > challenge with STCS compaction, where enough space must be available to > accommodate the largest SSTables, even when they are not being actively > compacted. > > > > >>>> > > > > >>>> For example, if a repair involves rewriting 100GB of SSTable > data, you'll consistently need to reserve 100GB of free space to facilitate > this. > > > > >>>> > > > > >>>> Therefore, while the snapshot-based approach leads to variable > disk space utilization, operators must provision storage as if the maximum > potential space will be used at all times to ensure repairs can be executed. > > > > >>>> > > > > >>>> This introduces a rate of churn dynamic, where the write > throughput dictates the required extra disk space, rather than the existing > on-disk data volume. > > > > >>>> > > > > >>>> If 50% of your SSTables are rewritten during a snapshot, you > would need 50% free disk space. Depending on the workload, the snapshot > method could consume significantly more disk space than an index-based > approach. Conversely, for relatively static workloads, the index method > might require more space. It's not as straightforward as stating "No extra > disk space needed". > > > > >>>> > > > > >>>> Jon > > > > >>>> > > > > >>>> On Mon, Jun 2, 2025 at 2:49 PM Runtian Liu <curly...@gmail.com> > wrote: > > > > >>>>> > Regarding your comparison between approaches, I think you > also need to take into account the other dimensions that have been brought > up in this thread. Things like minimum repair times and vulnerability to > outages and topology changes are the first that come to mind. > > > > >>>>> > > > > >>>>> Sure, I added a few more points.
> > > > >>>>>
> > > > >>>>> *Perspective* | *Index-Based Solution* | *Snapshot-Based Solution*
> > > > >>>>>
> > > > >>>>> 1. Hot path overhead
> > > > >>>>> Index: Adds overhead in the hot path due to maintaining indexes. Extra memory needed during write path and compaction.
> > > > >>>>> Snapshots: No impact on the hot path
> > > > >>>>>
> > > > >>>>> 2. Extra disk usage when repair is not running
> > > > >>>>> Index: Requires additional disk space to store persistent indexes
> > > > >>>>> Snapshots: No extra disk space needed
> > > > >>>>>
> > > > >>>>> 3. Extra disk usage during repair
> > > > >>>>> Index: Minimal or no additional disk usage
> > > > >>>>> Snapshots: Requires additional disk space for snapshots
> > > > >>>>>
> > > > >>>>> 4. Fine-grained repair to deal with emergency situations / topology changes
> > > > >>>>> Index: Supports fine-grained repairs by targeting specific index ranges. This allows repair to be retried on smaller data sets, enabling incremental progress when repairing the entire table. This is especially helpful when there are down nodes or topology changes during repair, which are common in day-to-day operations.
> > > > >>>>> Snapshots: Coordination across all nodes is required over a long period of time. For each round of repair, if all replica nodes are down or if there is a topology change, the data ranges that were not covered will need to be repaired in the next round.
> > > > >>>>>
> > > > >>>>> 5. Validating data used in reads directly
> > > > >>>>> Index: Verifies index content, not actual data—may miss low-probability errors like bit flips
> > > > >>>>> Snapshots: Verifies actual data content, providing stronger correctness guarantees
> > > > >>>>>
> > > > >>>>> 6. Extra data scan during inconsistency detection
> > > > >>>>> Index: Since the data covered by certain indexes is not guaranteed to be fully contained within a single node as the topology changes, some data scans may be wasted.
> > > > >>>>> Snapshots: No extra data scan
> > > > >>>>>
> > > > >>>>> 7. The overhead of actual data repair after an inconsistency is detected
> > > > >>>>> Index: Only indexes are streamed to the base table node, and the actual data being fixed can be as accurate as the row level.
> > > > >>>>> Snapshots: Anti-compaction is needed on the MV table, and potential over-streaming may occur due to the lack of row-level insight into data quality.
> > > > >>>>>
> > > > >>>>> > one of my biggest concerns I haven't seen discussed much is LOCAL_SERIAL/SERIAL on read
> > > > >>>>>
> > > > >>>>> Paxos v2 introduces an optimization where serial reads can be completed in just one round trip, reducing latency compared to traditional Paxos which may require multiple phases.
> > > > >>>>>
> > > > >>>>> > I think a refresh would be low-cost and give users the flexibility to run them however they want.
> > > > >>>>>
> > > > >>>>> I think this is an interesting idea. Does it suggest that the MV should be rebuilt on a regular schedule? It sounds like an extension of the snapshot-based approach—rather than detecting mismatches, we would periodically reconstruct a clean version of the MV based on the snapshot. This seems to diverge from the current MV model in Cassandra, where consistency between the MV and base table must be maintained continuously. This could be an extension of the CEP-48 work, where the MV is periodically rebuilt from a snapshot of the base table, assuming the user can tolerate some level of staleness in the MV data.