Definitely want to avoid scope creep, *however*... ;) What if the CEP includes an interface for MV repair that calls out to a user-pluggable solution, and the Spark-based solution you've developed is the first/only reference implementation available at the outset? That way we could integrate it into the control plane (nodetool, JMX, a post-CEP-37 world), have a solution available for those comfortable taking on the Spark dependency, and have a clear, paved path for future anti-entropy work showing where it fits into the ecosystem.
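To make that concrete, here is a rough sketch of what such a pluggable hook could look like. Everything here is hypothetical (the names MVRepairProvider and RepairHandle, the string-encoded ranges) and none of it is part of the CEP; the only point is that the control plane would depend on a small interface, with the Spark job registered as the first implementation.

```java
import java.util.Collection;
import java.util.UUID;

/**
 * Hypothetical plugin point for MV repair -- a sketch, not a CEP artifact.
 * nodetool/JMX would talk only to this interface; how the work is actually done
 * (external Spark job, vendor tooling, a future in-tree mechanism) is up to the
 * registered implementation.
 */
public interface MVRepairProvider
{
    /** Short name surfaced by the control plane, e.g. "spark-mv-repair". */
    String name();

    /**
     * Start detection/repair for one base-table token range and the view token
     * ranges it maps to (the same shape as the proposed `nodetool mv_rebuild`
     * arguments), returning a task id the operator can poll.
     */
    UUID submit(String keyspace, String baseTable, String view,
                String baseRange, Collection<String> viewRanges);

    /** Report progress for a previously submitted task. */
    RepairHandle status(UUID taskId);

    /** Minimal task handle; a real API would carry richer progress and error detail. */
    record RepairHandle(UUID id, String state, long repairedRows) {}
}
```

With something like this in place, the paved path is just "implement the interface": the Spark-based tool stays external, but operators drive it through the same nodetool/JMX surface a future native repair would use.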
On Thu, Jul 31, 2025, at 5:20 PM, Runtian Liu wrote: > Based on our discussion, it seems we’ve reached broad agreement on the > hot‑path optimization for strict MV consistency mode. We still have concerns > about the different approaches for the repair path. I’ve updated the CEP’s > scope and retitled it *‘Cassandra Materialized View Consistency, Reliability > & Backfill Enhancement’.* I removed the repair section and added a backfill > section that proposes an optimized strategy to backfill MVs from the base > table. Please take a look and share your thoughts. I believe this backfill > strategy will also benefit future MV repair work. > > Regarding repair, > > While repair is essential to maintain Materialized View (MV) consistency in > the face of one-off bugs or bit rot, implementing a robust, native repair > mechanism within Apache Cassandra remains a highly complex and still largely > theoretical challenge, irrespective of the strategies considered in prior > discussions. > > To ensure steady progress, we propose addressing the problem incrementally. > This CEP focuses on making the hot-path & backfill reliable, laying a solid > foundation. A future CEP can then address repair mechanisms within the Apache > Cassandra ecosystem in a more focused and well-scoped manner. > > For now, the proposed CEP clarifies that users may choose to either evaluate > the external Spark-based repair tool > <https://github.com/jaydeepkumar1984/cassandra-mv-repair-spark-job> (which is > not part of this proposal) or develop their own repair solution tailored to > their operational needs. > > On Thu, Jul 3, 2025 at 4:32 PM Runtian Liu <curly...@gmail.com> wrote: >> I believe that, with careful design, we could make the row-level repair work >> using the index-based approach. >> However, regarding the CEP-proposed solution, I want to highlight that the >> entire repair job can be divided into two parts: >> >> 1. Inconsistency detection >> >> 2. Rebuilding the inconsistent ranges >> >> The second step could look like this: >> >> nodetool mv_rebuild --base_table_range <b_start,b_end> --mv_ranges >> [<mv_start1, mv_end1>, <mv_start2, mv_end2>, ...] >> >> >> >> The base table range and MV ranges would come from the first step. But it’s >> also possible to run this rebuild directly—without detecting >> inconsistencies—if we already know that some nodes need repair. This means >> we don’t have to wait for the snapshot processing to start the repair. >> Snapshot-based detection works well when the cluster is generally healthy. >> >> Compared to the current rebuild, the proposed approach doesn’t require >> dropping the MV and rebuilding it from scratch across the entire >> cluster—which is a major blocker for production environments. >> >> That said, for the index-based approach, I still think it introduces >> additional load on both the write path and compaction. This increases the >> resource cost of using MVs. While it’s a nice feature to enable row-level MV >> repair, its implementation is complex. On the other hand, the original >> proposal’s rebuild strategy is more versatile and likely to perform >> better—especially when only a few nodes need repair. In contrast, >> large-scale row-level comparisons via the index-based approach could be >> prohibitively expensive. Just to clarify my intention here: this is not a >> complete 'no' to the index-based approach. However, for the initial version, >> I believe it's more prudent to avoid impacting the hot path. 
We can start >> with a simpler design that keeps the critical path untouched, and then >> iterate to make it more robust over time—much like how other features in >> Apache Cassandra have evolved. For example, 'full repair' came first, >> followed by 'incremental repair'; similarly, STCS was later complemented by >> LCS. This phased evolution allows us to balance safety, stability, and >> long-term capability. >> >> Given all this, I still prefer to move forward with the original proposal. >> It allows us to avoid paying the repair overhead most of the time. >> >> >> >> On Tue, Jun 24, 2025 at 3:25 PM Blake Eggleston <bl...@ultrablake.com> wrote: >>> __ >>> Those are both fair points. Once you start dealing with data loss though, >>> maintaining guarantees is often impossible, so I’m not sure that torn >>> writes or updated timestamps are dealbreakers, but I’m fine with tabling >>> option 2 for now and seeing if we can figure out something better. >>> >>> Regarding the assassin cells, if you wanted to avoid explicitly agreeing on >>> a value, you might be able to only issue them for repaired base data, which >>> has been implicitly agreed upon. >>> >>> I think that or something like it is worth exploring. The idea would be to >>> solve this issue as completely as anti-compaction would - but without >>> having to rewrite sstables. I’d be interested to hear any ideas you have >>> about how that might work. >>> >>> You basically need a mechanism to erase some piece of data that was written >>> before a given wall clock time - regardless of cell timestamp, and without >>> precluding future updates (in wall clock time) with earlier timestamps. >>> >>> On Mon, Jun 23, 2025, at 4:28 PM, Runtian Liu wrote: >>>> In the second option, we use the repair timestamp to re-update any cell or >>>> row we want to fix in the base table. This approach is problematic because >>>> it alters the write time of user-supplied data. Although Cassandra does >>>> not allow users to set timestamps for LWT writes, users may still rely on >>>> the update time. A key limitation of this approach is that it cannot fix >>>> cases where a view cell ends up in a future state while the base table >>>> remains correct. I now understand your point that Cassandra cannot handle >>>> this scenario today. However, as I mentioned earlier, the important >>>> distinction is that when this issue occurs in the base table, we accept >>>> the "incorrect" data as valid—but this is not acceptable for materialized >>>> views, since the source of truth (the base table) still holds the correct >>>> data. >>>> >>>> On Mon, Jun 23, 2025 at 12:05 PM Blake Eggleston <bl...@ultrablake.com> >>>> wrote: >>>>> __ >>>>> > Sorry, Blake—I was traveling last week and couldn’t reply to your email >>>>> > sooner. >>>>> >>>>> No worries, I’ll be taking some time off soon as well. >>>>> >>>>> > I don’t think the first or second option is ideal. We should treat the >>>>> > base table as the source of truth. Modifying it—or forcing an update on >>>>> > it, even if it’s just a timestamp change—is not a good approach and >>>>> > won’t solve all problems. >>>>> >>>>> I agree the first option probably isn’t the right way to go. Could you >>>>> say a bit more about why the second option is not a good approach and >>>>> which problems it won’t solve? >>>>> >>>>> On Sun, Jun 22, 2025, at 6:09 PM, Runtian Liu wrote: >>>>>> Sorry, Blake—I was traveling last week and couldn’t reply to your email >>>>>> sooner. 
>>>>>> >>>>>> > First - we interpret view data with higher timestamps than the base >>>>>> > table as data that’s missing from the base and replicate it into the >>>>>> > base table. The timestamp of the missing data may be below the paxos >>>>>> > timestamp low bound so we’d have to adjust the paxos coordination >>>>>> > logic to allow that in this case. Depending on how the view got this >>>>>> > way it may also tear writes to the base table, breaking the write >>>>>> > atomicity promise. >>>>>> >>>>>> As discussed earlier, we want this MV repair mechanism to handle all >>>>>> edge cases. However, it would be difficult to design it in a way that >>>>>> detects the root cause of each mismatch and repairs it accordingly. >>>>>> Additionally, as you mentioned, this approach could introduce other >>>>>> issues, such as violating the write atomicity guarantee. >>>>>> >>>>>> > Second - If this happens it means that we’ve either lost base table >>>>>> > data or paxos metadata. If that happened, we could force a base table >>>>>> > update that rewrites the current base state with new timestamps making >>>>>> > the extra view data removable. However this wouldn’t fix the case >>>>>> > where the view cell has a timestamp from the future - although that’s >>>>>> > not a case that C* can fix today either. >>>>>> >>>>>> I don’t think the first or second option is ideal. We should treat the >>>>>> base table as the source of truth. Modifying it—or forcing an update on >>>>>> it, even if it’s just a timestamp change—is not a good approach and >>>>>> won’t solve all problems. >>>>>> >>>>>> > the idea to use anti-compaction makes a lot more sense now (in >>>>>> > principle - I don’t think it’s workable in practice) >>>>>> >>>>>> I have one question regarding anti-compaction. Is the main concern that >>>>>> processing too much data during anti-compaction could cause issues for >>>>>> the cluster? >>>>>> >>>>>> > I guess you could add some sort of assassin cell that is meant to >>>>>> > remove a cell with a specific timestamp and value, but is otherwise >>>>>> > invisible. >>>>>> >>>>>> The idea of the assassination cell is interesting. To prevent data from >>>>>> being incorrectly removed during the repair process, we need to ensure a >>>>>> quorum of nodes is available and agrees on the same value before >>>>>> repairing a materialized view (MV) row or cell. However, this could be >>>>>> very expensive, as it requires coordination to repair even a single row. >>>>>> >>>>>> I think there are a few key differences between MV repair and normal >>>>>> anti-entropy repair: >>>>>> >>>>>> 1. For normal anti-entropy repair, there is no single source of truth. >>>>>> All replicas act as sources of truth, and eventual consistency is >>>>>> achieved by merging data across them. Even if some data is lost or bugs >>>>>> result in future timestamps, the replicas will eventually converge on >>>>>> the same (possibly incorrect) value, and that value becomes the accepted >>>>>> truth. In contrast, MV repair relies on the base table as the source of >>>>>> truth and modifies the MV if inconsistencies are detected. This means >>>>>> that in some cases, simply merging data won’t resolve the issue, since >>>>>> Cassandra resolves conflicts using timestamps and there’s no guarantee >>>>>> the base table will always "win" unless we change the MV merging >>>>>> logic—which I’ve been trying to avoid. >>>>>> >>>>>> 2. 
I see MV repair as more of an on-demand operation, whereas normal >>>>>> anti-entropy repair needs to run regularly. This means we shouldn’t >>>>>> treat MV repair the same way as existing repairs. When an operator >>>>>> initiates MV repair, they need to ensure that sufficient resources are >>>>>> available to support it. >>>>>> >>>>>> >>>>>> On Thu, Jun 12, 2025 at 8:53 AM Blake Eggleston <bl...@ultrablake.com> >>>>>> wrote: >>>>>>> __ >>>>>>> Got it, thanks for clearing that up. I’d seen the strict liveness code >>>>>>> around but didn’t realize it was MV related and hadn’t dug into what it >>>>>>> did or how it worked. >>>>>>> >>>>>>> I think you’re right about the row liveness update working for extra >>>>>>> data with timestamps lower than the most recent base table update. >>>>>>> >>>>>>> I see what you mean about the timestamp from the future case. I thought >>>>>>> of 3 options: >>>>>>> >>>>>>> First - we interpret view data with higher timestamps than the base >>>>>>> table as data that’s missing from the base and replicate it into the >>>>>>> base table. The timestamp of the missing data may be below the paxos >>>>>>> timestamp low bound so we’d have to adjust the paxos coordination logic >>>>>>> to allow that in this case. Depending on how the view got this way it >>>>>>> may also tear writes to the base table, breaking the write atomicity >>>>>>> promise. >>>>>>> >>>>>>> Second - If this happens it means that we’ve either lost base table >>>>>>> data or paxos metadata. If that happened, we could force a base table >>>>>>> update that rewrites the current base state with new timestamps making >>>>>>> the extra view data removable. However this wouldn’t fix the case where >>>>>>> the view cell has a timestamp from the future - although that’s not a >>>>>>> case that C* can fix today either. >>>>>>> >>>>>>> Third - we add a new tombstone type or some mechanism to delete >>>>>>> specific cells that doesn’t preclude correct writes with lower >>>>>>> timestamps from being visible. I’m not sure how this would work, and >>>>>>> the idea to use anti-compaction makes a lot more sense now (in >>>>>>> principle - I don’t think it’s workable in practice). I guess you could >>>>>>> add some sort of assassin cell that is meant to remove a cell with a >>>>>>> specific timestamp and value, but is otherwise invisible. This seems >>>>>>> dangerous though, since it’s likely there’s a replication problem that >>>>>>> may resolve itself and the repair process would actually be removing >>>>>>> data that the user intended to write. >>>>>>> >>>>>>> Paulo - I don’t think storage changes are off the table, but they do >>>>>>> expand the scope and risk of the proposal, so we should be careful. >>>>>>> >>>>>>> On Wed, Jun 11, 2025, at 4:44 PM, Paulo Motta wrote: >>>>>>>> > I’m not sure if this is the only edge case—there may be other >>>>>>>> issues as well. I’m also unsure whether we should redesign the >>>>>>>> tombstone handling for MVs, since that would involve changes to the >>>>>>>> storage engine. To minimize impact there, the original proposal was to >>>>>>>> rebuild the affected ranges using anti-compaction, just to be safe. >>>>>>>> >>>>>>>> I haven't been following the discussion but I think one of the issues >>>>>>>> with the materialized view "strict liveness" fix[1] is that we avoided >>>>>>>> making invasive changes to the storage engine at the time, but this >>>>>>>> was considered by Zhao on [1]. 
I think we shouldn't be trying to avoid >>>>>>>> updates to the storage format as part of the MV implementation, if >>>>>>>> this is what it takes to make MVs V2 reliable. >>>>>>>> >>>>>>>> [1] - >>>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-11500?focusedCommentId=16101603&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16101603 >>>>>>>> >>>>>>>> On Wed, Jun 11, 2025 at 7:02 PM Runtian Liu <curly...@gmail.com> wrote: >>>>>>>>> The current design leverages strict liveness to shadow the old view >>>>>>>>> row. When the view-indexed value changes from 'a' to 'b', no >>>>>>>>> tombstone is written; instead, the old row is marked as expired by >>>>>>>>> updating its liveness info with the timestamp of the change. If the >>>>>>>>> column is later set back to 'a', the view row is re-inserted with a >>>>>>>>> new, non-expired liveness info reflecting the latest timestamp. >>>>>>>>> >>>>>>>>> To delete an extra row in the materialized view (MV), we can likely >>>>>>>>> use the same approach—marking it as shadowed by updating the liveness >>>>>>>>> info. However, in the case of inconsistencies where a column in the >>>>>>>>> MV has a higher timestamp than the corresponding column in the base >>>>>>>>> table, this row-level liveness mechanism is insufficient. >>>>>>>>> >>>>>>>>> Even for the case where we delete the row by marking its liveness >>>>>>>>> info as expired during repair, there are concerns. Since this >>>>>>>>> introduces a data mutation as part of the repair process, it’s >>>>>>>>> unclear whether there could be edge cases we’re missing. This >>>>>>>>> approach may risk unexpected side effects if the repair logic is not >>>>>>>>> carefully aligned with write path semantics. >>>>>>>>> >>>>>>>>> >>>>>>>>> On Thu, Jun 12, 2025 at 3:59 AM Blake Eggleston >>>>>>>>> <bl...@ultrablake.com> wrote: >>>>>>>>>> __ >>>>>>>>>> That’s a good point, although as described I don’t think that could >>>>>>>>>> ever work properly, even in normal operation. Either we’re >>>>>>>>>> misunderstanding something, or this is a flaw in the current MV >>>>>>>>>> design. >>>>>>>>>> >>>>>>>>>> Assuming changing the view indexed column results in a tombstone >>>>>>>>>> being applied to the view row for the previous value, if we wrote >>>>>>>>>> the other base columns (the non view indexed ones) to the view with >>>>>>>>>> the same timestamps they have on the base, then changing the view >>>>>>>>>> indexed value from ‘a’ to ‘b’, then back to ‘a’ would always cause >>>>>>>>>> this problem. I think you’d need to always update the column >>>>>>>>>> timestamps on the view to be >= the view column timestamp on the base >>>>>>>>>> >>>>>>>>>> On Tue, Jun 10, 2025, at 11:38 PM, Runtian Liu wrote: >>>>>>>>>>> > In the case of a missed update, we'll have a new value and we can >>>>>>>>>>> > send a tombstone to the view with the timestamp of the most >>>>>>>>>>> > recent update. >>>>>>>>>>> >>>>>>>>>>> > then something has gone wrong and we should issue a tombstone >>>>>>>>>>> > using the paxos repair timestamp as the tombstone timestamp. >>>>>>>>>>> >>>>>>>>>>> The current MV implementation uses “strict liveness” to determine >>>>>>>>>>> whether a row is live. I believe that using regular tombstones >>>>>>>>>>> during repair could cause problems. For example, consider a base >>>>>>>>>>> table with schema (pk, ck, v1, v2) and a materialized view with >>>>>>>>>>> schema (v1, pk, ck) -> v2. 
If, for some reason, we detect an extra >>>>>>>>>>> row in the MV and delete it using a tombstone with the latest >>>>>>>>>>> update timestamp, we may run into issues. Suppose we later update >>>>>>>>>>> the base table’s v1 field to match the MV row we previously >>>>>>>>>>> deleted, and the v2 value now has an older timestamp. In that case, >>>>>>>>>>> the previously issued tombstone could still shadow the v2 column, >>>>>>>>>>> which is unintended. That is why I was asking if we are going to >>>>>>>>>>> introduce a new kind of tombstones. I’m not sure if this is the >>>>>>>>>>> only edge case—there may be other issues as well. I’m also unsure >>>>>>>>>>> whether we should redesign the tombstone handling for MVs, since >>>>>>>>>>> that would involve changes to the storage engine. To minimize >>>>>>>>>>> impact there, the original proposal was to rebuild the affected >>>>>>>>>>> ranges using anti-compaction, just to be safe. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Wed, Jun 11, 2025 at 1:20 AM Blake Eggleston >>>>>>>>>>> <bl...@ultrablake.com> wrote: >>>>>>>>>>>> __ >>>>>>>>>>>>> Extra row in MV (assuming the tombstone is gone in the base >>>>>>>>>>>>> table) — how should we fix this? >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> This would mean that the base table had either updated or deleted >>>>>>>>>>>> a row and the view didn't receive the corresponding delete. >>>>>>>>>>>> >>>>>>>>>>>> In the case of a missed update, we'll have a new value and we can >>>>>>>>>>>> send a tombstone to the view with the timestamp of the most recent >>>>>>>>>>>> update. Since timestamps issued by paxos and accord writes are >>>>>>>>>>>> always increasing monotonically and don't have collisions, this is >>>>>>>>>>>> safe. >>>>>>>>>>>> >>>>>>>>>>>> In the case of a row deletion, we'd also want to send a tombstone >>>>>>>>>>>> with the same timestamp, however since tombstones can be purged, >>>>>>>>>>>> we may not have that information and would have to treat it like >>>>>>>>>>>> the view has a higher timestamp than the base table. >>>>>>>>>>>> >>>>>>>>>>>>> Inconsistency (timestamps don’t match) — it’s easy to fix when >>>>>>>>>>>>> the base table has higher timestamps, but how do we resolve it >>>>>>>>>>>>> when the MV columns have higher timestamps? >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> There are 2 ways this could happen. First is that a write failed >>>>>>>>>>>> and paxos repair hasn't completed it, which is expected, and the >>>>>>>>>>>> second is a replication bug or base table data loss. You'd need to >>>>>>>>>>>> compare the view timestamp to the paxos repair history to tell >>>>>>>>>>>> which it is. If the view timestamp is higher than the most recent >>>>>>>>>>>> paxos repair timestamp for the key, then it may just be a failed >>>>>>>>>>>> write and we should do nothing. If the view timestamp is less than >>>>>>>>>>>> the most recent paxos repair timestamp for that key and higher >>>>>>>>>>>> than the base timestamp, then something has gone wrong and we >>>>>>>>>>>> should issue a tombstone using the paxos repair timestamp as the >>>>>>>>>>>> tombstone timestamp. This is safe to do because the paxos repair >>>>>>>>>>>> timestamps act as a low bound for ballots paxos will process, so >>>>>>>>>>>> it wouldn't be possible for a legitimate write to be shadowed by >>>>>>>>>>>> this tombstone. >>>>>>>>>>>> >>>>>>>>>>>>> Do we need to introduce a new kind of tombstone to shadow the >>>>>>>>>>>>> rows in the MV for cases 2 and 3? If yes, how will this tombstone >>>>>>>>>>>>> work? 
If no, how should we fix the MV data? >>>>>>>>>>>>> >>>>>>>>>>>> No, a normal tombstone would work. >>>>>>>>>>>> On Tue, Jun 10, 2025, at 2:42 AM, Runtian Liu wrote: >>>>>>>>>>>>> Okay, let’s put the efficiency discussion on hold for now. I want >>>>>>>>>>>>> to make sure the actual repair process after detecting >>>>>>>>>>>>> inconsistencies will work with the index-based solution. >>>>>>>>>>>>> >>>>>>>>>>>>> When a mismatch is detected, the MV replica will need to stream >>>>>>>>>>>>> its index file to the base table replica. The base table will >>>>>>>>>>>>> then perform a comparison between the two files. >>>>>>>>>>>>> >>>>>>>>>>>>> There are three cases we need to handle: >>>>>>>>>>>>> >>>>>>>>>>>>> 1. Missing row in MV — this is straightforward; we can propagate >>>>>>>>>>>>> the data to the MV. >>>>>>>>>>>>> >>>>>>>>>>>>> 2. Extra row in MV (assuming the tombstone is gone in the base >>>>>>>>>>>>> table) — how should we fix this? >>>>>>>>>>>>> >>>>>>>>>>>>> 3. Inconsistency (timestamps don’t match) — it’s easy to fix >>>>>>>>>>>>> when the base table has higher timestamps, but how do we resolve >>>>>>>>>>>>> it when the MV columns have higher timestamps? >>>>>>>>>>>>> >>>>>>>>>>>>> Do we need to introduce a new kind of tombstone to shadow the >>>>>>>>>>>>> rows in the MV for cases 2 and 3? If yes, how will this tombstone >>>>>>>>>>>>> work? If no, how should we fix the MV data? >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Mon, Jun 9, 2025 at 11:00 AM Blake Eggleston >>>>>>>>>>>>> <bl...@ultrablake.com> wrote: >>>>>>>>>>>>>> __ >>>>>>>>>>>>>> > hopefully we can come up with a solution that everyone agrees >>>>>>>>>>>>>> > on. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I’m sure we can, I think we’ve been making good progress >>>>>>>>>>>>>> >>>>>>>>>>>>>> > My main concern with the index-based solution is the overhead >>>>>>>>>>>>>> > it adds to the hot path, as well as having to build indexes >>>>>>>>>>>>>> > periodically. >>>>>>>>>>>>>> >>>>>>>>>>>>>> So the additional overhead of maintaining a storage attached >>>>>>>>>>>>>> index on the client write path is pretty minimal - it’s >>>>>>>>>>>>>> basically adding data to an in memory trie. It’s a little extra >>>>>>>>>>>>>> work and memory usage, but there isn’t any extra io or other >>>>>>>>>>>>>> blocking associated with it. I’d expect the latency impact to be >>>>>>>>>>>>>> negligible. >>>>>>>>>>>>>> >>>>>>>>>>>>>> > As mentioned earlier, this MV repair should be an infrequent >>>>>>>>>>>>>> > operation >>>>>>>>>>>>>> >>>>>>>>>>>>>> I don’t think that’s a safe assumption. There are a lot of >>>>>>>>>>>>>> situations outside of data loss bugs where repair would need to >>>>>>>>>>>>>> be run. >>>>>>>>>>>>>> >>>>>>>>>>>>>> These use cases could probably be handled by repairing the view >>>>>>>>>>>>>> with other view replicas: >>>>>>>>>>>>>> >>>>>>>>>>>>>> Scrubbing corrupt sstables >>>>>>>>>>>>>> Node replacement via backup >>>>>>>>>>>>>> >>>>>>>>>>>>>> These use cases would need an actual MV repair to check >>>>>>>>>>>>>> consistency with the base table: >>>>>>>>>>>>>> >>>>>>>>>>>>>> Restoring a cluster from a backup >>>>>>>>>>>>>> Imported sstables via nodetool import >>>>>>>>>>>>>> Data loss from operator error >>>>>>>>>>>>>> Proactive consistency checks - i.e. preview repairs >>>>>>>>>>>>>> >>>>>>>>>>>>>> Even if it is an infrequent operation, when operators need it, >>>>>>>>>>>>>> it needs to be available and reliable. 
>>>>>>>>>>>>>> >>>>>>>>>>>>>> It’s a fact that there are clusters where non-incremental >>>>>>>>>>>>>> repairs are run on a cadence of a week or more to manage the >>>>>>>>>>>>>> overhead of validation compactions. Assuming the cluster doesn’t >>>>>>>>>>>>>> have any additional headroom, that would mean that any one of >>>>>>>>>>>>>> the above events could cause views to remain out of sync for up >>>>>>>>>>>>>> to a week while the full set of merkle trees is being built. >>>>>>>>>>>>>> >>>>>>>>>>>>>> This delay eliminates a lot of the value of repair as a risk >>>>>>>>>>>>>> mitigation tool. If I had to make a recommendation where a bad >>>>>>>>>>>>>> call could cost me my job, the prospect of a 7 day delay on >>>>>>>>>>>>>> repair would mean a strong no. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Some users also run preview repair continuously to detect data >>>>>>>>>>>>>> consistency errors, so at least a subset of users will probably >>>>>>>>>>>>>> be running MV repairs continuously - at least in preview mode. >>>>>>>>>>>>>> >>>>>>>>>>>>>> That’s why I say that the replication path should be designed to >>>>>>>>>>>>>> never need repair, and MV repair should be designed to be >>>>>>>>>>>>>> prepared for the worst. >>>>>>>>>>>>>> >>>>>>>>>>>>>> > I’m wondering if it’s possible to enable or disable index >>>>>>>>>>>>>> > building dynamically so that we don’t always incur the cost >>>>>>>>>>>>>> > for something that’s rarely needed. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I think this would be a really reasonable compromise as long as >>>>>>>>>>>>>> the default is on. That way it’s as safe as possible by default, >>>>>>>>>>>>>> but users who don’t care or have a separate system for repairing >>>>>>>>>>>>>> MVs can opt out. >>>>>>>>>>>>>> >>>>>>>>>>>>>> > I’m not sure what you mean by “data problems” here. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I mean out of sync views - either due to bugs, operator error, >>>>>>>>>>>>>> corruption, etc. >>>>>>>>>>>>>> >>>>>>>>>>>>>> > Also, this does scale with cluster size—I’ve compared it to >>>>>>>>>>>>>> > full repair, and this MV repair should behave similarly. That >>>>>>>>>>>>>> > means as long as full repair works, this repair should work as >>>>>>>>>>>>>> > well. >>>>>>>>>>>>>> >>>>>>>>>>>>>> You could build the merkle trees at about the same cost as a >>>>>>>>>>>>>> full repair, but the actual data repair path is completely >>>>>>>>>>>>>> different for MVs, and that’s the part that doesn’t scale well. >>>>>>>>>>>>>> As you know, with normal repair, we just stream data for ranges >>>>>>>>>>>>>> detected as out of sync. For MVs, since the data isn’t in base >>>>>>>>>>>>>> partition order, the view data for an out of sync view range >>>>>>>>>>>>>> needs to be read out and streamed to every base replica that >>>>>>>>>>>>>> it’s detected a mismatch against. So in the example I gave with >>>>>>>>>>>>>> the 300 node cluster, you’re looking at reading and transmitting >>>>>>>>>>>>>> the same partition at least 100 times in the best case, and the >>>>>>>>>>>>>> cost of this keeps going up as the cluster increases in size. >>>>>>>>>>>>>> That's the part that doesn't scale well. >>>>>>>>>>>>>> >>>>>>>>>>>>>> This is also one of the benefits of the index design. Since it >>>>>>>>>>>>>> stores data in segments that roughly correspond to points on the >>>>>>>>>>>>>> grid, you’re not rereading the same data over and over. 
A repair >>>>>>>>>>>>>> for a given grid point only reads an amount of data proportional >>>>>>>>>>>>>> to the data in common for the base/view grid point, and it’s >>>>>>>>>>>>>> stored in a small enough granularity that the base can calculate >>>>>>>>>>>>>> what data needs to be sent to the view without having to read >>>>>>>>>>>>>> the entire view partition. >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Sat, Jun 7, 2025, at 7:42 PM, Runtian Liu wrote: >>>>>>>>>>>>>>> Thanks, Blake. I’m open to iterating on the design, and >>>>>>>>>>>>>>> hopefully we can come up with a solution that everyone agrees >>>>>>>>>>>>>>> on. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> My main concern with the index-based solution is the overhead >>>>>>>>>>>>>>> it adds to the hot path, as well as having to build indexes >>>>>>>>>>>>>>> periodically. As mentioned earlier, this MV repair should be an >>>>>>>>>>>>>>> infrequent operation, but the index-based approach shifts some >>>>>>>>>>>>>>> of the work to the hot path in order to allow repairs that >>>>>>>>>>>>>>> touch only a few nodes. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I’m wondering if it’s possible to enable or disable index >>>>>>>>>>>>>>> building dynamically so that we don’t always incur the cost for >>>>>>>>>>>>>>> something that’s rarely needed. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> > it degrades operators ability to react to data problems by >>>>>>>>>>>>>>> > imposing a significant upfront processing burden on repair, >>>>>>>>>>>>>>> > and that it doesn’t scale well with cluster size >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I’m not sure what you mean by “data problems” here. Also, this >>>>>>>>>>>>>>> does scale with cluster size—I’ve compared it to full repair, >>>>>>>>>>>>>>> and this MV repair should behave similarly. That means as long >>>>>>>>>>>>>>> as full repair works, this repair should work as well. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> For example, regardless of how large the cluster is, you can >>>>>>>>>>>>>>> always enable Merkle tree building on 10% of the nodes at a >>>>>>>>>>>>>>> time until all the trees are ready. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I understand that coordinating this type of repair is harder >>>>>>>>>>>>>>> than what we currently support, but with CEP-37, we should be >>>>>>>>>>>>>>> able to handle this coordination without adding too much burden >>>>>>>>>>>>>>> on the operator side. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Sat, Jun 7, 2025 at 8:28 AM Blake Eggleston >>>>>>>>>>>>>>> <bl...@ultrablake.com> wrote: >>>>>>>>>>>>>>>> __ >>>>>>>>>>>>>>>>> I don't see any outcome here that is good for the community >>>>>>>>>>>>>>>>> though. Either Runtian caves and adopts your design that he >>>>>>>>>>>>>>>>> (and I) consider inferior, or he is prevented from >>>>>>>>>>>>>>>>> contributing this work. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hey Runtian, fwiw, these aren't the only 2 options. This isn’t >>>>>>>>>>>>>>>> a competition. We can collaborate and figure out the best >>>>>>>>>>>>>>>> approach to the problem. I’d like to keep discussing it if >>>>>>>>>>>>>>>> you’re open to iterating on the design. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I’m not married to our proposal, it’s just the cleanest way we >>>>>>>>>>>>>>>> could think of to address what Jon and I both see as blockers >>>>>>>>>>>>>>>> in the current proposal. It’s not set in stone though. 
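As a side note on the base-range/view-range "grid" discussed above, here is a minimal, illustrative sketch of how a single row maps to a cell of that grid; the modulo bucketing is only a stand-in for real token ranges, and none of these names come from the CEP or the index implementation.

```java
import java.util.Objects;

/**
 * Illustrative sketch only. The "grid" pairs one base-table token range with one
 * view token range; if the per-cell index segments or digests disagree for cell
 * (b, v), only rows whose base key falls in range b AND whose view key falls in
 * range v need to be re-read, instead of rereading whole view partitions.
 */
public final class GridCell
{
    final int baseBucket;
    final int viewBucket;

    GridCell(int baseBucket, int viewBucket)
    {
        this.baseBucket = baseBucket;
        this.viewBucket = viewBucket;
    }

    /** Map a row to its cell, given how many ranges the base and view rings are split into. */
    static GridCell of(Object baseKey, Object viewKey, int baseRanges, int viewRanges)
    {
        return new GridCell(Math.floorMod(Objects.hashCode(baseKey), baseRanges),
                            Math.floorMod(Objects.hashCode(viewKey), viewRanges));
    }

    @Override
    public String toString()
    {
        return "cell(base=" + baseBucket + ", view=" + viewBucket + ")";
    }
}
```

Losing a single view sstable then dirties an entire row of cells for that view range, which is the 300-node scenario discussed further down the thread.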
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Fri, Jun 6, 2025, at 1:32 PM, Benedict Elliott Smith wrote: >>>>>>>>>>>>>>>>> Hmm, I am very surprised as I helped write that and I >>>>>>>>>>>>>>>>> distinctly recall a specific goal was avoiding binding vetoes >>>>>>>>>>>>>>>>> as they're so toxic. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Ok, I guess you can block this work if you like. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I don't see any outcome here that is good for the community >>>>>>>>>>>>>>>>> though. Either Runtian caves and adopts your design that he >>>>>>>>>>>>>>>>> (and I) consider inferior, or he is prevented from >>>>>>>>>>>>>>>>> contributing this work. That isn't a functioning community in >>>>>>>>>>>>>>>>> my mind, so I'll be noping out for a while, as I don't see >>>>>>>>>>>>>>>>> much value here right now. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On 2025/06/06 18:31:08 Blake Eggleston wrote: >>>>>>>>>>>>>>>>> > Hi Benedict, that’s actually not true. >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > Here’s a link to the project governance page: >>>>>>>>>>>>>>>>> > _https://cwiki.apache.org/confluence/display/CASSANDRA/Cassandra+Project+Governance_ >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > The CEP section says: >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > “*Once the proposal is finalized and any major committer >>>>>>>>>>>>>>>>> > dissent reconciled, call a [VOTE] on the ML to have the >>>>>>>>>>>>>>>>> > proposal adopted. The criteria for acceptance is consensus >>>>>>>>>>>>>>>>> > (3 binding +1 votes and no binding vetoes). The vote should >>>>>>>>>>>>>>>>> > remain open for 72 hours.*” >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > So they’re definitely vetoable. >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > Also note the part about “*Once the proposal is finalized >>>>>>>>>>>>>>>>> > and any major committer dissent reconciled,*” being a >>>>>>>>>>>>>>>>> > prerequisite for moving a CEP to [VOTE]. Given the as yet >>>>>>>>>>>>>>>>> > unreconciled committer dissent, it wouldn’t even be >>>>>>>>>>>>>>>>> > appropriate to move to a VOTE until we get to the bottom of >>>>>>>>>>>>>>>>> > this repair discussion. >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > On Fri, Jun 6, 2025, at 12:31 AM, Benedict Elliott Smith >>>>>>>>>>>>>>>>> > wrote: >>>>>>>>>>>>>>>>> > > > but the snapshot repair design is not a viable path >>>>>>>>>>>>>>>>> > > > forward. It’s the first iteration of a repair design. >>>>>>>>>>>>>>>>> > > > We’ve proposed a second iteration, and we’re open to a >>>>>>>>>>>>>>>>> > > > third iteration. >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> > > I shan't be participating further in discussion, but I >>>>>>>>>>>>>>>>> > > want to make a point of order. The CEP process has no >>>>>>>>>>>>>>>>> > > vetoes, so you are not empowered to declare that a design >>>>>>>>>>>>>>>>> > > is not viable without the input of the wider community. >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> > > On 2025/06/05 03:58:59 Blake Eggleston wrote: >>>>>>>>>>>>>>>>> > > > You can detect and fix the mismatch in a single round >>>>>>>>>>>>>>>>> > > > of repair, but the amount of work needed to do it is >>>>>>>>>>>>>>>>> > > > _significantly_ higher with snapshot repair. Consider a >>>>>>>>>>>>>>>>> > > > case where we have a 300 node cluster w/ RF 3, where >>>>>>>>>>>>>>>>> > > > each view partition contains entries mapping to every >>>>>>>>>>>>>>>>> > > > token range in the cluster - so 100 ranges. If we lose >>>>>>>>>>>>>>>>> > > > a view sstable, it will affect an entire row/column of >>>>>>>>>>>>>>>>> > > > the grid. 
Repair is going to scan all data in the >>>>>>>>>>>>>>>>> > > > mismatching view token ranges 100 times, and each base >>>>>>>>>>>>>>>>> > > > range once. So you’re looking at 200 range scans. >>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>> > > > Now, you may argue that you can merge the duplicate >>>>>>>>>>>>>>>>> > > > view scans into a single scan while you repair all >>>>>>>>>>>>>>>>> > > > token ranges in parallel. I’m skeptical that’s going to >>>>>>>>>>>>>>>>> > > > be achievable in practice, but even if it is, we’re now >>>>>>>>>>>>>>>>> > > > talking about the view replica hypothetically doing a >>>>>>>>>>>>>>>>> > > > pairwise repair with every other replica in the cluster >>>>>>>>>>>>>>>>> > > > at the same time. Neither of these options is workable. >>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>> > > > Let’s take a step back though, because I think we’re >>>>>>>>>>>>>>>>> > > > getting lost in the weeds. >>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>> > > > The repair design in the CEP has some high level >>>>>>>>>>>>>>>>> > > > concepts that make a lot of sense, the idea of >>>>>>>>>>>>>>>>> > > > repairing a grid is really smart. However, it has some >>>>>>>>>>>>>>>>> > > > significant drawbacks that remain unaddressed. I want >>>>>>>>>>>>>>>>> > > > this CEP to succeed, and I know Jon does too, but the >>>>>>>>>>>>>>>>> > > > snapshot repair design is not a viable path forward. >>>>>>>>>>>>>>>>> > > > It’s the first iteration of a repair design. We’ve >>>>>>>>>>>>>>>>> > > > proposed a second iteration, and we’re open to a third >>>>>>>>>>>>>>>>> > > > iteration. This part of the CEP process is meant to >>>>>>>>>>>>>>>>> > > > identify and address shortcomings, I don’t think that >>>>>>>>>>>>>>>>> > > > continuing to dissect the snapshot repair design is >>>>>>>>>>>>>>>>> > > > making progress in that direction. >>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>> > > > On Wed, Jun 4, 2025, at 2:04 PM, Runtian Liu wrote: >>>>>>>>>>>>>>>>> > > > > > We potentially have to do it several times on each >>>>>>>>>>>>>>>>> > > > > > node, depending on the size of the range. Smaller >>>>>>>>>>>>>>>>> > > > > > ranges increase the size of the board >>>>>>>>>>>>>>>>> > > > > > exponentially, larger ranges increase the number of >>>>>>>>>>>>>>>>> > > > > > SSTables that would be involved in each compaction. >>>>>>>>>>>>>>>>> > > > > As described in the CEP example, this can be handled >>>>>>>>>>>>>>>>> > > > > in a single round of repair. We first identify all >>>>>>>>>>>>>>>>> > > > > the points in the grid that require repair, then >>>>>>>>>>>>>>>>> > > > > perform anti-compaction and stream data based on a >>>>>>>>>>>>>>>>> > > > > second scan over those identified points. This >>>>>>>>>>>>>>>>> > > > > applies to the snapshot-based solution—without an >>>>>>>>>>>>>>>>> > > > > index, repairing a single point in that grid requires >>>>>>>>>>>>>>>>> > > > > scanning the entire base table partition (token >>>>>>>>>>>>>>>>> > > > > range). In contrast, with the index-based solution—as >>>>>>>>>>>>>>>>> > > > > in the example you referenced—if a large block of >>>>>>>>>>>>>>>>> > > > > data is corrupted, even though the index is used for >>>>>>>>>>>>>>>>> > > > > comparison, many key mismatches may occur. This can >>>>>>>>>>>>>>>>> > > > > lead to random disk access to the original data >>>>>>>>>>>>>>>>> > > > > files, which could cause performance issues. 
For the >>>>>>>>>>>>>>>>> > > > > case you mentioned for snapshot based solution, it >>>>>>>>>>>>>>>>> > > > > should not take months to repair all the data, >>>>>>>>>>>>>>>>> > > > > instead one round of repair should be enough. The >>>>>>>>>>>>>>>>> > > > > actual repair phase is split from the detection phase. >>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>> > > > > On Thu, Jun 5, 2025 at 12:12 AM Jon Haddad >>>>>>>>>>>>>>>>> > > > > <j...@rustyrazorblade.com> wrote: >>>>>>>>>>>>>>>>> > > > >> > This isn’t really the whole story. The amount of >>>>>>>>>>>>>>>>> > > > >> > wasted scans on index repairs is negligible. If a >>>>>>>>>>>>>>>>> > > > >> > difference is detected with snapshot repairs >>>>>>>>>>>>>>>>> > > > >> > though, you have to read the entire partition from >>>>>>>>>>>>>>>>> > > > >> > both the view and base table to calculate what >>>>>>>>>>>>>>>>> > > > >> > needs to be fixed. >>>>>>>>>>>>>>>>> > > > >> >>>>>>>>>>>>>>>>> > > > >> You nailed it. >>>>>>>>>>>>>>>>> > > > >> >>>>>>>>>>>>>>>>> > > > >> When the base table is converted to a view, and sent >>>>>>>>>>>>>>>>> > > > >> to the view, the information we have is that one of >>>>>>>>>>>>>>>>> > > > >> the view's partition keys needs a repair. That's >>>>>>>>>>>>>>>>> > > > >> going to be different from the partition key of the >>>>>>>>>>>>>>>>> > > > >> base table. As a result, on the base table, for >>>>>>>>>>>>>>>>> > > > >> each affected range, we'd have to issue another >>>>>>>>>>>>>>>>> > > > >> compaction across the entire set of sstables that >>>>>>>>>>>>>>>>> > > > >> could have the data the view needs (potentially many >>>>>>>>>>>>>>>>> > > > >> GB), in order to send over the corrected version of >>>>>>>>>>>>>>>>> > > > >> the partition, then send it over to the view. >>>>>>>>>>>>>>>>> > > > >> Without an index in place, we have to do yet another >>>>>>>>>>>>>>>>> > > > >> scan, per-affected range. >>>>>>>>>>>>>>>>> > > > >> >>>>>>>>>>>>>>>>> > > > >> Consider the case of a single corrupted SSTable on >>>>>>>>>>>>>>>>> > > > >> the view that's removed from the filesystem, or the >>>>>>>>>>>>>>>>> > > > >> data is simply missing after being restored from an >>>>>>>>>>>>>>>>> > > > >> inconsistent backup. It presumably contains lots of >>>>>>>>>>>>>>>>> > > > >> partitions, which maps to base partitions all over >>>>>>>>>>>>>>>>> > > > >> the cluster, in a lot of different token ranges. >>>>>>>>>>>>>>>>> > > > >> For every one of those ranges (hundreds, to tens of >>>>>>>>>>>>>>>>> > > > >> thousands of them given the checkerboard design), >>>>>>>>>>>>>>>>> > > > >> when finding the missing data in the base, you'll >>>>>>>>>>>>>>>>> > > > >> have to perform a compaction across all the SSTables >>>>>>>>>>>>>>>>> > > > >> that potentially contain the missing data just to >>>>>>>>>>>>>>>>> > > > >> rebuild the view-oriented partitions that need to be >>>>>>>>>>>>>>>>> > > > >> sent to the view. The complexity of this operation >>>>>>>>>>>>>>>>> > > > >> can be looked at as O(N*M) where N and M are the >>>>>>>>>>>>>>>>> > > > >> number of ranges in the base table and the view >>>>>>>>>>>>>>>>> > > > >> affected by the corruption, respectively. Without >>>>>>>>>>>>>>>>> > > > >> an index in place, finding the missing data is very >>>>>>>>>>>>>>>>> > > > >> expensive. We potentially have to do it several >>>>>>>>>>>>>>>>> > > > >> times on each node, depending on the size of the >>>>>>>>>>>>>>>>> > > > >> range. 
Smaller ranges increase the size of the >>>>>>>>>>>>>>>>> > > > >> board exponentially, larger ranges increase the >>>>>>>>>>>>>>>>> > > > >> number of SSTables that would be involved in each >>>>>>>>>>>>>>>>> > > > >> compaction. >>>>>>>>>>>>>>>>> > > > >> >>>>>>>>>>>>>>>>> > > > >> Then you send that data over to the view, the view >>>>>>>>>>>>>>>>> > > > >> does it's anti-compaction thing, again, once per >>>>>>>>>>>>>>>>> > > > >> affected range. So now the view has to do an >>>>>>>>>>>>>>>>> > > > >> anti-compaction once per block on the board that's >>>>>>>>>>>>>>>>> > > > >> affected by the missing data. >>>>>>>>>>>>>>>>> > > > >> >>>>>>>>>>>>>>>>> > > > >> Doing hundreds or thousands of these will add up >>>>>>>>>>>>>>>>> > > > >> pretty quickly. >>>>>>>>>>>>>>>>> > > > >> >>>>>>>>>>>>>>>>> > > > >> When I said that a repair could take months, this is >>>>>>>>>>>>>>>>> > > > >> what I had in mind. >>>>>>>>>>>>>>>>> > > > >> >>>>>>>>>>>>>>>>> > > > >> >>>>>>>>>>>>>>>>> > > > >> >>>>>>>>>>>>>>>>> > > > >> >>>>>>>>>>>>>>>>> > > > >> On Tue, Jun 3, 2025 at 11:10 AM Blake Eggleston >>>>>>>>>>>>>>>>> > > > >> <bl...@ultrablake.com> wrote: >>>>>>>>>>>>>>>>> > > > >>> __ >>>>>>>>>>>>>>>>> > > > >>> > Adds overhead in the hot path due to maintaining >>>>>>>>>>>>>>>>> > > > >>> > indexes. Extra memory needed during write path >>>>>>>>>>>>>>>>> > > > >>> > and compaction. >>>>>>>>>>>>>>>>> > > > >>> >>>>>>>>>>>>>>>>> > > > >>> I’d make the same argument about the overhead of >>>>>>>>>>>>>>>>> > > > >>> maintaining the index that Jon just made about the >>>>>>>>>>>>>>>>> > > > >>> disk space required. The relatively predictable >>>>>>>>>>>>>>>>> > > > >>> overhead of maintaining the index as part of the >>>>>>>>>>>>>>>>> > > > >>> write and compaction paths is a pro, not a con. >>>>>>>>>>>>>>>>> > > > >>> Although you’re not always paying the cost of >>>>>>>>>>>>>>>>> > > > >>> building a merkle tree with snapshot repair, it can >>>>>>>>>>>>>>>>> > > > >>> impact the hot path and you do have to plan for it. >>>>>>>>>>>>>>>>> > > > >>> >>>>>>>>>>>>>>>>> > > > >>> > Verifies index content, not actual data—may miss >>>>>>>>>>>>>>>>> > > > >>> > low-probability errors like bit flips >>>>>>>>>>>>>>>>> > > > >>> >>>>>>>>>>>>>>>>> > > > >>> Presumably this could be handled by the views >>>>>>>>>>>>>>>>> > > > >>> performing repair against each other? You could >>>>>>>>>>>>>>>>> > > > >>> also periodically rebuild the index or perform >>>>>>>>>>>>>>>>> > > > >>> checksums against the sstable content. >>>>>>>>>>>>>>>>> > > > >>> >>>>>>>>>>>>>>>>> > > > >>> > Extra data scan during inconsistency detection >>>>>>>>>>>>>>>>> > > > >>> > Index: Since the data covered by certain indexes >>>>>>>>>>>>>>>>> > > > >>> > is not guaranteed to be fully contained within a >>>>>>>>>>>>>>>>> > > > >>> > single node as the topology changes, some data >>>>>>>>>>>>>>>>> > > > >>> > scans may be wasted. >>>>>>>>>>>>>>>>> > > > >>> > Snapshots: No extra data scan >>>>>>>>>>>>>>>>> > > > >>> >>>>>>>>>>>>>>>>> > > > >>> This isn’t really the whole story. The amount of >>>>>>>>>>>>>>>>> > > > >>> wasted scans on index repairs is negligible. If a >>>>>>>>>>>>>>>>> > > > >>> difference is detected with snapshot repairs >>>>>>>>>>>>>>>>> > > > >>> though, you have to read the entire partition from >>>>>>>>>>>>>>>>> > > > >>> both the view and base table to calculate what >>>>>>>>>>>>>>>>> > > > >>> needs to be fixed. 
>>>>>>>>>>>>>>>>> > > > >>> >>>>>>>>>>>>>>>>> > > > >>> On Tue, Jun 3, 2025, at 10:27 AM, Jon Haddad wrote: >>>>>>>>>>>>>>>>> > > > >>>> One practical aspect that isn't immediately >>>>>>>>>>>>>>>>> > > > >>>> obvious is the disk space consideration for >>>>>>>>>>>>>>>>> > > > >>>> snapshots. >>>>>>>>>>>>>>>>> > > > >>>> >>>>>>>>>>>>>>>>> > > > >>>> When you have a table with a mixed workload using >>>>>>>>>>>>>>>>> > > > >>>> LCS or UCS with scaling parameters like L10 and >>>>>>>>>>>>>>>>> > > > >>>> initiate a repair, the disk usage will increase as >>>>>>>>>>>>>>>>> > > > >>>> long as the snapshot persists and the table >>>>>>>>>>>>>>>>> > > > >>>> continues to receive writes. This aspect is >>>>>>>>>>>>>>>>> > > > >>>> understood and factored into the design. >>>>>>>>>>>>>>>>> > > > >>>> >>>>>>>>>>>>>>>>> > > > >>>> However, a more nuanced point is the necessity to >>>>>>>>>>>>>>>>> > > > >>>> maintain sufficient disk headroom specifically for >>>>>>>>>>>>>>>>> > > > >>>> running repairs. This echoes the challenge with >>>>>>>>>>>>>>>>> > > > >>>> STCS compaction, where enough space must be >>>>>>>>>>>>>>>>> > > > >>>> available to accommodate the largest SSTables, >>>>>>>>>>>>>>>>> > > > >>>> even when they are not being actively compacted. >>>>>>>>>>>>>>>>> > > > >>>> >>>>>>>>>>>>>>>>> > > > >>>> For example, if a repair involves rewriting 100GB >>>>>>>>>>>>>>>>> > > > >>>> of SSTable data, you'll consistently need to >>>>>>>>>>>>>>>>> > > > >>>> reserve 100GB of free space to facilitate this. >>>>>>>>>>>>>>>>> > > > >>>> >>>>>>>>>>>>>>>>> > > > >>>> Therefore, while the snapshot-based approach leads >>>>>>>>>>>>>>>>> > > > >>>> to variable disk space utilization, operators must >>>>>>>>>>>>>>>>> > > > >>>> provision storage as if the maximum potential >>>>>>>>>>>>>>>>> > > > >>>> space will be used at all times to ensure repairs >>>>>>>>>>>>>>>>> > > > >>>> can be executed. >>>>>>>>>>>>>>>>> > > > >>>> >>>>>>>>>>>>>>>>> > > > >>>> This introduces a rate of churn dynamic, where the >>>>>>>>>>>>>>>>> > > > >>>> write throughput dictates the required extra disk >>>>>>>>>>>>>>>>> > > > >>>> space, rather than the existing on-disk data >>>>>>>>>>>>>>>>> > > > >>>> volume. >>>>>>>>>>>>>>>>> > > > >>>> >>>>>>>>>>>>>>>>> > > > >>>> If 50% of your SSTables are rewritten during a >>>>>>>>>>>>>>>>> > > > >>>> snapshot, you would need 50% free disk space. >>>>>>>>>>>>>>>>> > > > >>>> Depending on the workload, the snapshot method >>>>>>>>>>>>>>>>> > > > >>>> could consume significantly more disk space than >>>>>>>>>>>>>>>>> > > > >>>> an index-based approach. Conversely, for >>>>>>>>>>>>>>>>> > > > >>>> relatively static workloads, the index method >>>>>>>>>>>>>>>>> > > > >>>> might require more space. It's not as >>>>>>>>>>>>>>>>> > > > >>>> straightforward as stating "No extra disk space >>>>>>>>>>>>>>>>> > > > >>>> needed". >>>>>>>>>>>>>>>>> > > > >>>> >>>>>>>>>>>>>>>>> > > > >>>> Jon >>>>>>>>>>>>>>>>> > > > >>>> >>>>>>>>>>>>>>>>> > > > >>>> On Mon, Jun 2, 2025 at 2:49 PM Runtian Liu >>>>>>>>>>>>>>>>> > > > >>>> <curly...@gmail.com> wrote: >>>>>>>>>>>>>>>>> > > > >>>>> > Regarding your comparison between approaches, I >>>>>>>>>>>>>>>>> > > > >>>>> > think you also need to take into account the >>>>>>>>>>>>>>>>> > > > >>>>> > other dimensions that have been brought up in >>>>>>>>>>>>>>>>> > > > >>>>> > this thread. Things like minimum repair times >>>>>>>>>>>>>>>>> > > > >>>>> > and vulnerability to outages and topology >>>>>>>>>>>>>>>>> > > > >>>>> > changes are the first that come to mind. 
>>>>>>>>>>>>>>>>> > > > >>>>> Sure, I added a few more points.
>>>>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>>>>> > > > >>>>> *Perspective* | *Index-Based Solution* | *Snapshot-Based Solution*
>>>>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>>>>> > > > >>>>> 1. Hot path overhead | Index: Adds overhead in the hot path due to maintaining indexes. Extra memory needed during write path and compaction. | Snapshot: No impact on the hot path
>>>>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>>>>> > > > >>>>> 2. Extra disk usage when repair is not running | Index: Requires additional disk space to store persistent indexes | Snapshot: No extra disk space needed
>>>>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>>>>> > > > >>>>> 3. Extra disk usage during repair | Index: Minimal or no additional disk usage | Snapshot: Requires additional disk space for snapshots
>>>>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>>>>> > > > >>>>> 4. Fine-grained repair to deal with emergency situations / topology changes | Index: Supports fine-grained repairs by targeting specific index ranges. This allows repair to be retried on smaller data sets, enabling incremental progress when repairing the entire table. This is especially helpful when there are down nodes or topology changes during repair, which are common in day-to-day operations. | Snapshot: Coordination across all nodes is required over a long period of time. For each round of repair, if all replica nodes are down or if there is a topology change, the data ranges that were not covered will need to be repaired in the next round.
>>>>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>>>>> > > > >>>>> 5. Validating data used in reads directly | Index: Verifies index content, not actual data—may miss low-probability errors like bit flips | Snapshot: Verifies actual data content, providing stronger correctness guarantees
>>>>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>>>>> > > > >>>>> 6. Extra data scan during inconsistency detection | Index: Since the data covered by certain indexes is not guaranteed to be fully contained within a single node as the topology changes, some data scans may be wasted. | Snapshot: No extra data scan
>>>>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>>>>> > > > >>>>> 7. The overhead of actual data repair after an inconsistency is detected | Index: Only indexes are streamed to the base table node, and the actual data being fixed can be as accurate as the row level. | Snapshot: Anti-compaction is needed on the MV table, and potential over-streaming may occur due to the lack of row-level insight into data quality.
>>>>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>>>>> > > > >>>>> > one of my biggest concerns I haven't seen discussed much is LOCAL_SERIAL/SERIAL on read
>>>>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>>>>> > > > >>>>> Paxos v2 introduces an optimization where serial reads can be completed in just one round trip, reducing latency compared to traditional Paxos which may require multiple phases.
>>>>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>>>>> > > > >>>>> > I think a refresh would be low-cost and give users the flexibility to run them however they want.
>>>>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>>>>> > > > >>>>> I think this is an interesting idea. Does it suggest that the MV should be rebuilt on a regular schedule? It sounds like an extension of the snapshot-based approach—rather than detecting mismatches, we would periodically reconstruct a clean version of the MV based on the snapshot. This seems to diverge from the current MV model in Cassandra, where consistency between the MV and base table must be maintained continuously. This could be an extension of the CEP-48 work, where the MV is periodically rebuilt from a snapshot of the base table, assuming the user can tolerate some level of staleness in the MV data.
>>>>>>>>>>>>>>>>> > > > >>> >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>> >>>>>>> >>>>> >>>