> What if the CEP includes an interface for MV repair that calls out to some
> user pluggable solution and the spark-based solution you've developed is
> the first / only reference solution available at the outset? That way we
> could integrate it into the control plane (nodetool, JMX, a post CEP-37
> world), have a solution available for those comfortable taking on the spark
> dependency, and have a clear paved path for future anti-entropy work on
> where to fit into the ecosystem.

It is acceptable to include an orchestration mechanism as a stretch goal of
this CEP, which would provide the following capabilities:

   1. Trigger MV Repair: Invoke a pluggable JAR to start the MV repair
      process via a method such as:

      String invokeMVRepair(String keyspace, String baseTable, String mvTable,
      List<Object> options)

   2. Track Job Status: Expose an API to query the status of the repair job:

      JobStatus jobStatus(String id)
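
To make the shape of this concrete, a rough Java sketch of such a pluggable
interface could look like the following (the interface name, the JobStatus
values, and the no-op class are illustrative assumptions, not part of the
CEP):

    import java.util.List;

    // Hypothetical sketch only; everything beyond the two method signatures
    // quoted above is illustrative.
    public interface MVRepairOrchestrator
    {
        // API #1: invoke a pluggable JAR / external tool to start the MV
        // repair process, returning an identifier for the submitted job.
        String invokeMVRepair(String keyspace, String baseTable, String mvTable,
                              List<Object> options);

        // API #2: query the status of a previously submitted repair job.
        JobStatus jobStatus(String id);

        // Illustrative status values; the real set would be defined by the CEP.
        enum JobStatus { UNKNOWN, PENDING, RUNNING, SUCCEEDED, FAILED }

        // A possible default: a no-op implementation, so users opt in by
        // plugging in their own orchestrator (e.g. one that submits a Spark job).
        class NoOp implements MVRepairOrchestrator
        {
            public String invokeMVRepair(String keyspace, String baseTable,
                                         String mvTable, List<Object> options)
            {
                return "noop";
            }

            public JobStatus jobStatus(String id)
            {
                return JobStatus.UNKNOWN;
            }
        }
    }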

This orchestration framework could be built on top of the existing
AutoRepair infrastructure from CEP-37, and extended within Cassandra to
manage MV-specific repair workflows. It would process one or more MVs at a
time by invoking the repair via API #1 and tracking the status via API #2. The
default implementation can be a noop, allowing users to plug in their own
logic. This design leaves room for integration with external repair
solutions, such as the cassandra-mv-repair-spark-job
<https://github.com/jaydeepkumar1984/cassandra-mv-repair-spark-job>, or for
users to build custom repair mechanisms.
However, it's important to note that the cassandra-mv-repair-spark-job
<https://github.com/jaydeepkumar1984/cassandra-mv-repair-spark-job> itself
is explicitly out of scope for this CEP, as it has already been addressed
in a separate discussion thread
<https://lists.apache.org/thread/d3qo3vjxn4116htf175yzcg94s6jq07d>.

Jaydeep

On Fri, Aug 1, 2025 at 6:42 AM Josh McKenzie <jmcken...@apache.org> wrote:

> Definitely want to avoid scope creep, *however*... ;)
>
> What if the CEP includes an interface for MV repair that calls out to some
> user pluggable solution and the spark-based solution you've developed is
> the first / only reference solution available at the outset? That way we
> could integrate it into the control plane (nodetool, JMX, a post CEP-37
> world), have a solution available for those comfortable taking on the spark
> dependency, and have a clear paved path for future anti-entropy work on
> where to fit into the ecosystem.
>
> On Thu, Jul 31, 2025, at 5:20 PM, Runtian Liu wrote:
>
> Based on our discussion, it seems we’ve reached broad agreement on the
> hot‑path optimization for strict MV consistency mode. We still have
> concerns about the different approaches for the repair path. I’ve updated
> the CEP’s scope and retitled it *‘Cassandra Materialized View
> Consistency, Reliability & Backfill Enhancement’.* I removed the repair
> section and added a backfill section that proposes an optimized strategy to
> backfill MVs from the base table. Please take a look and share your
> thoughts. I believe this backfill strategy will also benefit future MV
> repair work.
>
> Regarding repair,
>
> While repair is essential to maintain Materialized View (MV) consistency
> in the face of one-off bugs or bit rot, implementing a robust, native
> repair mechanism within Apache Cassandra remains a highly complex and still
> largely theoretical challenge, irrespective of the strategies considered in
> prior discussions.
>
> To ensure steady progress, we propose addressing the problem
> incrementally. This CEP focuses on making the hot-path & backfill reliable,
> laying a solid foundation. A future CEP can then address repair mechanisms
> within the Apache Cassandra ecosystem in a more focused and well-scoped
> manner.
> For now, the proposed CEP clarifies that users may choose to either
> evaluate the external Spark-based repair tool
> <https://github.com/jaydeepkumar1984/cassandra-mv-repair-spark-job>
> (which is not part of this proposal) or develop their own repair solution
> tailored to their operational needs.
>
> On Thu, Jul 3, 2025 at 4:32 PM Runtian Liu <curly...@gmail.com> wrote:
>
> I believe that, with careful design, we could make the row-level repair
> work using the index-based approach.
> However, regarding the CEP-proposed solution, I want to highlight that the
> entire repair job can be divided into two parts:
>
>    1. Inconsistency detection
>    2. Rebuilding the inconsistent ranges
>
> The second step could look like this:
>
> nodetool mv_rebuild --base_table_range <b_start,b_end> --mv_ranges
> [<mv_start1, mv_end1>, <mv_start2, mv_end2>, ...]
>
>
> The base table range and MV ranges would come from the first step. But
> it’s also possible to run this rebuild directly—without detecting
> inconsistencies—if we already know that some nodes need repair. This means
> we don’t have to wait for the snapshot processing to start the repair.
> Snapshot-based detection works well when the cluster is generally healthy.
>
> Compared to the current rebuild, the proposed approach doesn’t require
> dropping the MV and rebuilding it from scratch across the entire
> cluster—which is a major blocker for production environments.
>
> That said, for the index-based approach, I still think it introduces
> additional load on both the write path and compaction. This increases the
> resource cost of using MVs. While it’s a nice feature to enable row-level
> MV repair, its implementation is complex. On the other hand, the original
> proposal’s rebuild strategy is more versatile and likely to perform
> better—especially when only a few nodes need repair. In contrast,
> large-scale row-level comparisons via the index-based approach could be
> prohibitively expensive. Just to clarify my intention here: this is not a
> complete 'no' to the index-based approach. However, for the initial
> version, I believe it's more prudent to avoid impacting the hot path. We
> can start with a simpler design that keeps the critical path untouched, and
> then iterate to make it more robust over time—much like how other features
> in Apache Cassandra have evolved. For example, 'full repair' came first,
> followed by 'incremental repair'; similarly, STCS was later complemented by
> LCS. This phased evolution allows us to balance safety, stability, and
> long-term capability.
>
> Given all this, I still prefer to move forward with the original proposal.
> It allows us to avoid paying the repair overhead most of the time.
>
>
> On Tue, Jun 24, 2025 at 3:25 PM Blake Eggleston <bl...@ultrablake.com>
> wrote:
>
>
> Those are both fair points. Once you start dealing with data loss though,
> maintaining guarantees is often impossible, so I’m not sure that torn
> writes or updated timestamps are dealbreakers, but I’m fine with tabling
> option 2 for now and seeing if we can figure out something better.
>
> Regarding the assassin cells, if you wanted to avoid explicitly agreeing
> on a value, you might be able to only issue them for repaired base data,
> which has been implicitly agreed upon.
>
> I think that or something like it is worth exploring. The idea would be to
> solve this issue as completely as anti-compaction would - but without
> having to rewrite sstables. I’d be interested to hear any ideas you have
> about how that might work.
>
> You basically need a mechanism to erase some piece of data that was
> written before a given wall clock time - regardless of cell timestamp, and
> without precluding future updates (in wall clock time) with earlier
> timestamps.
>
> On Mon, Jun 23, 2025, at 4:28 PM, Runtian Liu wrote:
>
> In the second option, we use the repair timestamp to re-update any cell or
> row we want to fix in the base table. This approach is problematic because
> it alters the write time of user-supplied data. Although Cassandra does not
> allow users to set timestamps for LWT writes, users may still rely on the
> update time. A key limitation of this approach is that it cannot fix cases
> where a view cell ends up in a future state while the base table remains
> correct. I now understand your point that Cassandra cannot handle this
> scenario today. However, as I mentioned earlier, the important distinction
> is that when this issue occurs in the base table, we accept the "incorrect"
> data as valid—but this is not acceptable for materialized views, since the
> source of truth (the base table) still holds the correct data.
>
> On Mon, Jun 23, 2025 at 12:05 PM Blake Eggleston <bl...@ultrablake.com>
> wrote:
>
>
> > Sorry, Blake—I was traveling last week and couldn’t reply to your email
> sooner.
>
> No worries, I’ll be taking some time off soon as well.
>
> > I don’t think the first or second option is ideal. We should treat the
> base table as the source of truth. Modifying it—or forcing an update on it,
> even if it’s just a timestamp change—is not a good approach and won’t solve
> all problems.
>
> I agree the first option probably isn’t the right way to go. Could you say
> a bit more about why the second option is not a good approach and which
> problems it won’t solve?
>
> On Sun, Jun 22, 2025, at 6:09 PM, Runtian Liu wrote:
>
> Sorry, Blake—I was traveling last week and couldn’t reply to your email
> sooner.
>
> > First - we interpret view data with higher timestamps than the base
> table as data that’s missing from the base and replicate it into the base
> table. The timestamp of the missing data may be below the paxos timestamp
> low bound so we’d have to adjust the paxos coordination logic to allow that
> in this case. Depending on how the view got this way it may also tear
> writes to the base table, breaking the write atomicity promise.
>
> As discussed earlier, we want this MV repair mechanism to handle all edge
> cases. However, it would be difficult to design it in a way that detects
> the root cause of each mismatch and repairs it accordingly. Additionally,
> as you mentioned, this approach could introduce other issues, such as
> violating the write atomicity guarantee.
>
> > Second - If this happens it means that we’ve either lost base table data
> or paxos metadata. If that happened, we could force a base table update
> that rewrites the current base state with new timestamps making the extra
> view data removable. However this wouldn’t fix the case where the view cell
> has a timestamp from the future - although that’s not a case that C* can
> fix today either.
>
> I don’t think the first or second option is ideal. We should treat the
> base table as the source of truth. Modifying it—or forcing an update on it,
> even if it’s just a timestamp change—is not a good approach and won’t solve
> all problems.
>
> > the idea to use anti-compaction makes a lot more sense now (in principle
> - I don’t think it’s workable in practice)
>
> I have one question regarding anti-compaction. Is the main concern that
> processing too much data during anti-compaction could cause issues for the
> cluster?
>
> > I guess you could add some sort of assassin cell that is meant to remove
> a cell with a specific timestamp and value, but is otherwise invisible.
>
> The idea of the assassination cell is interesting. To prevent data from
> being incorrectly removed during the repair process, we need to ensure a
> quorum of nodes is available and agrees on the same value before repairing
> a materialized view (MV) row or cell. However, this could be very
> expensive, as it requires coordination to repair even a single row.
>
> I think there are a few key differences between MV repair and normal
> anti-entropy repair:
>
>    1. For normal anti-entropy repair, there is no single source of truth.
>    All replicas act as sources of truth, and eventual consistency is
>    achieved by merging data across them. Even if some data is lost or bugs
>    result in future timestamps, the replicas will eventually converge on the
>    same (possibly incorrect) value, and that value becomes the accepted truth.
>    In contrast, MV repair relies on the base table as the source of truth and
>    modifies the MV if inconsistencies are detected. This means that in some
>    cases, simply merging data won’t resolve the issue, since Cassandra
>    resolves conflicts using timestamps and there’s no guarantee the base table
>    will always "win" unless we change the MV merging logic—which I’ve been
>    trying to avoid.
>
>    2. I see MV repair as more of an on-demand operation, whereas normal
>    anti-entropy repair needs to run regularly. This means we shouldn’t treat
>    MV repair the same way as existing repairs. When an operator initiates MV
>    repair, they need to ensure that sufficient resources are available to
>    support it.
>
>
> On Thu, Jun 12, 2025 at 8:53 AM Blake Eggleston <bl...@ultrablake.com>
> wrote:
>
>
> Got it, thanks for clearing that up. I’d seen the strict liveness code
> around but didn’t realize it was MV related and hadn’t dug into what it did
> or how it worked.
>
> I think you’re right about the row liveness update working for extra data
> with timestamps lower than the most recent base table update.
>
> I see what you mean about the timestamp from the future case. I thought of
> 3 options:
>
> First - we interpret view data with higher timestamps than the base table
> as data that’s missing from the base and replicate it into the base table.
> The timestamp of the missing data may be below the paxos timestamp low
> bound so we’d have to adjust the paxos coordination logic to allow that in
> this case. Depending on how the view got this way it may also tear writes
> to the base table, breaking the write atomicity promise.
>
> Second - If this happens it means that we’ve either lost base table data
> or paxos metadata. If that happened, we could force a base table update
> that rewrites the current base state with new timestamps making the extra
> view data removable. However this wouldn’t fix the case where the view cell
> has a timestamp from the future - although that’s not a case that C* can
> fix today either.
>
> Third - we add a new tombstone type or some mechanism to delete specific
> cells that doesn’t preclude correct writes with lower timestamps from being
> visible. I’m not sure how this would work, and the idea to use
> anti-compaction makes a lot more sense now (in principle - I don’t think
> it’s workable in practice). I guess you could add some sort of assassin
> cell that is meant to remove a cell with a specific timestamp and value,
> but is otherwise invisible. This seems dangerous though, since it’s likely
> there’s a replication problem that may resolve itself and the repair
> process would actually be removing data that the user intended to write.
>
> Paulo - I don’t think storage changes are off the table, but they do
> expand the scope and risk of the proposal, so we should be careful.
>
> On Wed, Jun 11, 2025, at 4:44 PM, Paulo Motta wrote:
>
>  > I’m not sure if this is the only edge case—there may be other issues as
> well. I’m also unsure whether we should redesign the tombstone handling for
> MVs, since that would involve changes to the storage engine. To minimize
> impact there, the original proposal was to rebuild the affected ranges
> using anti-compaction, just to be safe.
>
> I haven't been following the discussion but I think one of the issues with
> the materialized view "strict liveness" fix[1] is that we avoided making
> invasive changes to the storage engine at the time, but this was considered
> by Zhao on [1]. I think we shouldn't be trying to avoid updates to the
> storage format as part of the MV implementation, if this is what it takes
> to make MVs V2 reliable.
>
> [1] -
> https://issues.apache.org/jira/browse/CASSANDRA-11500?focusedCommentId=16101603&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16101603
>
> On Wed, Jun 11, 2025 at 7:02 PM Runtian Liu <curly...@gmail.com> wrote:
>
> The current design leverages strict liveness to shadow the old view row.
> When the view-indexed value changes from 'a' to 'b', no tombstone is
> written; instead, the old row is marked as expired by updating its liveness
> info with the timestamp of the change. If the column is later set back to
> 'a', the view row is re-inserted with a new, non-expired liveness info
> reflecting the latest timestamp.
>
> To delete an extra row in the materialized view (MV), we can likely use
> the same approach—marking it as shadowed by updating the liveness info.
> However, in the case of inconsistencies where a column in the MV has a
> higher timestamp than the corresponding column in the base table, this
> row-level liveness mechanism is insufficient.
>
> Even for the case where we delete the row by marking its liveness info as
> expired during repair, there are concerns. Since this introduces a data
> mutation as part of the repair process, it’s unclear whether there could be
> edge cases we’re missing. This approach may risk unexpected side effects if
> the repair logic is not carefully aligned with write path semantics.
>
> On Thu, Jun 12, 2025 at 3:59 AM Blake Eggleston <bl...@ultrablake.com>
> wrote:
>
>
> That’s a good point, although as described I don’t think that could ever
> work properly, even in normal operation. Either we’re misunderstanding
> something, or this is a flaw in the current MV design.
>
> Assuming changing the view indexed column results in a tombstone being
> applied to the view row for the previous value, if we wrote the other base
> columns (the non view indexed ones) to the view with the same timestamps
> they have on the base, then changing the view indexed value from ‘a’ to
> ‘b’, then back to ‘a’ would always cause this problem. I think you’d need
> to always update the column timestamps on the view to be >= the view column
> timestamp on the base.
>
> On Tue, Jun 10, 2025, at 11:38 PM, Runtian Liu wrote:
>
> > In the case of a missed update, we'll have a new value and we can send
> a tombstone to the view with the timestamp of the most recent update.
>
> > then something has gone wrong and we should issue a tombstone using the
> paxos repair timestamp as the tombstone timestamp.
>
> The current MV implementation uses “strict liveness” to determine whether
> a row is live. I believe that using regular tombstones during repair could
> cause problems. For example, consider a base table with schema (pk, ck, v1,
> v2) and a materialized view with schema (v1, pk, ck) -> v2. If, for some
> reason, we detect an extra row in the MV and delete it using a tombstone
> with the latest update timestamp, we may run into issues. Suppose we later
> update the base table’s v1 field to match the MV row we previously deleted,
> and the v2 value now has an older timestamp. In that case, the previously
> issued tombstone could still shadow the v2 column, which is unintended.
> That is why I was asking if we are going to introduce a new kind of
> tombstones. I’m not sure if this is the only edge case—there may be other
> issues as well. I’m also unsure whether we should redesign the tombstone
> handling for MVs, since that would involve changes to the storage engine.
> To minimize impact there, the original proposal was to rebuild the affected
> ranges using anti-compaction, just to be safe.
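
To illustrate the shadowing concern in the quoted paragraph above, here is a
small, self-contained Java toy — not Cassandra code — that assumes a regular
tombstone was written at the "latest update" timestamp and shows it shadowing
a v2 cell whose legitimate write carries an older timestamp:

    // Toy model of timestamp reconciliation: a cell survives only if its
    // timestamp is greater than any tombstone covering it (deletes win ties).
    final class ShadowingExample
    {
        record Cell(long timestamp, String value) {}

        static boolean visible(Cell cell, long tombstoneTimestamp)
        {
            return cell.timestamp() > tombstoneTimestamp;
        }

        public static void main(String[] args)
        {
            long tombstoneTs = 10;           // extra view row deleted at t=10 during repair
            Cell v2 = new Cell(5, "old-v2"); // v2 was last written in the base at t=5

            // Later, v1 is set back to the deleted value (say at t=12), so the
            // view row should reappear with v2 = "old-v2". The earlier tombstone
            // still shadows v2 because 5 <= 10, which is the unintended outcome.
            System.out.println("v2 visible in view? " + visible(v2, tombstoneTs)); // false
        }
    }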
>
> On Wed, Jun 11, 2025 at 1:20 AM Blake Eggleston <bl...@ultrablake.com>
> wrote:
>
>
>  Extra row in MV (assuming the tombstone is gone in the base table) — how
> should we fix this?
>
> This would mean that the base table had either updated or deleted a row
> and the view didn't receive the corresponding delete.
>
> In the case of a missed update, we'll have a new value and we can send a
> tombstone to the view with the timestamp of the most recent update. Since
> timestamps issued by paxos and accord writes are always increasing
> monotonically and don't have collisions, this is safe.
>
> In the case of a row deletion, we'd also want to send a tombstone with the
> same timestamp, however since tombstones can be purged, we may not have
> that information and would have to treat it like the view has a higher
> timestamp than the base table.
>
> Inconsistency (timestamps don’t match) — it’s easy to fix when the base
> table has higher timestamps, but how do we resolve it when the MV columns
> have higher timestamps?
>
>
> There are 2 ways this could happen. First is that a write failed and paxos
> repair hasn't completed it, which is expected, and the second is a
> replication bug or base table data loss. You'd need to compare the view
> timestamp to the paxos repair history to tell which it is. If the view
> timestamp is higher than the most recent paxos repair timestamp for the
> key, then it may just be a failed write and we should do nothing. If the
> view timestamp is less than the most recent paxos repair timestamp for that
> key and higher than the base timestamp, then something has gone wrong and
> we should issue a tombstone using the paxos repair timestamp as the
> tombstone timestamp. This is safe to do because the paxos repair timestamps
> act as a low bound for ballots paxos will process, so it wouldn't be
> possible for a legitimate write to be shadowed by this tombstone.
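
A minimal sketch of that comparison, assuming hypothetical names (this is not
an existing Cassandra API, just the decision logic from the quoted paragraph
above spelled out):

    enum ViewRepairAction { PROPAGATE_FROM_BASE, DO_NOTHING, TOMBSTONE_AT_PAXOS_REPAIR_TS }

    static ViewRepairAction resolve(long viewTs, long baseTs, long lastPaxosRepairTs)
    {
        // Base table is ahead (or equal): the normal path fixes the view.
        if (viewTs <= baseTs)
            return ViewRepairAction.PROPAGATE_FROM_BASE;

        // View is ahead of the base and also newer than the latest paxos repair
        // for this key: may just be a failed write paxos repair hasn't completed.
        if (viewTs > lastPaxosRepairTs)
            return ViewRepairAction.DO_NOTHING;

        // View timestamp sits below the paxos repair low bound but above the
        // base timestamp: something has gone wrong, so shadow it with a
        // tombstone at the paxos repair timestamp. No legitimate write can be
        // issued below that bound, so nothing valid gets shadowed.
        return ViewRepairAction.TOMBSTONE_AT_PAXOS_REPAIR_TS;
    }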
>
> Do we need to introduce a new kind of tombstone to shadow the rows in the
> MV for cases 2 and 3? If yes, how will this tombstone work? If no, how
> should we fix the MV data?
>
>
> No, a normal tombstone would work.
>
> On Tue, Jun 10, 2025, at 2:42 AM, Runtian Liu wrote:
>
> Okay, let’s put the efficiency discussion on hold for now. I want to make
> sure the actual repair process after detecting inconsistencies will work
> with the index-based solution.
>
> When a mismatch is detected, the MV replica will need to stream its index
> file to the base table replica. The base table will then perform a
> comparison between the two files.
>
> There are three cases we need to handle:
>
>    1. Missing row in MV — this is straightforward; we can propagate the
>    data to the MV.
>
>    2. Extra row in MV (assuming the tombstone is gone in the base table) —
>    how should we fix this?
>
>    3. Inconsistency (timestamps don’t match) — it’s easy to fix when the
>    base table has higher timestamps, but how do we resolve it when the MV
>    columns have higher timestamps?
>
> Do we need to introduce a new kind of tombstone to shadow the rows in the
> MV for cases 2 and 3? If yes, how will this tombstone work? If no, how
> should we fix the MV data?
>
> On Mon, Jun 9, 2025 at 11:00 AM Blake Eggleston <bl...@ultrablake.com>
> wrote:
>
>
> > hopefully we can come up with a solution that everyone agrees on.
>
> I’m sure we can, I think we’ve been making good progress
>
> > My main concern with the index-based solution is the overhead it adds to
> the hot path, as well as having to build indexes periodically.
>
> So the additional overhead of maintaining a storage attached index on the
> client write path is pretty minimal - it’s basically adding data to an in
> memory trie. It’s a little extra work and memory usage, but there isn’t any
> extra io or other blocking associated with it. I’d expect the latency
> impact to be negligible.
>
> > As mentioned earlier, this MV repair should be an infrequent operation
>
> I don’t think that’s a safe assumption. There are a lot of situations
> outside of data loss bugs where repair would need to be run.
>
> These use cases could probably be handled by repairing the view with other
> view replicas:
>
> Scrubbing corrupt sstables
> Node replacement via backup
>
> These use cases would need an actual MV repair to check consistency with
> the base table:
>
> Restoring a cluster from a backup
> Imported sstables via nodetool import
> Data loss from operator error
> Proactive consistency checks - i.e. preview repairs
>
> Even if it is an infrequent operation, when operators need it, it needs to
> be available and reliable.
>
> It’s a fact that there are clusters where non-incremental repairs are run
> on a cadence of a week or more to manage the overhead of validation
> compactions. Assuming the cluster doesn’t have any additional headroom,
> that would mean that any one of the above events could cause views to
> remain out of sync for up to a week while the full set of merkle trees is
> being built.
>
> This delay eliminates a lot of the value of repair as a risk mitigation
> tool. If I had to make a recommendation where a bad call could cost me my
> job, the prospect of a 7 day delay on repair would mean a strong no.
>
> Some users also run preview repair continuously to detect data consistency
> errors, so at least a subset of users will probably be running MV repairs
> continuously - at least in preview mode.
>
> That’s why I say that the replication path should be designed to never
> need repair, and MV repair should be designed to be prepared for the worst.
>
> > I’m wondering if it’s possible to enable or disable index building
> dynamically so that we don’t always incur the cost for something that’s
> rarely needed.
>
> I think this would be a really reasonable compromise as long as the
> default is on. That way it’s as safe as possible by default, but users who
> don’t care or have a separate system for repairing MVs can opt out.
>
> > I’m not sure what you mean by “data problems” here.
>
> I mean out of sync views - either due to bugs, operator error, corruption,
> etc
>
> > Also, this does scale with cluster size—I’ve compared it to full repair,
> and this MV repair should behave similarly. That means as long as full
> repair works, this repair should work as well.
>
> You could build the merkle trees at about the same cost as a full repair,
> but the actual data repair path is completely different for MV, and that’s
> the part that doesn’t scale well. As you know, with normal repair, we just
> stream data for ranges detected as out of sync. For Mvs, since the data
> isn’t in base partition order, the view data for an out of sync view range
> needs to be read out and streamed to every base replica that it’s detected
> a mismatch against. So in the example I gave with the 300 node cluster,
> you’re looking at reading and transmitting the same partition at least 100
> times in the best case, and the cost of this keeps going up as the cluster
> increases in size. That's the part that doesn't scale well.
>
> This is also one of the benefits of the index design. Since it stores data in
> segments that roughly correspond to points on the grid, you’re not
> rereading the same data over and over. A repair for a given grid point only
> reads an amount of data proportional to the data in common for the
> base/view grid point, and it’s stored in a small enough granularity that
> the base can calculate what data needs to be sent to the view without
> having to read the entire view partition.
>
> On Sat, Jun 7, 2025, at 7:42 PM, Runtian Liu wrote:
>
> Thanks, Blake. I’m open to iterating on the design, and hopefully we can
> come up with a solution that everyone agrees on.
>
> My main concern with the index-based solution is the overhead it adds to
> the hot path, as well as having to build indexes periodically. As mentioned
> earlier, this MV repair should be an infrequent operation, but the
> index-based approach shifts some of the work to the hot path in order to
> allow repairs that touch only a few nodes.
>
> I’m wondering if it’s possible to enable or disable index building
> dynamically so that we don’t always incur the cost for something that’s
> rarely needed.
>
> > it degrades operators ability to react to data problems by imposing a
> significant upfront processing burden on repair, and that it doesn’t scale
> well with cluster size
>
> I’m not sure what you mean by “data problems” here. Also, this does scale
> with cluster size—I’ve compared it to full repair, and this MV repair
> should behave similarly. That means as long as full repair works, this
> repair should work as well.
>
> For example, regardless of how large the cluster is, you can always enable
> Merkle tree building on 10% of the nodes at a time until all the trees are
> ready.
>
> I understand that coordinating this type of repair is harder than what we
> currently support, but with CEP-37, we should be able to handle this
> coordination without adding too much burden on the operator side.
>
> On Sat, Jun 7, 2025 at 8:28 AM Blake Eggleston <bl...@ultrablake.com>
> wrote:
>
>
> I don't see any outcome here that is good for the community though. Either
> Runtian caves and adopts your design that he (and I) consider inferior, or
> he is prevented from contributing this work.
>
>
> Hey Runtian, fwiw, these aren't the only 2 options. This isn’t a
> competition. We can collaborate and figure out the best approach to the
> problem. I’d like to keep discussing it if you’re open to iterating on the
> design.
>
> I’m not married to our proposal, it’s just the cleanest way we could think
> of to address what Jon and I both see as blockers in the current proposal.
> It’s not set in stone though.
>
> On Fri, Jun 6, 2025, at 1:32 PM, Benedict Elliott Smith wrote:
>
> Hmm, I am very surprised as I helped write that and I distinctly recall a
> specific goal was avoiding binding vetoes as they're so toxic.
>
> Ok, I guess you can block this work if you like.
>
> I don't see any outcome here that is good for the community though. Either
> Runtian caves and adopts your design that he (and I) consider inferior, or
> he is prevented from contributing this work. That isn't a functioning
> community in my mind, so I'll be noping out for a while, as I don't see
> much value here right now.
>
>
> On 2025/06/06 18:31:08 Blake Eggleston wrote:
> > Hi Benedict, that’s actually not true.
> >
> > Here’s a link to the project governance page:
> > https://cwiki.apache.org/confluence/display/CASSANDRA/Cassandra+Project+Governance
> >
> > The CEP section says:
> >
> > “*Once the proposal is finalized and any major committer dissent
> reconciled, call a [VOTE] on the ML to have the proposal adopted. The
> criteria for acceptance is consensus (3 binding +1 votes and no binding
> vetoes). The vote should remain open for 72 hours.*”
> >
> > So they’re definitely vetoable.
> >
> > Also note the part about “*Once the proposal is finalized and any major
> committer dissent reconciled,*” being a prerequisite for moving a CEP to
> [VOTE]. Given the as yet unreconciled committer dissent, it wouldn’t even
> be appropriate to move to a VOTE until we get to the bottom of this repair
> discussion.
> >
> > On Fri, Jun 6, 2025, at 12:31 AM, Benedict Elliott Smith wrote:
> > > > but the snapshot repair design is not a viable path forward. It’s
> the first iteration of a repair design. We’ve proposed a second iteration,
> and we’re open to a third iteration.
> > >
> > > I shan't be participating further in discussion, but I want to make a
> point of order. The CEP process has no vetoes, so you are not empowered to
> declare that a design is not viable without the input of the wider
> community.
> > >
> > >
> > > On 2025/06/05 03:58:59 Blake Eggleston wrote:
> > > > You can detect and fix the mismatch in a single round of repair, but
> the amount of work needed to do it is _significantly_ higher with snapshot
> repair. Consider a case where we have a 300 node cluster w/ RF 3, where
> each view partition contains entries mapping to every token range in the
> cluster - so 100 ranges. If we lose a view sstable, it will affect an
> entire row/column of the grid. Repair is going to scan all data in the
> mismatching view token ranges 100 times, and each base range once. So
> you’re looking at 200 range scans.
> > > >
> > > > Now, you may argue that you can merge the duplicate view scans into
> a single scan while you repair all token ranges in parallel. I’m skeptical
> that’s going to be achievable in practice, but even if it is, we’re now
> talking about the view replica hypothetically doing a pairwise repair with
> every other replica in the cluster at the same time. Neither of these
> options is workable.
> > > >
> > > > Let’s take a step back though, because I think we’re getting lost in
> the weeds.
> > > >
> > > > The repair design in the CEP has some high level concepts that make
> a lot of sense, the idea of repairing a grid is really smart. However, it
> has some significant drawbacks that remain unaddressed. I want this CEP to
> succeed, and I know Jon does too, but the snapshot repair design is not a
> viable path forward. It’s the first iteration of a repair design. We’ve
> proposed a second iteration, and we’re open to a third iteration. This part
> of the CEP process is meant to identify and address shortcomings, I don’t
> think that continuing to dissect the snapshot repair design is making
> progress in that direction.
> > > >
> > > > On Wed, Jun 4, 2025, at 2:04 PM, Runtian Liu wrote:
> > > > > >  We potentially have to do it several times on each node,
> depending on the size of the range. Smaller ranges increase the size of the
> board exponentially, larger ranges increase the number of SSTables that
> would be involved in each compaction.
> > > > > As described in the CEP example, this can be handled in a single
> round of repair. We first identify all the points in the grid that require
> repair, then perform anti-compaction and stream data based on a second scan
> over those identified points. This applies to the snapshot-based
> solution—without an index, repairing a single point in that grid requires
> scanning the entire base table partition (token range). In contrast, with
> the index-based solution—as in the example you referenced—if a large block
> of data is corrupted, even though the index is used for comparison, many
> key mismatches may occur. This can lead to random disk access to the
> original data files, which could cause performance issues. For the case you
> mentioned for the snapshot-based solution, it should not take months to
> repair all the data; one round of repair should be enough. The actual
> repair phase is split from the detection phase.
> > > > >
> > > > >
> > > > > On Thu, Jun 5, 2025 at 12:12 AM Jon Haddad <
> j...@rustyrazorblade.com> wrote:
> > > > >> > This isn’t really the whole story. The amount of wasted scans
> on index repairs is negligible. If a difference is detected with snapshot
> repairs though, you have to read the entire partition from both the view
> and base table to calculate what needs to be fixed.
> > > > >>
> > > > >> You nailed it.
> > > > >>
> > > > >> When the base table is converted to a view, and sent to the view,
> the information we have is that one of the view's partition keys needs a
> repair.  That's going to be different from the partition key of the base
> table.  As a result, on the base table, for each affected range, we'd have
> to issue another compaction across the entire set of sstables that could
> have the data the view needs (potentially many GB), in order to send over
> the corrected version of the partition, then send it over to the view.
> Without an index in place, we have to do yet another scan, per-affected
> range.
> > > > >>
> > > > >> Consider the case of a single corrupted SSTable on the view
> that's removed from the filesystem, or the data is simply missing after
> being restored from an inconsistent backup.  It presumably contains lots of
> partitions, which maps to base partitions all over the cluster, in a lot of
> different token ranges.  For every one of those ranges (hundreds, to tens
> of thousands of them given the checkerboard design), when finding the
> missing data in the base, you'll have to perform a compaction across all
> the SSTables that potentially contain the missing data just to rebuild the
> view-oriented partitions that need to be sent to the view.  The complexity
> of this operation can be looked at as O(N*M) where N and M are the number
> of ranges in the base table and the view affected by the corruption,
> respectively.  Without an index in place, finding the missing data is very
> expensive.  We potentially have to do it several times on each node,
> depending on the size of the range.  Smaller ranges increase the size of
> the board exponentially, larger ranges increase the number of SSTables that
> would be involved in each compaction.
> > > > >>
> > > > >> Then you send that data over to the view, the view does its
> anti-compaction thing, again, once per affected range.  So now the view has
> to do an anti-compaction once per block on the board that's affected by the
> missing data.
> > > > >>
> > > > >> Doing hundreds or thousands of these will add up pretty quickly.
> > > > >>
> > > > >> When I said that a repair could take months, this is what I had
> in mind.
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >> On Tue, Jun 3, 2025 at 11:10 AM Blake Eggleston <
> bl...@ultrablake.com> wrote:
> > > > >>>
> > > > >>> > Adds overhead in the hot path due to maintaining indexes.
> Extra memory needed during write path and compaction.
> > > > >>>
> > > > >>> I’d make the same argument about the overhead of maintaining the
> index that Jon just made about the disk space required. The relatively
> predictable overhead of maintaining the index as part of the write and
> compaction paths is a pro, not a con. Although you’re not always paying the
> cost of building a merkle tree with snapshot repair, it can impact the hot
> path and you do have to plan for it.
> > > > >>>
> > > > >>> > Verifies index content, not actual data—may miss
> low-probability errors like bit flips
> > > > >>>
> > > > >>> Presumably this could be handled by the views performing repair
> against each other? You could also periodically rebuild the index or
> perform checksums against the sstable content.
> > > > >>>
> > > > >>> > Extra data scan during inconsistency detection
> > > > >>> > Index: Since the data covered by certain indexes is not
> guaranteed to be fully contained within a single node as the topology
> changes, some data scans may be wasted.
> > > > >>> > Snapshots: No extra data scan
> > > > >>>
> > > > >>> This isn’t really the whole story. The amount of wasted scans on
> index repairs is negligible. If a difference is detected with snapshot
> repairs though, you have to read the entire partition from both the view
> and base table to calculate what needs to be fixed.
> > > > >>>
> > > > >>> On Tue, Jun 3, 2025, at 10:27 AM, Jon Haddad wrote:
> > > > >>>> One practical aspect that isn't immediately obvious is the disk
> space consideration for snapshots.
> > > > >>>>
> > > > >>>> When you have a table with a mixed workload using LCS or UCS
> with scaling parameters like L10 and initiate a repair, the disk usage will
> increase as long as the snapshot persists and the table continues to
> receive writes. This aspect is understood and factored into the design.
> > > > >>>>
> > > > >>>> However, a more nuanced point is the necessity to maintain
> sufficient disk headroom specifically for running repairs. This echoes the
> challenge with STCS compaction, where enough space must be available to
> accommodate the largest SSTables, even when they are not being actively
> compacted.
> > > > >>>>
> > > > >>>> For example, if a repair involves rewriting 100GB of SSTable
> data, you'll consistently need to reserve 100GB of free space to facilitate
> this.
> > > > >>>>
> > > > >>>> Therefore, while the snapshot-based approach leads to variable
> disk space utilization, operators must provision storage as if the maximum
> potential space will be used at all times to ensure repairs can be executed.
> > > > >>>>
> > > > >>>> This introduces a rate of churn dynamic, where the write
> throughput dictates the required extra disk space, rather than the existing
> on-disk data volume.
> > > > >>>>
> > > > >>>> If 50% of your SSTables are rewritten during a snapshot, you
> would need 50% free disk space. Depending on the workload, the snapshot
> method could consume significantly more disk space than an index-based
> approach. Conversely, for relatively static workloads, the index method
> might require more space. It's not as straightforward as stating "No extra
> disk space needed".
> > > > >>>>
> > > > >>>> Jon
> > > > >>>>
> > > > >>>> On Mon, Jun 2, 2025 at 2:49 PM Runtian Liu <curly...@gmail.com>
> wrote:
> > > > >>>>> > Regarding your comparison between approaches, I think you
> also need to take into account the other dimensions that have been brought
> up in this thread. Things like minimum repair times and vulnerability to
> outages and topology changes are the first that come to mind.
> > > > >>>>>
> > > > >>>>> Sure, I added a few more points.
> > > > >>>>>
> > > > >>>>> Comparison: *Index-Based Solution* vs. *Snapshot-Based Solution*
> > > > >>>>>
> > > > >>>>> 1. Hot path overhead
> > > > >>>>> Index: Adds overhead in the hot path due to maintaining indexes.
> > > > >>>>> Extra memory needed during write path and compaction.
> > > > >>>>> Snapshot: No impact on the hot path.
> > > > >>>>>
> > > > >>>>> 2. Extra disk usage when repair is not running
> > > > >>>>> Index: Requires additional disk space to store persistent indexes.
> > > > >>>>> Snapshot: No extra disk space needed.
> > > > >>>>>
> > > > >>>>> 3. Extra disk usage during repair
> > > > >>>>> Index: Minimal or no additional disk usage.
> > > > >>>>> Snapshot: Requires additional disk space for snapshots.
> > > > >>>>>
> > > > >>>>> 4. Fine-grained repair to deal with emergency situations / topology changes
> > > > >>>>> Index: Supports fine-grained repairs by targeting specific index
> > > > >>>>> ranges. This allows repair to be retried on smaller data sets,
> > > > >>>>> enabling incremental progress when repairing the entire table. This
> > > > >>>>> is especially helpful when there are down nodes or topology changes
> > > > >>>>> during repair, which are common in day-to-day operations.
> > > > >>>>> Snapshot: Coordination across all nodes is required over a long
> > > > >>>>> period of time. For each round of repair, if all replica nodes are
> > > > >>>>> down or if there is a topology change, the data ranges that were not
> > > > >>>>> covered will need to be repaired in the next round.
> > > > >>>>>
> > > > >>>>> 5. Validating data used in reads directly
> > > > >>>>> Index: Verifies index content, not actual data—may miss
> > > > >>>>> low-probability errors like bit flips.
> > > > >>>>> Snapshot: Verifies actual data content, providing stronger
> > > > >>>>> correctness guarantees.
> > > > >>>>>
> > > > >>>>> 6. Extra data scan during inconsistency detection
> > > > >>>>> Index: Since the data covered by certain indexes is not guaranteed
> > > > >>>>> to be fully contained within a single node as the topology changes,
> > > > >>>>> some data scans may be wasted.
> > > > >>>>> Snapshot: No extra data scan.
> > > > >>>>>
> > > > >>>>> 7. The overhead of actual data repair after an inconsistency is detected
> > > > >>>>> Index: Only indexes are streamed to the base table node, and the
> > > > >>>>> actual data being fixed can be as accurate as the row level.
> > > > >>>>> Snapshot: Anti-compaction is needed on the MV table, and potential
> > > > >>>>> over-streaming may occur due to the lack of row-level insight into
> > > > >>>>> data quality.
> > > > >>>>>
> > > > >>>>> > one of my biggest concerns I haven't seen discussed much is
> LOCAL_SERIAL/SERIAL on read
> > > > >>>>>
> > > > >>>>> Paxos v2 introduces an optimization where serial reads can be
> completed in just one round trip, reducing latency compared to traditional
> Paxos which may require multiple phases.
> > > > >>>>>
> > > > >>>>> > I think a refresh would be low-cost and give users the
> flexibility to run them however they want.
> > > > >>>>>
> > > > >>>>> I think this is an interesting idea. Does it suggest that the
> MV should be rebuilt on a regular schedule? It sounds like an extension of
> the snapshot-based approach—rather than detecting mismatches, we would
> periodically reconstruct a clean version of the MV based on the snapshot.
> This seems to diverge from the current MV model in Cassandra, where
> consistency between the MV and base table must be maintained continuously.
> This could be an extension of the CEP-48 work, where the MV is periodically
> rebuilt from a snapshot of the base table, assuming the user can tolerate
> some level of staleness in the MV data.
> > > > >>>>>
> > > > >>>
> > > >
> > >
> >
>