> This isn’t really the whole story. The amount of wasted scans on index
repairs is negligible. If a difference is detected with snapshot repairs
though, you have to read the entire partition from both the view and base
table to calculate what needs to be fixed.

You nailed it.

When the base table data is converted to view form and sent to the view, the
information we have is that one of the view's partition keys needs a
repair.  That key is going to be different from the partition key of the
base table.  As a result, on the base table, for each affected range, we'd
have to issue another compaction across the entire set of sstables that
could hold the data the view needs (potentially many GB), in order to
rebuild the corrected version of the partition and send it over to the
view.  Without an index in place, we have to do yet another scan per
affected range.
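
As a rough sketch of what that base-side loop looks like (Python-style
pseudocode; scan_and_merge, stream_to_view, and sstables_for_range are
hypothetical placeholders I made up to mark the expensive steps, not
Cassandra APIs):

    # Sketch of the base-side work under the snapshot approach, with no index.
    # The helpers are placeholders only; they are not Cassandra internals.

    def scan_and_merge(sstables, view_range):
        """Placeholder for a compaction-style pass over every candidate
        sstable (potentially many GB) to rebuild view-oriented partitions."""
        return []

    def stream_to_view(view_range, partitions):
        """Placeholder for streaming corrected partitions to the view replica."""
        pass

    def repair_view_range(view_range, base_ranges, sstables_for_range):
        # The view partition key tells us nothing about where the contributing
        # rows sit in the base table, so every affected base range needs its
        # own scan across all the sstables that could hold the data.
        for base_range in base_ranges:
            candidates = sstables_for_range(base_range)
            corrected = scan_and_merge(candidates, view_range)
            stream_to_view(view_range, corrected)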

Consider the case of a single corrupted SSTable on the view that's been
removed from the filesystem, or whose data is simply missing after being
restored from an inconsistent backup.  It presumably contains lots of
partitions, which map to base partitions all over the cluster, in a lot of
different token ranges.  For every one of those ranges (hundreds to tens of
thousands of them, given the checkerboard design), finding the missing data
in the base means performing a compaction across all the SSTables that
potentially contain it, just to rebuild the view-oriented partitions that
need to be sent to the view.  The complexity of this operation is roughly
O(N*M), where N and M are the number of ranges in the base table and the
view affected by the corruption, respectively.  Without an index in place,
finding the missing data is very expensive, and we potentially have to do
it several times on each node, depending on the size of the range.  Smaller
ranges blow up the number of blocks on the board, while larger ranges
increase the number of SSTables involved in each compaction.
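
To put rough numbers on that O(N*M) behavior, here's a back-of-envelope
calculation; every figure below is invented purely for illustration, not a
measurement:

    # Back-of-envelope illustration of the O(N*M) repair cost described above.
    # All numbers are made up for illustration; none are measurements.

    n_base_ranges = 100       # N: affected ranges on the base table
    m_view_ranges = 100       # M: affected ranges on the view
    gb_per_compaction = 5     # "potentially many GB" scanned per compaction

    compactions = n_base_ranges * m_view_ranges
    total_gb_scanned = compactions * gb_per_compaction

    print(f"{compactions} compactions, ~{total_gb_scanned / 1024:.0f} TB scanned")
    # -> 10000 compactions, ~49 TB scanned
    #
    # Even at a sustained ~200 MB/s of compaction throughput, ~49 TB is
    # roughly three days of nonstop scanning on one node, before the
    # view-side anti-compactions and before the work repeats elsewhere.

Push N and M toward the upper end of the range counts above and the totals
grow accordingly.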

Then you send that data over to the view, and the view does its
anti-compaction step, again once per affected range.  So the view ends up
doing an anti-compaction for every block on the board that's touched by the
missing data.

Doing hundreds or thousands of these will add up pretty quickly.

When I said that a repair could take months, this is what I had in mind.




On Tue, Jun 3, 2025 at 11:10 AM Blake Eggleston <bl...@ultrablake.com>
wrote:

> > Adds overhead in the hot path due to maintaining indexes. Extra memory
> needed during write path and compaction.
>
> I’d make the same argument about the overhead of maintaining the index
> that Jon just made about the disk space required. The relatively
> predictable overhead of maintaining the index as part of the write and
> compaction paths is a pro, not a con. Although you’re not always paying the
> cost of building a merkle tree with snapshot repair, it can impact the hot
> path and you do have to plan for it.
>
> > Verifies index content, not actual data—may miss low-probability errors
> like bit flips
>
> Presumably this could be handled by the views performing repair against
> each other? You could also periodically rebuild the index or perform
> checksums against the sstable content.
>
> > Extra data scan during inconsistency detection
> > Index: Since the data covered by certain indexes is not guaranteed to be
> fully contained within a single node as the topology changes, some data
> scans may be wasted.
> > Snapshots: No extra data scan
>
> This isn’t really the whole story. The amount of wasted scans on index
> repairs is negligible. If a difference is detected with snapshot repairs
> though, you have to read the entire partition from both the view and base
> table to calculate what needs to be fixed.
>
> On Tue, Jun 3, 2025, at 10:27 AM, Jon Haddad wrote:
>
> One practical aspect that isn't immediately obvious is the disk space
> consideration for snapshots.
>
> When you have a table with a mixed workload using LCS or UCS with scaling
> parameters like L10 and initiate a repair, the disk usage will increase as
> long as the snapshot persists and the table continues to receive writes.
> This aspect is understood and factored into the design.
>
> However, a more nuanced point is the necessity to maintain sufficient disk
> headroom specifically for running repairs. This echoes the challenge with
> STCS compaction, where enough space must be available to accommodate the
> largest SSTables, even when they are not being actively compacted.
>
> For example, if a repair involves rewriting 100GB of SSTable data, you'll
> consistently need to reserve 100GB of free space to facilitate this.
>
> Therefore, while the snapshot-based approach leads to variable disk space
> utilization, operators must provision storage as if the maximum potential
> space will be used at all times to ensure repairs can be executed.
>
> This introduces a rate of churn dynamic, where the write throughput
> dictates the required extra disk space, rather than the existing on-disk
> data volume.
>
> If 50% of your SSTables are rewritten during a snapshot, you would need
> 50% free disk space. Depending on the workload, the snapshot method could
> consume significantly more disk space than an index-based approach.
> Conversely, for relatively static workloads, the index method might require
> more space. It's not as straightforward as stating "No extra disk space
> needed".
>
> Jon
>
> On Mon, Jun 2, 2025 at 2:49 PM Runtian Liu <curly...@gmail.com> wrote:
>
> > Regarding your comparison between approaches, I think you also need to
> take into account the other dimensions that have been brought up in this
> thread. Things like minimum repair times and vulnerability to outages and
> topology changes are the first that come to mind.
>
> Sure, I added a few more points.
>
> *Perspective* / *Index-Based Solution* / *Snapshot-Based Solution*
>
> 1. Hot path overhead
> Index: Adds overhead in the hot path due to maintaining indexes. Extra
> memory needed during write path and compaction.
> Snapshot: No impact on the hot path
>
> 2. Extra disk usage when repair is not running
> Index: Requires additional disk space to store persistent indexes
> Snapshot: No extra disk space needed
>
> 3. Extra disk usage during repair
> Index: Minimal or no additional disk usage
> Snapshot: Requires additional disk space for snapshots
>
> 4. Fine-grained repair to deal with emergency situations / topology changes
> Index: Supports fine-grained repairs by targeting specific index ranges.
> This allows repair to be retried on smaller data sets, enabling incremental
> progress when repairing the entire table. This is especially helpful when
> there are down nodes or topology changes during repair, which are common in
> day-to-day operations.
> Snapshot: Coordination across all nodes is required over a long period of
> time. For each round of repair, if all replica nodes are down or if there
> is a topology change, the data ranges that were not covered will need to be
> repaired in the next round.
>
> 5. Validating data used in reads directly
> Index: Verifies index content, not actual data; may miss low-probability
> errors like bit flips
> Snapshot: Verifies actual data content, providing stronger correctness
> guarantees
>
> 6. Extra data scan during inconsistency detection
> Index: Since the data covered by certain indexes is not guaranteed to be
> fully contained within a single node as the topology changes, some data
> scans may be wasted.
> Snapshot: No extra data scan
>
> 7. The overhead of actual data repair after an inconsistency is detected
> Index: Only indexes are streamed to the base table node, and the actual
> data being fixed can be as accurate as the row level.
> Snapshot: Anti-compaction is needed on the MV table, and potential
> over-streaming may occur due to the lack of row-level insight into data
> quality.
>
> > one of my biggest concerns I haven't seen discussed much is
> LOCAL_SERIAL/SERIAL on read
>
> Paxos v2 introduces an optimization where serial reads can be completed in
> just one round trip, reducing latency compared to traditional Paxos which
> may require multiple phases.
>
> > I think a refresh would be low-cost and give users the flexibility to
> run them however they want.
>
> I think this is an interesting idea. Does it suggest that the MV should be
> rebuilt on a regular schedule? It sounds like an extension of the
> snapshot-based approach—rather than detecting mismatches, we would
> periodically reconstruct a clean version of the MV based on the snapshot.
> This seems to diverge from the current MV model in Cassandra, where
> consistency between the MV and base table must be maintained continuously.
> This could be an extension of the CEP-48 work, where the MV is periodically
> rebuilt from a snapshot of the base table, assuming the user can tolerate
> some level of staleness in the MV data.
>
>
>
