> but the snapshot repair design is not a viable path forward. It’s the first iteration of a repair design. We’ve proposed a second iteration, and we’re open to a third iteration.
I shan't be participating further in discussion, but I want to make a point of order. The CEP process has no vetoes, so you are not empowered to declare that a design is not viable without the input of the wider community.

On 2025/06/05 03:58:59 Blake Eggleston wrote:
> You can detect and fix the mismatch in a single round of repair, but the amount of work needed to do it is _significantly_ higher with snapshot repair. Consider a case where we have a 300 node cluster w/ RF 3, where each view partition contains entries mapping to every token range in the cluster - so 100 ranges. If we lose a view sstable, it will affect an entire row/column of the grid. Repair is going to scan all data in the mismatching view token ranges 100 times, and each base range once. So you’re looking at 200 range scans.
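>
> As a rough back-of-the-envelope sketch of that arithmetic (illustrative only, using the 300 nodes, RF 3, and 100 ranges assumed above):
>
>     # Rough scan-count estimate for losing a single view sstable under
>     # snapshot repair, per the grid design described above.
>     nodes = 300
>     rf = 3
>     base_ranges = nodes // rf       # 100 distinct token ranges
>
>     # The lost sstable dirties a whole row/column of the grid, so every
>     # (view range, base range) cell mismatches and must be compared.
>     view_range_scans = base_ranges  # view data re-read once per cell
>     base_range_scans = base_ranges  # each base range scanned once
>
>     print(view_range_scans + base_range_scans)  # 200 range scans
>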
> Now, you may argue that you can merge the duplicate view scans into a single scan while you repair all token ranges in parallel. I’m skeptical that’s going to be achievable in practice, but even if it is, we’re now talking about the view replica hypothetically doing a pairwise repair with every other replica in the cluster at the same time. Neither of these options is workable.
>
> Let’s take a step back though, because I think we’re getting lost in the weeds.
>
> The repair design in the CEP has some high level concepts that make a lot of sense - the idea of repairing a grid is really smart. However, it has some significant drawbacks that remain unaddressed. I want this CEP to succeed, and I know Jon does too, but the snapshot repair design is not a viable path forward. It’s the first iteration of a repair design. We’ve proposed a second iteration, and we’re open to a third iteration. This part of the CEP process is meant to identify and address shortcomings; I don’t think that continuing to dissect the snapshot repair design is making progress in that direction.
>
> On Wed, Jun 4, 2025, at 2:04 PM, Runtian Liu wrote:
>> > We potentially have to do it several times on each node, depending on the size of the range. Smaller ranges increase the size of the board exponentially, larger ranges increase the number of SSTables that would be involved in each compaction.
>>
>> As described in the CEP example, this can be handled in a single round of repair. We first identify all the points in the grid that require repair, then perform anti-compaction and stream data based on a second scan over those identified points. This applies to the snapshot-based solution—without an index, repairing a single point in that grid requires scanning the entire base table partition (token range). In contrast, with the index-based solution—as in the example you referenced—if a large block of data is corrupted, many key mismatches may occur even though the index is used for comparison. This can lead to random disk access to the original data files, which could cause performance issues. For the case you mentioned for the snapshot-based solution, it should not take months to repair all the data; one round of repair should be enough, since the actual repair phase is split from the detection phase.
>>
>> On Thu, Jun 5, 2025 at 12:12 AM Jon Haddad <j...@rustyrazorblade.com> wrote:
>>> > This isn’t really the whole story. The amount of wasted scans on index repairs is negligible. If a difference is detected with snapshot repairs though, you have to read the entire partition from both the view and base table to calculate what needs to be fixed.
>>>
>>> You nailed it.
>>>
>>> When the base table is converted to a view, and sent to the view, the information we have is that one of the view's partition keys needs a repair. That's going to be different from the partition key of the base table. As a result, on the base table, for each affected range, we'd have to issue another compaction across the entire set of sstables that could have the data the view needs (potentially many GB) in order to build the corrected version of the partition, then send it over to the view. Without an index in place, we have to do yet another scan, per affected range.
>>>
>>> Consider the case of a single corrupted SSTable on the view that's removed from the filesystem, or the data is simply missing after being restored from an inconsistent backup. It presumably contains lots of partitions, which map to base partitions all over the cluster, in a lot of different token ranges. For every one of those ranges (hundreds to tens of thousands of them, given the checkerboard design), when finding the missing data in the base, you'll have to perform a compaction across all the SSTables that potentially contain the missing data just to rebuild the view-oriented partitions that need to be sent to the view. The complexity of this operation can be looked at as O(N*M), where N and M are the number of ranges in the base table and the view affected by the corruption, respectively. Without an index in place, finding the missing data is very expensive. We potentially have to do it several times on each node, depending on the size of the range. Smaller ranges increase the size of the board exponentially, larger ranges increase the number of SSTables that would be involved in each compaction.
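>>>
>>> To put that O(N*M) shape in concrete terms (a hypothetical sketch; the range counts are placeholders, not measurements):
>>>
>>>     # Every (base range, view range) cell touched by the corruption
>>>     # needs its own compaction-style scan to rebuild the view-oriented
>>>     # partitions, since there's no index pointing at the missing rows.
>>>     def cells_to_repair(n_base_ranges, corrupted_view_ranges):
>>>         return [(b, v) for b in range(n_base_ranges)
>>>                 for v in corrupted_view_ranges]
>>>
>>>     cells = cells_to_repair(100, range(100))  # N = 100, M = 100
>>>     print(len(cells))  # 10000 scans/compactions: O(N*M) grows fast
>>>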
>>> Then you send that data over to the view, and the view does its anti-compaction thing, again, once per affected range. So now the view has to do an anti-compaction once per block on the board that's affected by the missing data.
>>>
>>> Doing hundreds or thousands of these will add up pretty quickly.
>>>
>>> When I said that a repair could take months, this is what I had in mind.
>>>
>>> On Tue, Jun 3, 2025 at 11:10 AM Blake Eggleston <bl...@ultrablake.com> wrote:
>>>> > Adds overhead in the hot path due to maintaining indexes. Extra memory needed during write path and compaction.
>>>>
>>>> I’d make the same argument about the overhead of maintaining the index that Jon just made about the disk space required. The relatively predictable overhead of maintaining the index as part of the write and compaction paths is a pro, not a con. Although you’re not always paying the cost of building a merkle tree with snapshot repair, it can impact the hot path and you do have to plan for it.
>>>>
>>>> > Verifies index content, not actual data—may miss low-probability errors like bit flips
>>>>
>>>> Presumably this could be handled by the views performing repair against each other? You could also periodically rebuild the index or perform checksums against the sstable content.
>>>>
>>>> > Extra data scan during inconsistency detection
>>>> > Index: Since the data covered by certain indexes is not guaranteed to be fully contained within a single node as the topology changes, some data scans may be wasted.
>>>> > Snapshots: No extra data scan
>>>>
>>>> This isn’t really the whole story. The amount of wasted scans on index repairs is negligible. If a difference is detected with snapshot repairs though, you have to read the entire partition from both the view and base table to calculate what needs to be fixed.
>>>>
>>>> On Tue, Jun 3, 2025, at 10:27 AM, Jon Haddad wrote:
>>>>> One practical aspect that isn't immediately obvious is the disk space consideration for snapshots.
>>>>>
>>>>> When you have a table with a mixed workload using LCS or UCS with scaling parameters like L10 and initiate a repair, the disk usage will increase as long as the snapshot persists and the table continues to receive writes. This aspect is understood and factored into the design.
>>>>>
>>>>> However, a more nuanced point is the necessity to maintain sufficient disk headroom specifically for running repairs. This echoes the challenge with STCS compaction, where enough space must be available to accommodate the largest SSTables, even when they are not being actively compacted.
>>>>>
>>>>> For example, if a repair involves rewriting 100GB of SSTable data, you'll consistently need to reserve 100GB of free space to facilitate this.
>>>>>
>>>>> Therefore, while the snapshot-based approach leads to variable disk space utilization, operators must provision storage as if the maximum potential space will be used at all times to ensure repairs can be executed.
>>>>>
>>>>> This introduces a rate-of-churn dynamic, where the write throughput dictates the required extra disk space, rather than the existing on-disk data volume.
>>>>>
>>>>> If 50% of your SSTables are rewritten during a snapshot, you would need 50% free disk space. Depending on the workload, the snapshot method could consume significantly more disk space than an index-based approach. Conversely, for relatively static workloads, the index method might require more space. It's not as straightforward as stating "No extra disk space needed".
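>>>>>
>>>>> As a rough sketch of that provisioning math (illustrative numbers only; the rewrite fraction stands in for write churn over the snapshot's lifetime):
>>>>>
>>>>>     # Headroom must cover the worst-case rewrite while the snapshot
>>>>>     # pins old sstables, even when no repair is currently running.
>>>>>     table_size_gb = 200
>>>>>     rewrite_fraction = 0.5   # share of sstables rewritten during repair
>>>>>
>>>>>     headroom_gb = table_size_gb * rewrite_fraction
>>>>>     print(headroom_gb)       # 100 GB must stay free at all times
>>>>>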
>>>>> Jon
>>>>>
>>>>> On Mon, Jun 2, 2025 at 2:49 PM Runtian Liu <curly...@gmail.com> wrote:
>>>>>> > Regarding your comparison between approaches, I think you also need to take into account the other dimensions that have been brought up in this thread. Things like minimum repair times and vulnerability to outages and topology changes are the first that come to mind.
>>>>>>
>>>>>> Sure, I added a few more points.
>>>>>>
>>>>>> (Perspective: Index-Based Solution vs. Snapshot-Based Solution)
>>>>>>
>>>>>> 1. Hot path overhead
>>>>>>    Index: Adds overhead in the hot path due to maintaining indexes. Extra memory needed during write path and compaction.
>>>>>>    Snapshot: No impact on the hot path.
>>>>>>
>>>>>> 2. Extra disk usage when repair is not running
>>>>>>    Index: Requires additional disk space to store persistent indexes.
>>>>>>    Snapshot: No extra disk space needed.
>>>>>>
>>>>>> 3. Extra disk usage during repair
>>>>>>    Index: Minimal or no additional disk usage.
>>>>>>    Snapshot: Requires additional disk space for snapshots.
>>>>>>
>>>>>> 4. Fine-grained repair to deal with emergency situations / topology changes
>>>>>>    Index: Supports fine-grained repairs by targeting specific index ranges. This allows repair to be retried on smaller data sets, enabling incremental progress when repairing the entire table. This is especially helpful when there are down nodes or topology changes during repair, which are common in day-to-day operations.
>>>>>>    Snapshot: Coordination across all nodes is required over a long period of time. For each round of repair, if all replica nodes are down or if there is a topology change, the data ranges that were not covered will need to be repaired in the next round.
>>>>>>
>>>>>> 5. Validating data used in reads directly
>>>>>>    Index: Verifies index content, not actual data—may miss low-probability errors like bit flips.
>>>>>>    Snapshot: Verifies actual data content, providing stronger correctness guarantees.
>>>>>>
>>>>>> 6. Extra data scan during inconsistency detection
>>>>>>    Index: Since the data covered by certain indexes is not guaranteed to be fully contained within a single node as the topology changes, some data scans may be wasted.
>>>>>>    Snapshot: No extra data scan.
>>>>>>
>>>>>> 7. The overhead of actual data repair after an inconsistency is detected
>>>>>>    Index: Only indexes are streamed to the base table node, and the actual data being fixed can be as accurate as the row level.
>>>>>>    Snapshot: Anti-compaction is needed on the MV table, and potential over-streaming may occur due to the lack of row-level insight into data quality.
>>>>>>
>>>>>>> one of my biggest concerns I haven't seen discussed much is LOCAL_SERIAL/SERIAL on read
>>>>>>
>>>>>> Paxos v2 introduces an optimization where serial reads can be completed in just one round trip, reducing latency compared to traditional Paxos, which may require multiple phases.
>>>>>>
>>>>>>> I think a refresh would be low-cost and give users the flexibility to run them however they want.
>>>>>>
>>>>>> I think this is an interesting idea. Does it suggest that the MV should be rebuilt on a regular schedule? It sounds like an extension of the snapshot-based approach—rather than detecting mismatches, we would periodically reconstruct a clean version of the MV based on the snapshot. This seems to diverge from the current MV model in Cassandra, where consistency between the MV and base table must be maintained continuously. This could be an extension of the CEP-48 work, where the MV is periodically rebuilt from a snapshot of the base table, assuming the user can tolerate some level of staleness in the MV data.