> but the snapshot repair design is not a viable path forward. It’s the first iteration of a repair design. We’ve proposed a second iteration, and we’re open to a third iteration.
I shan't be participating further in discussion, but I want to make a point of order. The CEP process has no vetoes, so you are not empowered to declare that a design is not viable without the input of the wider community.

On 2025/06/05 03:58:59 Blake Eggleston wrote:
> You can detect and fix the mismatch in a single round of repair, but the amount of work needed to do it is _significantly_ higher with snapshot repair. Consider a case where we have a 300 node cluster w/ RF 3, where each view partition contains entries mapping to every token range in the cluster - so 100 ranges. If we lose a view sstable, it will affect an entire row/column of the grid. Repair is going to scan all data in the mismatching view token ranges 100 times, and each base range once. So you’re looking at 200 range scans.
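>
> As a rough back-of-the-envelope sketch of that arithmetic (illustrative only, using the 300 nodes, RF 3, and 100 ranges assumed above):
>
>     # Rough scan-count estimate for losing a single view sstable under
>     # snapshot repair, per the grid design described above.
>     nodes = 300
>     rf = 3
>     base_ranges = nodes // rf       # 100 distinct token ranges
>
>     # The lost sstable dirties a whole row/column of the grid, so every
>     # (view range, base range) cell mismatches and must be compared.
>     view_range_scans = base_ranges  # view data re-read once per cell
>     base_range_scans = base_ranges  # each base range scanned once
>
>     print(view_range_scans + base_range_scans)  # 200 range scans
>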
> Now, you may argue that you can merge the duplicate view scans into a single scan while you repair all token ranges in parallel. I’m skeptical that’s going to be achievable in practice, but even if it is, we’re now talking about the view replica hypothetically doing a pairwise repair with every other replica in the cluster at the same time. Neither of these options is workable.
>
> Let’s take a step back though, because I think we’re getting lost in the weeds.
>
> The repair design in the CEP has some high level concepts that make a lot of sense - the idea of repairing a grid is really smart. However, it has some significant drawbacks that remain unaddressed. I want this CEP to succeed, and I know Jon does too, but the snapshot repair design is not a viable path forward. It’s the first iteration of a repair design. We’ve proposed a second iteration, and we’re open to a third iteration. This part of the CEP process is meant to identify and address shortcomings; I don’t think that continuing to dissect the snapshot repair design is making progress in that direction.
>
> On Wed, Jun 4, 2025, at 2:04 PM, Runtian Liu wrote:
>> > We potentially have to do it several times on each node, depending on the size of the range. Smaller ranges increase the size of the board exponentially, larger ranges increase the number of SSTables that would be involved in each compaction.
>>
>> As described in the CEP example, this can be handled in a single round of repair. We first identify all the points in the grid that require repair, then perform anti-compaction and stream data based on a second scan over those identified points. This applies to the snapshot-based solution—without an index, repairing a single point in that grid requires scanning the entire base table partition (token range). In contrast, with the index-based solution—as in the example you referenced—if a large block of data is corrupted, many key mismatches may occur even though the index is used for comparison. This can lead to random disk access to the original data files, which could cause performance issues. For the case you mentioned for the snapshot-based solution, it should not take months to repair all the data; one round of repair should be enough, since the actual repair phase is split from the detection phase.
>>
>> On Thu, Jun 5, 2025 at 12:12 AM Jon Haddad <j...@rustyrazorblade.com> wrote:
>>> > This isn’t really the whole story. The amount of wasted scans on index repairs is negligible. If a difference is detected with snapshot repairs though, you have to read the entire partition from both the view and base table to calculate what needs to be fixed.
>>>
>>> You nailed it.
>>>
>>> When the base table is converted to a view, and sent to the view, the information we have is that one of the view's partition keys needs a repair. That's going to be different from the partition key of the base table. As a result, on the base table, for each affected range, we'd have to issue another compaction across the entire set of sstables that could have the data the view needs (potentially many GB) in order to build the corrected version of the partition, then send it over to the view. Without an index in place, we have to do yet another scan, per affected range.
>>>
>>> Consider the case of a single corrupted SSTable on the view that's removed from the filesystem, or the data is simply missing after being restored from an inconsistent backup. It presumably contains lots of partitions, which map to base partitions all over the cluster, in a lot of different token ranges. For every one of those ranges (hundreds to tens of thousands of them, given the checkerboard design), when finding the missing data in the base, you'll have to perform a compaction across all the SSTables that potentially contain the missing data just to rebuild the view-oriented partitions that need to be sent to the view. The complexity of this operation can be looked at as O(N*M), where N and M are the number of ranges in the base table and the view affected by the corruption, respectively. Without an index in place, finding the missing data is very expensive. We potentially have to do it several times on each node, depending on the size of the range. Smaller ranges increase the size of the board exponentially, larger ranges increase the number of SSTables that would be involved in each compaction.
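>>>
>>> To put that O(N*M) shape in concrete terms (a hypothetical sketch; the range counts are placeholders, not measurements):
>>>
>>>     # Every (base range, view range) cell touched by the corruption
>>>     # needs its own compaction-style scan to rebuild the view-oriented
>>>     # partitions, since there's no index pointing at the missing rows.
>>>     def cells_to_repair(n_base_ranges, corrupted_view_ranges):
>>>         return [(b, v) for b in range(n_base_ranges)
>>>                 for v in corrupted_view_ranges]
>>>
>>>     cells = cells_to_repair(100, range(100))  # N = 100, M = 100
>>>     print(len(cells))  # 10000 scans/compactions: O(N*M) grows fast
>>>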
>>> Then you send that data over to the view, and the view does its anti-compaction thing, again, once per affected range. So now the view has to do an anti-compaction once per block on the board that's affected by the missing data.
>>>
>>> Doing hundreds or thousands of these will add up pretty quickly.
>>>
>>> When I said that a repair could take months, this is what I had in mind.
>>>
>>> On Tue, Jun 3, 2025 at 11:10 AM Blake Eggleston <bl...@ultrablake.com> wrote:
>>>> > Adds overhead in the hot path due to maintaining indexes. Extra memory needed during write path and compaction.
>>>>
>>>> I’d make the same argument about the overhead of maintaining the index that Jon just made about the disk space required. The relatively predictable overhead of maintaining the index as part of the write and compaction paths is a pro, not a con. Although you’re not always paying the cost of building a merkle tree with snapshot repair, it can impact the hot path and you do have to plan for it.
>>>>
>>>> > Verifies index content, not actual data—may miss low-probability errors like bit flips
>>>>
>>>> Presumably this could be handled by the views performing repair against each other? You could also periodically rebuild the index or perform checksums against the sstable content.
>>>>
>>>> > Extra data scan during inconsistency detection
>>>> > Index: Since the data covered by certain indexes is not guaranteed to be fully contained within a single node as the topology changes, some data scans may be wasted.
>>>> > Snapshots: No extra data scan
>>>>
>>>> This isn’t really the whole story. The amount of wasted scans on index repairs is negligible. If a difference is detected with snapshot repairs though, you have to read the entire partition from both the view and base table to calculate what needs to be fixed.
>>>>
>>>> On Tue, Jun 3, 2025, at 10:27 AM, Jon Haddad wrote:
>>>>> One practical aspect that isn't immediately obvious is the disk space consideration for snapshots.
>>>>>
>>>>> When you have a table with a mixed workload using LCS or UCS with scaling parameters like L10 and initiate a repair, the disk usage will increase as long as the snapshot persists and the table continues to receive writes. This aspect is understood and factored into the design.
>>>>>
>>>>> However, a more nuanced point is the necessity to maintain sufficient disk headroom specifically for running repairs. This echoes the challenge with STCS compaction, where enough space must be available to accommodate the largest SSTables, even when they are not being actively compacted.
>>>>>
>>>>> For example, if a repair involves rewriting 100GB of SSTable data, you'll consistently need to reserve 100GB of free space to facilitate this.
>>>>>
>>>>> Therefore, while the snapshot-based approach leads to variable disk space utilization, operators must provision storage as if the maximum potential space will be used at all times to ensure repairs can be executed.
>>>>>
>>>>> This introduces a rate-of-churn dynamic, where the write throughput dictates the required extra disk space, rather than the existing on-disk data volume.
>>>>>
>>>>> If 50% of your SSTables are rewritten during a snapshot, you would need 50% free disk space. Depending on the workload, the snapshot method could consume significantly more disk space than an index-based approach. Conversely, for relatively static workloads, the index method might require more space. It's not as straightforward as stating "No extra disk space needed".
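>>>>>
>>>>> As a rough sketch of that provisioning math (illustrative numbers only; the rewrite fraction stands in for write churn over the snapshot's lifetime):
>>>>>
>>>>>     # Headroom must cover the worst-case rewrite while the snapshot
>>>>>     # pins old sstables, even when no repair is currently running.
>>>>>     table_size_gb = 200
>>>>>     rewrite_fraction = 0.5   # share of sstables rewritten during repair
>>>>>
>>>>>     headroom_gb = table_size_gb * rewrite_fraction
>>>>>     print(headroom_gb)       # 100 GB must stay free at all times
>>>>>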
>>>>> Jon
>>>>>
>>>>> On Mon, Jun 2, 2025 at 2:49 PM Runtian Liu <curly...@gmail.com> wrote:
>>>>>> > Regarding your comparison between approaches, I think you also need to take into account the other dimensions that have been brought up in this thread. Things like minimum repair times and vulnerability to outages and topology changes are the first that come to mind.
>>>>>>
>>>>>> Sure, I added a few more points.
>>>>>>
>>>>>> (Perspective: Index-Based Solution vs. Snapshot-Based Solution)
>>>>>>
>>>>>> 1. Hot path overhead
>>>>>>    Index: Adds overhead in the hot path due to maintaining indexes. Extra memory needed during write path and compaction.
>>>>>>    Snapshot: No impact on the hot path.
>>>>>>
>>>>>> 2. Extra disk usage when repair is not running
>>>>>>    Index: Requires additional disk space to store persistent indexes.
>>>>>>    Snapshot: No extra disk space needed.
>>>>>>
>>>>>> 3. Extra disk usage during repair
>>>>>>    Index: Minimal or no additional disk usage.
>>>>>>    Snapshot: Requires additional disk space for snapshots.
>>>>>>
>>>>>> 4. Fine-grained repair to deal with emergency situations / topology changes
>>>>>>    Index: Supports fine-grained repairs by targeting specific index ranges. This allows repair to be retried on smaller data sets, enabling incremental progress when repairing the entire table. This is especially helpful when there are down nodes or topology changes during repair, which are common in day-to-day operations.
>>>>>>    Snapshot: Coordination across all nodes is required over a long period of time. For each round of repair, if all replica nodes are down or if there is a topology change, the data ranges that were not covered will need to be repaired in the next round.
>>>>>>
>>>>>> 5. Validating data used in reads directly
>>>>>>    Index: Verifies index content, not actual data—may miss low-probability errors like bit flips.
>>>>>>    Snapshot: Verifies actual data content, providing stronger correctness guarantees.
>>>>>>
>>>>>> 6. Extra data scan during inconsistency detection
>>>>>>    Index: Since the data covered by certain indexes is not guaranteed to be fully contained within a single node as the topology changes, some data scans may be wasted.
>>>>>>    Snapshot: No extra data scan.
>>>>>>
>>>>>> 7. The overhead of actual data repair after an inconsistency is detected
>>>>>>    Index: Only indexes are streamed to the base table node, and the actual data being fixed can be as accurate as the row level.
>>>>>>    Snapshot: Anti-compaction is needed on the MV table, and potential over-streaming may occur due to the lack of row-level insight into data quality.
>>>>>>
>>>>>>> one of my biggest concerns I haven't seen discussed much is LOCAL_SERIAL/SERIAL on read
>>>>>>
>>>>>> Paxos v2 introduces an optimization where serial reads can be completed in just one round trip, reducing latency compared to traditional Paxos, which may require multiple phases.
>>>>>>
>>>>>>> I think a refresh would be low-cost and give users the flexibility to run them however they want.
>>>>>>
>>>>>> I think this is an interesting idea. Does it suggest that the MV should be rebuilt on a regular schedule? It sounds like an extension of the snapshot-based approach—rather than detecting mismatches, we would periodically reconstruct a clean version of the MV based on the snapshot. This seems to diverge from the current MV model in Cassandra, where consistency between the MV and base table must be maintained continuously. This could be an extension of the CEP-48 work, where the MV is periodically rebuilt from a snapshot of the base table, assuming the user can tolerate some level of staleness in the MV data.