Thanks for the updated table, Runtian, but I think it misses the point. The problem with snapshot-based repair is that it degrades operators’ ability to react to data problems by imposing a significant upfront processing burden on repair, and that it doesn’t scale well with cluster size, as I illustrated in my last email. These issues are non-starters, and until they’re worked out, comparisons are premature. You’re not making an apples-to-apples comparison.
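To make the scaling point concrete, here’s the back-of-envelope math from my last email as a quick sketch. It assumes the lost-sstable scenario I described there, one token range per node, and that the affected view partitions reference every base range; the numbers are illustrative, not measurements.

    # Rough count of the range scans snapshot-based repair needs when a single
    # view sstable is lost and its partitions map to every base token range.
    # Assumes one token range per node; illustrative, not measured.
    def snapshot_repair_range_scans(nodes: int, rf: int = 3) -> int:
        base_ranges = nodes // rf   # distinct base ranges to compare against
        view_scans = base_ranges    # affected view range re-scanned once per base range
        base_scans = base_ranges    # each base range scanned once
        return view_scans + base_scans

    for n in (30, 300, 3000):
        print(n, "nodes ->", snapshot_repair_range_scans(n), "range scans")
    # 30 -> 20, 300 -> 200, 3000 -> 2000: the repair work grows with cluster
    # size even though the amount of lost data stays the same.

The exact constants don’t matter much; the point is that the cost of recovering from a single lost sstable scales with the size of the cluster rather than with the amount of data lost.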
The original MV was a huge embarrassment for the project. It was really a low point for the credibility of the C* dev community and our ability to deliver features without critical design flaws. I think it’s a useful feature, and I’m all for fixing it, but if we’re going to go to the community and tell people that MVs are fixed and they can use them now, it needs to be bulletproof. I feel good about the query execution side of the proposal; clearly a lot of thought has gone into designing a really solid consensus-based MV system. The proposed repair design is not yet at the same level of maturity. Not needing repair is the right goal for the query execution design, but the repair design needs to be prepared for the worst, and it’s not. Repair is where a lot of the original MV use cases completely fell over. It was bad enough that views became inconsistent so easily, but the fact that trying to repair them could take down clusters was where it really became a disaster. Personally, I think shelving MV v2 would be better for the project than moving forward with a repair mechanism with these flaws. Users are going to have consistency problems, both self-inflicted ones and from bugs, and we need a solid and reliable system for fixing them if we’re going to advertise MVs as ready for production.

On Fri, Jun 6, 2025, at 2:03 AM, Runtian Liu wrote:
> Thanks, Blake and Jon, for your feedback—I really appreciate your time on this topic and your efforts to help make this CEP a success. Throughout this discussion, we've explored many interesting problems, and your input helped me better understand how the index-based solution would work. Now that both approaches are clearly understood, here’s a cost comparison based on real-world use cases, as illustrated in the table below. Please let me know if anything else is missing; I can help facilitate a comparison between both approaches so we, as an Apache Cassandra community, can make an informed decision.
>
> Operation: Periodic inconsistency detection (full data set)
>   - CPU: Index-based: Higher (maintaining indexes increases CPU usage in the hot path and compaction, leading to the need for higher CPU provisioning). Snapshot-based: Lower.
>   - Memory: Index-based: Higher (maintaining indexes increases memory usage in the hot path and compaction). Snapshot-based: Lower.
>   - Disk: Index-based: Lower. Snapshot-based: Higher (snapshots use extra disk space if SSTables are compacted away, with worst-case usage reaching 2×).
>
> Operation: Ad hoc inconsistency detection (partial data set)
>   - CPU/Memory/Disk: Index-based: Supported; the cost is the same as above. Snapshot-based: Not supported.
>
> Operation: Data repair
>   - CPU: Index-based: Depending on the number of rows to be repaired, if many mismatches are detected—such as a whole block of data missing during an outage recovery scenario—the overhead of random data access in an index-based solution can be high. However, for normal use cases where only a few rows in a token range need to be fixed, the index-based solution requires significantly fewer resources due to its row-level insight into the mismatched rows. Snapshot-based: If only a few rows in a token range need to be fixed, this approach requires scanning the entire partitions of both the base table and the MV table to repair the data. Additionally, anti-compaction is required, which can result in higher CPU usage. However, if a full token range needs to be rebuilt due to a hardware failure, this approach becomes more efficient because it avoids random disk access.
>   - Disk: Index-based: Only index files are exchanged; the actual data streamed is exactly the inconsistent data that needs to be repaired. Snapshot-based: Over-streaming, due to blocks of data being streamed to the MV node.
>
> Operation: Topology changes challenge (one painful complexity)
>   - Index-based: Indexes need to be rebuilt from time to time to maintain their effectiveness.
>   - Snapshot-based: If all replicas are replaced during the detection phase, then repairing 100% of the data is not feasible, which means we have to wait until the next cycle, delaying the overall duration.
>
> Overall: The overall resources required by the two approaches depend on the use case. In general, the snapshot-based solution consumes more disk space, while the index-based solution requires more CPU and memory.
>
> I think that although this is called MV repair, it's quite different from regular repairs in Cassandra. Standard repairs are designed to compare replicas and update them to the latest version. That’s why they must complete within the tombstone gc_grace_seconds period to avoid data loss. However, in the case of MV repair, how do we define what’s “safe” in terms of data quality? Regardless of which solution we choose, this MV repair process doesn’t follow the same gc_grace_seconds requirement.
>
> From my perspective, materialized views should always remain in sync with the base table, and ideally, no repair would be necessary. However, as outlined in the CEP, there are scenarios where we may need to monitor or repair inconsistencies between the two tables. While such repairs are necessary, they can—and should—remain infrequent. If our repair detection job consistently finds a large number of mismatches, I would prefer to address the root cause in the hot path rather than relying on repair.
>
> As shown in the table above, there’s no silver bullet or universally simple solution. Ideally, we would support both options and let operators choose based on their needs. However, given the complexity of implementing either approach, we need to select one for the initial CEP deliverables.
>
> Overall, I’m leaning toward the snapshot-based approach. The trade-off—additional disk usage during infrequent repairs—seems more acceptable than adding CPU and memory overhead to the hot path, especially given that MV repair is expected to be an infrequent or on-demand operation, unlike full or incremental repairs which must run regularly.
>
> On Fri, Jun 6, 2025 at 4:32 PM Benedict Elliott Smith <bened...@apache.org> wrote:
>> > but the snapshot repair design is not a viable path forward. It’s the first iteration of a repair design. We’ve proposed a second iteration, and we’re open to a third iteration.
>>
>> I shan't be participating further in discussion, but I want to make a point of order. The CEP process has no vetoes, so you are not empowered to declare that a design is not viable without the input of the wider community.
>>
>> On 2025/06/05 03:58:59 Blake Eggleston wrote:
>> > You can detect and fix the mismatch in a single round of repair, but the amount of work needed to do it is _significantly_ higher with snapshot repair. Consider a case where we have a 300-node cluster w/ RF 3, where each view partition contains entries mapping to every token range in the cluster - so 100 ranges.
>> > If we lose a view sstable, it will affect an entire row/column of the grid. Repair is going to scan all data in the mismatching view token ranges 100 times, and each base range once. So you’re looking at 200 range scans.
>> >
>> > Now, you may argue that you can merge the duplicate view scans into a single scan while you repair all token ranges in parallel. I’m skeptical that’s going to be achievable in practice, but even if it is, we’re now talking about the view replica hypothetically doing a pairwise repair with every other replica in the cluster at the same time. Neither of these options is workable.
>> >
>> > Let’s take a step back though, because I think we’re getting lost in the weeds.
>> >
>> > The repair design in the CEP has some high-level concepts that make a lot of sense; the idea of repairing a grid is really smart. However, it has some significant drawbacks that remain unaddressed. I want this CEP to succeed, and I know Jon does too, but the snapshot repair design is not a viable path forward. It’s the first iteration of a repair design. We’ve proposed a second iteration, and we’re open to a third iteration. This part of the CEP process is meant to identify and address shortcomings, and I don’t think that continuing to dissect the snapshot repair design is making progress in that direction.
>> >
>> > On Wed, Jun 4, 2025, at 2:04 PM, Runtian Liu wrote:
>> > > > We potentially have to do it several times on each node, depending on the size of the range. Smaller ranges increase the size of the board exponentially, larger ranges increase the number of SSTables that would be involved in each compaction.
>> > >
>> > > As described in the CEP example, this can be handled in a single round of repair. We first identify all the points in the grid that require repair, then perform anti-compaction and stream data based on a second scan over those identified points. This applies to the snapshot-based solution—without an index, repairing a single point in that grid requires scanning the entire base table partition (token range). In contrast, with the index-based solution—as in the example you referenced—if a large block of data is corrupted, even though the index is used for comparison, many key mismatches may occur. This can lead to random disk access to the original data files, which could cause performance issues. For the case you mentioned for the snapshot-based solution, it should not take months to repair all the data; instead, one round of repair should be enough. The actual repair phase is split from the detection phase.
>> > >
>> > > On Thu, Jun 5, 2025 at 12:12 AM Jon Haddad <j...@rustyrazorblade.com> wrote:
>> > >> > This isn’t really the whole story. The amount of wasted scans on index repairs is negligible. If a difference is detected with snapshot repairs though, you have to read the entire partition from both the view and base table to calculate what needs to be fixed.
>> > >>
>> > >> You nailed it.
>> > >>
>> > >> When the base table data is converted to a view and sent to the view, the information we have is that one of the view's partition keys needs a repair. That's going to be different from the partition key of the base table.
>> > >> As a result, on the base table, for each affected range, we'd have to issue another compaction across the entire set of sstables that could have the data the view needs (potentially many GB) in order to build the corrected version of the partition, then send it over to the view. Without an index in place, we have to do yet another scan per affected range.
>> > >>
>> > >> Consider the case of a single corrupted SSTable on the view that's removed from the filesystem, or the data is simply missing after being restored from an inconsistent backup. It presumably contains lots of partitions, which map to base partitions all over the cluster, in a lot of different token ranges. For every one of those ranges (hundreds to tens of thousands of them, given the checkerboard design), when finding the missing data in the base, you'll have to perform a compaction across all the SSTables that potentially contain the missing data just to rebuild the view-oriented partitions that need to be sent to the view. The complexity of this operation can be looked at as O(N*M), where N and M are the number of ranges in the base table and the view affected by the corruption, respectively. Without an index in place, finding the missing data is very expensive. We potentially have to do it several times on each node, depending on the size of the range. Smaller ranges increase the size of the board exponentially; larger ranges increase the number of SSTables that would be involved in each compaction.
>> > >>
>> > >> Then you send that data over to the view, and the view does its anti-compaction thing, again, once per affected range. So now the view has to do an anti-compaction once per block on the board that's affected by the missing data.
>> > >>
>> > >> Doing hundreds or thousands of these will add up pretty quickly.
>> > >>
>> > >> When I said that a repair could take months, this is what I had in mind.
>> > >>
>> > >> On Tue, Jun 3, 2025 at 11:10 AM Blake Eggleston <bl...@ultrablake.com> wrote:
>> > >>> > Adds overhead in the hot path due to maintaining indexes. Extra memory needed during write path and compaction.
>> > >>>
>> > >>> I’d make the same argument about the overhead of maintaining the index that Jon just made about the disk space required. The relatively predictable overhead of maintaining the index as part of the write and compaction paths is a pro, not a con. Although you’re not always paying the cost of building a merkle tree with snapshot repair, it can impact the hot path and you do have to plan for it.
>> > >>>
>> > >>> > Verifies index content, not actual data—may miss low-probability errors like bit flips
>> > >>>
>> > >>> Presumably this could be handled by the views performing repair against each other? You could also periodically rebuild the index or perform checksums against the sstable content.
>> > >>>
>> > >>> > Extra data scan during inconsistency detection
>> > >>> > Index: Since the data covered by certain indexes is not guaranteed to be fully contained within a single node as the topology changes, some data scans may be wasted.
>> > >>> > Snapshots: No extra data scan
>> > >>>
>> > >>> This isn’t really the whole story.
>> > >>> The amount of wasted scans on index repairs is negligible. If a difference is detected with snapshot repairs though, you have to read the entire partition from both the view and base table to calculate what needs to be fixed.
>> > >>>
>> > >>> On Tue, Jun 3, 2025, at 10:27 AM, Jon Haddad wrote:
>> > >>>> One practical aspect that isn't immediately obvious is the disk space consideration for snapshots.
>> > >>>>
>> > >>>> When you have a table with a mixed workload using LCS or UCS with scaling parameters like L10 and initiate a repair, the disk usage will increase as long as the snapshot persists and the table continues to receive writes. This aspect is understood and factored into the design.
>> > >>>>
>> > >>>> However, a more nuanced point is the necessity to maintain sufficient disk headroom specifically for running repairs. This echoes the challenge with STCS compaction, where enough space must be available to accommodate the largest SSTables, even when they are not being actively compacted.
>> > >>>>
>> > >>>> For example, if a repair involves rewriting 100GB of SSTable data, you'll consistently need to reserve 100GB of free space to facilitate this.
>> > >>>>
>> > >>>> Therefore, while the snapshot-based approach leads to variable disk space utilization, operators must provision storage as if the maximum potential space will be used at all times to ensure repairs can be executed.
>> > >>>>
>> > >>>> This introduces a rate-of-churn dynamic, where the write throughput dictates the required extra disk space, rather than the existing on-disk data volume.
>> > >>>>
>> > >>>> If 50% of your SSTables are rewritten during a snapshot, you would need 50% free disk space. Depending on the workload, the snapshot method could consume significantly more disk space than an index-based approach. Conversely, for relatively static workloads, the index method might require more space. It's not as straightforward as stating "No extra disk space needed".
>> > >>>>
>> > >>>> Jon
>> > >>>>
>> > >>>> On Mon, Jun 2, 2025 at 2:49 PM Runtian Liu <curly...@gmail.com> wrote:
>> > >>>>> > Regarding your comparison between approaches, I think you also need to take into account the other dimensions that have been brought up in this thread. Things like minimum repair times and vulnerability to outages and topology changes are the first that come to mind.
>> > >>>>>
>> > >>>>> Sure, I added a few more points.
>> > >>>>>
>> > >>>>> 1. Hot path overhead
>> > >>>>>    - Index-based: Adds overhead in the hot path due to maintaining indexes. Extra memory needed during the write path and compaction.
>> > >>>>>    - Snapshot-based: No impact on the hot path.
>> > >>>>> 2. Extra disk usage when repair is not running
>> > >>>>>    - Index-based: Requires additional disk space to store persistent indexes.
>> > >>>>>    - Snapshot-based: No extra disk space needed.
>> > >>>>> 3. Extra disk usage during repair
>> > >>>>>    - Index-based: Minimal or no additional disk usage.
>> > >>>>>    - Snapshot-based: Requires additional disk space for snapshots.
>> > >>>>> 4. Fine-grained repair to deal with emergency situations / topology changes
>> > >>>>>    - Index-based: Supports fine-grained repairs by targeting specific index ranges. This allows repair to be retried on smaller data sets, enabling incremental progress when repairing the entire table. This is especially helpful when there are down nodes or topology changes during repair, which are common in day-to-day operations.
>> > >>>>>    - Snapshot-based: Coordination across all nodes is required over a long period of time. For each round of repair, if all replica nodes are down or if there is a topology change, the data ranges that were not covered will need to be repaired in the next round.
>> > >>>>> 5. Validating data used in reads directly
>> > >>>>>    - Index-based: Verifies index content, not actual data—may miss low-probability errors like bit flips.
>> > >>>>>    - Snapshot-based: Verifies actual data content, providing stronger correctness guarantees.
>> > >>>>> 6. Extra data scan during inconsistency detection
>> > >>>>>    - Index-based: Since the data covered by certain indexes is not guaranteed to be fully contained within a single node as the topology changes, some data scans may be wasted.
>> > >>>>>    - Snapshot-based: No extra data scan.
>> > >>>>> 7. The overhead of actual data repair after an inconsistency is detected
>> > >>>>>    - Index-based: Only indexes are streamed to the base table node, and the actual data being fixed can be as accurate as the row level.
>> > >>>>>    - Snapshot-based: Anti-compaction is needed on the MV table, and potential over-streaming may occur due to the lack of row-level insight into data quality.
>> > >>>>>
>> > >>>>> > one of my biggest concerns I haven't seen discussed much is LOCAL_SERIAL/SERIAL on read
>> > >>>>>
>> > >>>>> Paxos v2 introduces an optimization where serial reads can be completed in just one round trip, reducing latency compared to traditional Paxos, which may require multiple phases.
>> > >>>>>
>> > >>>>> > I think a refresh would be low-cost and give users the flexibility to run them however they want.
>> > >>>>>
>> > >>>>> I think this is an interesting idea. Does it suggest that the MV should be rebuilt on a regular schedule? It sounds like an extension of the snapshot-based approach—rather than detecting mismatches, we would periodically reconstruct a clean version of the MV based on the snapshot. This seems to diverge from the current MV model in Cassandra, where consistency between the MV and base table must be maintained continuously. This could be an extension of the CEP-48 work, where the MV is periodically rebuilt from a snapshot of the base table, assuming the user can tolerate some level of staleness in the MV data.
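To put Jon’s earlier point about disk headroom in concrete terms, here’s a rough sketch of the provisioning math. The rewrite fraction and sizes are illustrative; the actual numbers depend on workload and compaction strategy.

    # Rough sketch: free disk space an operator must keep available so a
    # snapshot-based repair can run, per Jon's example in the quoted thread.
    # The rewrite fraction is workload-dependent and illustrative only.
    def required_headroom_gb(on_disk_gb: float, rewrite_fraction: float) -> float:
        # Space consumed by sstables rewritten while the snapshot pins the old copies.
        return on_disk_gb * rewrite_fraction

    print(required_headroom_gb(200.0, 0.5))   # 100.0 GB headroom if half the data churns
    print(required_headroom_gb(200.0, 0.05))  # 10.0 GB for a relatively static workload

The point being that this headroom has to be provisioned for the worst case at all times, whether or not a repair is currently running.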