One last thing. I'm pretty sure building the tree requires the keys be added in token order: https://github.com/apache/cassandra/blob/08946652434edbce38a6395e71d4068898ea13fa/src/java/org/apache/cassandra/repair/Validator.java#L173
Which definitely introduces a bit of a problem, given that the tree would be
constructed from the transformed v1, which is a value unpredictable enough to
be considered random. The only way I can think of to address this would be to
maintain a local index on v1. See my previous email where I mentioned this.

Base Table -> Local Index -> Global Index

Still a really hard problem.

Jon

On Thu, May 15, 2025 at 6:12 PM Jon Haddad <j...@rustyrazorblade.com> wrote:

> There's a lot here that's still confusing to me. Maybe you can help me
> understand it better? Apologies in advance for the text wall :)
>
> I'll use this schema as an example:
>
> ---------
> CREATE TABLE test.t1 (
>     id int PRIMARY KEY,
>     v1 int
> );
>
> create MATERIALIZED VIEW test_mv as
> SELECT v1, id from test.t1 where id is not null and v1 is not null
> primary key (v1, id);
> ---------
>
> We've got (id, v1) in the base table and (v1, id) in the MV.
>
> During the repair, we snapshot, and construct a whole bunch of merkle
> trees. CEP-48 says they will be persisted to disk.
>
> * Do you intend on building all the Merkle trees in parallel?
> * Will there be hundreds of files doing random IO to persist the trees
>   to disk, in addition to the sequential IO from repair?
> * Is the intention of persisting the trees to disk to recover from
>   failure, or just to limit memory usage?
> * Have you calculated the Merkle tree space requirements?
> * When do we build the Merkle trees for the view? Is that happening in
>   parallel with the base table? Do we have the computational complexity
>   of 2 full cluster repairs running simultaneously, or does it take
>   twice as long?
>
> I'm very curious to hear if anyone has run a full cluster repair
> recently on a non-trivial dataset. Every cluster I work with only does
> subrange repair. I can't even recall the last time I did a full repair
> on a large cluster. I may never have, now that I think about it.
> Every time I've done this in the past it's been plagued with issues,
> both in terms of performance and reliability. Subrange repair works
> because it can make progress in 15-30 minute increments.
>
> Anyways - moving on...
>
> You suggest we read the base table and construct the Merkle trees based
> on the transformed rows. Using my schema above, we take the v1 field
> and use token(v1) to build the tree. Assuming that a value for v1
> appears many times throughout the dataset across many partitions, how
> do you intend on calculating its hash? If you look at Validator.rowHash
> [1] and Validator.add, you'll see it's taking an UnfilteredRowIterator
> for an entire partition and calculates the hash based on that. Here's
> the comment:
>
> /**
>  * Called (in order) for every row present in the CF.
>  * Hashes the row, and adds it to the tree being built.
>  *
>  * @param partition Partition to add hash
>  */
> public void add(UnfilteredRowIterator partition)
>
> So it seems to me like you need to have the entire partition
> materialized in memory before adding to the tree. Doing that per value
> v1 without an index is pretty much impossible - we'd have to scan the
> entire dataset once per partition to pull out all the matching v1
> values, or you'd need to materialize the entire dataset into a local
> version of the MV for that range. I don't know how you could do this.
> Do you have a workaround for this planned? Maybe someone that knows the
> Merkle tree code better can chime in.
>
> Maybe there's something else here I'm not aware of - please let me know
> what I'm missing here if I am, it would be great to see this in the doc
> if you have a solution.
>
> For the sake of discussion, let's assume we've moved past this and we
> have our trees for hundreds of ranges built from the base table & built
> for the MV, now we move onto the comparison.
>
> In the doc at this point, we delete the snapshot because we have the
> tree structures and we compare Merkle trees.
> Then we stream mismatched data.
>
> So let's say we find a mismatch in a hash. That indicates that there's
> some range of data where we have an issue. For some token range
> calculated from the v1 field, we have a mismatch, right? What do we do
> with that information?
>
> * Do we tell the node that owned the base table - hey, stream the data
>   from base where token(v1) is in range [X,Y) to me?
> * That means we have to scan through the base again for all rows where
>   token(v1) in [X,Y) range, right? Because without an index on the
>   hashes of v1, we're doing a full table scan and hashing every v1
>   value to find out if it needs to be streamed back to the MV.
> * Are we doing this concurrently on all nodes?
> * Will there be coordination between all nodes in the cluster to ensure
>   you don't have to do multiple scans?
>
> I realize there are a lot of questions here, but unfortunately I'm
> having a hard time seeing how we can work around some of the core
> assumptions around constructing Merkle trees and using them to resolve
> the differences in a way that matches up with what's in the doc. I have
> quite a few more things to discuss, but I'll save them for a follow up
> once all these have been sorted out.
>
> Thanks in advance!
> Jon
>
> [1]
> https://github.com/apache/cassandra/blob/08946652434edbce38a6395e71d4068898ea13fa/src/java/org/apache/cassandra/repair/Validator.java#L209
>
> On Thu, May 15, 2025 at 10:10 AM Runtian Liu <curly...@gmail.com> wrote:
>
>> The previous table compared the complexity of full repair and MV
>> repair when reconciling one dataset with another. In production, we
>> typically use a replication factor of 3 in one datacenter. This means
>> full repair involves 3n rows, while MV repair involves comparing 6n
>> rows (base + MV). Below is an updated comparison table reflecting this
>> scenario.
>> n: number of rows to repair (total rows in the table)
>> d: depth of one Merkle tree for MV repair
>> r: number of split ranges
>> p: data compacted away
>>
>> This comparison focuses on the complexities of one round of full
>> repair with a replication factor of 3 versus repairing a single MV
>> based on one base table with replication factor 3.
>>
>>                                        Full Repair   MV Repair
>> Extra disk used                        0             O(2*p)
>> Data scan complexity                   O(3*n)        O(6*n)
>> Merkle tree building time complexity   O(3*n)        O(6*n*d)
>> Total Merkle tree count                O(3*r)        O(6*r^2)
>> Merkle tree comparison complexity      O(3*n)        O(3*n)
>> Stream time complexity                 O(3*n)        O(3*n)
>>
>> Extra disk used: since we take a snapshot at the beginning of the
>> repair, any disk space that would normally be freed by compaction will
>> remain occupied until the Merkle trees are successfully built and the
>> snapshot is cleared.
>>
>> Data scan complexity: full repair scans n rows from the primary and 2n
>> from replicas. MV repair scans 3n rows from the base table and 3n from
>> the MV.
>>
>> Merkle tree building: in full repair, Merkle tree building is O(1) per
>> row—each hash is added sequentially to the leaf nodes. In MV repair,
>> each hash is inserted from the root, making it O(d) per row. Since d
>> is typically small (less than 20 and often smaller than in full
>> repair), this isn’t a major concern.
>>
>> Total Merkle tree count: MV repair needs to generate more, smaller
>> Merkle trees, but this isn’t a concern as they can be persisted to
>> disk during the repair process.
>>
>> Merkle tree comparison and stream time: assuming one row maps to one
>> leaf node and all rows need to be streamed, both repairs are
>> equivalent.
>>
>> In short: even for production use cases having RF=3 in one data
>> center, we can see that the MV repair consumes temporary disk space
>> and a small, usually negligible amount of extra CPU for tree
>> construction; other costs match full repair.
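To make the "O(1) per row vs O(d) per row" distinction in the table above concrete, here is a minimal toy sketch. It is not Cassandra's MerkleTree class; `ToyMerkleTree`, `token`, and every parameter are invented for illustration. Rows arriving in token order can be folded into the current leaf, while rows keyed by an unpredictable view token must be routed from the root, paying one step per tree level:

```python
import hashlib

def token(key, bits=16):
    # Toy stand-in for a partitioner: hash the key into a 16-bit token space.
    return int.from_bytes(hashlib.md5(str(key).encode()).digest()[:2], "big")

class ToyMerkleTree:
    """Fixed-depth tree over the token space; each leaf holds the XOR of
    the row hashes that fell into its token range."""

    def __init__(self, depth, bits=16):
        self.depth = depth
        self.bits = bits
        self.leaves = [0] * (1 << depth)
        self.route_steps = 0  # per-row work counter, for the comparison below

    def add_in_token_order(self, row_hash, leaf_cursor):
        # Full-repair style: rows arrive sorted by token, so each hash is
        # folded straight into the current leaf -- O(1) per row.
        self.route_steps += 1
        self.leaves[leaf_cursor] ^= row_hash

    def add_from_root(self, key, row_hash):
        # MV-repair style: token(view key) is unpredictable, so each row is
        # routed from the root one level at a time -- O(depth) per row.
        t, idx = token(key, self.bits), 0
        for level in range(self.depth):
            bit = (t >> (self.bits - 1 - level)) & 1
            idx = idx * 2 + bit
            self.route_steps += 1
        self.leaves[idx] ^= row_hash

tree = ToyMerkleTree(depth=10)
for k in range(1000):
    tree.add_from_root(k, hash(("v1", k)) & 0xFFFFFFFF)
print(tree.route_steps)  # 1000 rows * depth 10 = 10000 routing steps
```

With depth 10, the root-routed path does 10 units of work per row where the sequential path would do 1, which is the O(n*d) versus O(n) difference claimed in the table.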
>> Additionally, with the online path proposed in this CEP, we expect
>> mismatches to be rare, which can lower the frequency of running this
>> repair process compared to full repair.
>>
>> On Thu, May 15, 2025 at 9:53 AM Jon Haddad <j...@rustyrazorblade.com>
>> wrote:
>>
>>> > They are not two unordered sets, but rather two sets ordered by
>>> > different keys.
>>>
>>> I think this is a distinction without a difference. Merkle tree repair
>>> works because the ordering of the data is mostly the same across
>>> nodes.
>>>
>>> On Thu, May 15, 2025 at 9:27 AM Runtian Liu <curly...@gmail.com> wrote:
>>>
>>>> > what we're trying to achieve here is comparing two massive
>>>> > unordered sets.
>>>>
>>>> They are not two unordered sets, but rather two sets ordered by
>>>> different keys. This means that when building Merkle trees for the
>>>> base table and the materialized view (MV), we need to use different
>>>> strategies to ensure the trees can be meaningfully compared.
>>>>
>>>> To address scalability concerns for MV repair, I’ve included a
>>>> comparison between one round of full repair and MV repair in the
>>>> table below. This comparison is also added to the CEP.
>>>>
>>>> n: number of rows to repair (total rows in the table)
>>>> d: depth of one Merkle tree for MV repair
>>>> r: number of split ranges
>>>> p: data compacted away
>>>>
>>>> This comparison focuses on the complexities of one round of full
>>>> repair with a replication factor of 2 versus repairing a single MV
>>>> based on one base table replica.
>>>>
>>>>                                        Full Repair   MV Repair
>>>> Extra disk used                        0             O(2*p)
>>>> Data scan complexity                   O(2*n)        O(2*n)
>>>> Merkle tree building time complexity   O(n)          O(n*d)
>>>> Total Merkle tree count                O(2*r)        O(2*r^2)
>>>> Merkle tree comparison complexity      O(n)          O(n)
>>>> Stream time complexity                 O(n)          O(n)
>>>>
>>>> Extra disk used: since we take a snapshot at the beginning of the
>>>> repair, any disk space that would normally be freed by compaction
>>>> will remain occupied until the Merkle trees are successfully built
>>>> and the snapshot is cleared.
>>>>
>>>> Data scan complexity: full repair scans n rows from the primary and
>>>> n from replicas. MV repair scans n rows from the base table primary
>>>> replica only, and n from the MV primary replica only.
>>>>
>>>> Merkle tree building: in full repair, Merkle tree building is O(1)
>>>> per row—each hash is added sequentially to the leaf nodes. In MV
>>>> repair, each hash is inserted from the root, making it O(d) per row.
>>>> Since d is typically small (less than 20 and often smaller than in
>>>> full repair), this isn’t a major concern.
>>>>
>>>> Total Merkle tree count: MV repair needs to generate more, smaller
>>>> Merkle trees, but this isn’t a concern as they can be persisted to
>>>> disk during the repair process.
>>>>
>>>> Merkle tree comparison and stream time: assuming one row maps to one
>>>> leaf node and all rows need to be streamed, both repairs are
>>>> equivalent.
>>>>
>>>> In short: MV repair consumes temporary disk space and a small,
>>>> usually negligible amount of extra CPU for tree construction; other
>>>> costs match full repair.
>>>>
>>>> The core idea behind the proposed MV repair is as follows:
>>>>
>>>> 1. Take a snapshot to “freeze” the current state of both the base
>>>>    table and its MV.
>>>> 2. Gradually scan the data from both tables to build Merkle trees.
>>>> 3. Identify the token ranges where inconsistencies exist.
>>>> 4. Rebuild only the mismatched ranges rather than the entire MV.
>>>>
>>>> With transaction-backed MVs, step 4 should rarely be necessary.
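The four steps above can be sketched end to end on toy data. Everything here is hypothetical: a toy token bucketing function, one XOR'd "leaf" per split range instead of a real tree, and invented helper names. It only shows the shape of snapshot, build, compare, rebuild, not the CEP's actual implementation:

```python
from collections import defaultdict
import hashlib

def row_hash(row):
    # Toy row digest; Cassandra hashes rows quite differently.
    return int(hashlib.sha256(repr(row).encode()).hexdigest(), 16)

def range_of(key, num_ranges):
    # Toy token bucketing: assign a key to one of num_ranges split ranges.
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % num_ranges

def build_leaves(rows, key_fn, num_ranges):
    # Step 2: fold row hashes into one XOR "leaf" per split range.
    leaves = defaultdict(int)
    for row in rows:
        leaves[range_of(key_fn(row), num_ranges)] ^= row_hash(row)
    return leaves

# Step 1: "freeze" both tables (a snapshot; here, plain copies).
base = [(pk, ck, pk * 4 + ck) for pk in range(4) for ck in range(4)]
view = [(ck, pk, v) for (pk, ck, v) in base]   # MV keyed by ck
view.remove((2, 3, 14))                        # simulate a lost view row

# Hash rows in base form on both sides, bucketed by the *view* key (ck),
# so the two sets of leaves are directly comparable.
base_leaves = build_leaves(base, key_fn=lambda r: r[1], num_ranges=8)
view_leaves = build_leaves([(pk, ck, v) for (ck, pk, v) in view],
                           key_fn=lambda r: r[1], num_ranges=8)

# Step 3: identify the token ranges where inconsistencies exist.
mismatched = [rng for rng in range(8) if base_leaves[rng] != view_leaves[rng]]

# Step 4: rebuild only the mismatched ranges from the base table.
to_restream = [r for r in base if range_of(r[1], 8) in mismatched]
print(mismatched, len(to_restream))
```

Note that step 4 still has to rescan the base table to find the rows whose view key falls in a mismatched range, which is exactly the full-table-scan concern raised earlier in the thread.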
>>>> On Thu, May 15, 2025 at 7:54 AM Josh McKenzie <jmcken...@apache.org>
>>>> wrote:
>>>>
>>>>> I think in order to address this, the view should be propagated to
>>>>> the base replicas *after* it's accepted by all or a majority of
>>>>> base replicas. This is where I think mutation tracking could
>>>>> probably help.
>>>>>
>>>>> Yeah, the idea of "don't reflect in the MV until you hit the CL the
>>>>> user requested for the base table". Introduces disjoint risk if you
>>>>> have coordinator death mid-write where replicas got base-data but
>>>>> that 2nd step didn't take place; think that's why Runtian et al.
>>>>> are looking at paxos repair picking up those pieces for you after
>>>>> the fact to get you back into consistency. Mutation tracking and
>>>>> Accord both have similar guarantees in this space.
>>>>>
>>>>> I think this would ensure that as long as there's no data loss or
>>>>> bit-rot, the base and view can be repaired independently. When
>>>>> there is data loss or bit-rot in either the base table or the view,
>>>>> then it is the same as 2i today: rebuild is required.
>>>>>
>>>>> And the repair as proposed in the CEP should resolve the bit-rot
>>>>> and bug-induced data loss cases I think. Certainly has much higher
>>>>> time complexity but the bounding of memory complexity to be
>>>>> comparable with regular repair doesn't strike me as a dealbreaker.
>>>>>
>>>>> On Thu, May 15, 2025, at 10:24 AM, Paulo Motta wrote:
>>>>>
>>>>> > I think requiring a rebuild is a deal breaker for most teams. In
>>>>> > most instances it would be having to also expand the cluster to
>>>>> > handle the additional disk requirements. It turns an
>>>>> > inconsistency problem into a major operational headache that can
>>>>> > take weeks to resolve.
>>>>>
>>>>> Agreed. The rebuild would not be required during normal operations
>>>>> when the cluster is properly maintained (i.e. regular repair) -
>>>>> only in catastrophic situations.
This is also the case for ordinary tables >>>>> currently: if there's data loss, then restoring from a backup is needed. >>>>> This could be a possible alternative to not require a rebuild in this >>>>> extraordinary scenario. >>>>> >>>>> On Thu, May 15, 2025 at 10:14 AM Jon Haddad <j...@rustyrazorblade.com> >>>>> wrote: >>>>> >>>>> I think requiring a rebuild is a deal breaker for most teams. In most >>>>> instances it would be having to also expand the cluster to handle the >>>>> additional disk requirements. It turns an inconsistency problem into a >>>>> major operational headache that can take weeks to resolve. >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> On Thu, May 15, 2025 at 7:02 AM Paulo Motta <pauloricard...@gmail.com> >>>>> wrote: >>>>> >>>>> > There's bi-directional entropy issues with MV's - either orphaned >>>>> view data or missing view data; that's why you kind of need a >>>>> "bi-directional ETL" to make sure the 2 agree with each other. While >>>>> normal >>>>> repair would resolve the "missing data in MV" case, it wouldn't resolve >>>>> the >>>>> "data in MV that's not in base table anymore" case, which afaict all base >>>>> consistency approaches (status quo, PaxosV2, Accord, Mutation Tracking) >>>>> are >>>>> vulnerable to. >>>>> >>>>> I don't think that bi-directional reconciliation should be a >>>>> requirement, when the base table is assumed to be the source of truth as >>>>> stated in the CEP doc. >>>>> >>>>> I think the main issue with the current MV implementation is that each >>>>> view replica is independently replicated by the base replica, before the >>>>> base write is acknowledged. 
>>>>> This creates a correctness issue in the write path, because a view
>>>>> update can be created for a write that was not accepted by the
>>>>> coordinator in the following scenario:
>>>>>
>>>>> N=RF=3
>>>>> CL=ONE
>>>>> - Update U is propagated to view replica V; the coordinator, which
>>>>> is also base replica B, dies before acknowledging the base table
>>>>> write request to the client. Now U exists in V but not in B.
>>>>>
>>>>> I think in order to address this, the view should be propagated to
>>>>> the base replicas *after* it's accepted by all or a majority of
>>>>> base replicas. This is where I think mutation tracking could
>>>>> probably help.
>>>>>
>>>>> I think this would ensure that as long as there's no data loss or
>>>>> bit-rot, the base and view can be repaired independently. When
>>>>> there is data loss or bit-rot in either the base table or the view,
>>>>> then it is the same as 2i today: rebuild is required.
>>>>>
>>>>> > It'd be correct (if operationally disappointing) to be able to
>>>>> > just say "if you have data loss in your base table you need to
>>>>> > rebuild the corresponding MV's", but the problem is operators
>>>>> > aren't always going to know when that data loss occurs. Not
>>>>> > everything is as visible as a lost quorum of replicas or blown up
>>>>> > SSTables.
>>>>>
>>>>> I think there are opportunities to improve rebuild speed, assuming
>>>>> the base table as a source of truth. For example, rebuild only
>>>>> subranges when data-loss is detected.
>>>>>
>>>>> On Thu, May 15, 2025 at 8:07 AM Josh McKenzie <jmcken...@apache.org>
>>>>> wrote:
>>>>>
>>>>> There's bi-directional entropy issues with MV's - either orphaned
>>>>> view data or missing view data; that's why you kind of need a
>>>>> "bi-directional ETL" to make sure the 2 agree with each other.
While normal repair would >>>>> resolve the "missing data in MV" case, it wouldn't resolve the "data in MV >>>>> that's not in base table anymore" case, which afaict all base consistency >>>>> approaches (status quo, PaxosV2, Accord, Mutation Tracking) are vulnerable >>>>> to. >>>>> >>>>> It'd be correct (if operationally disappointing) to be able to just >>>>> say "if you have data loss in your base table you need to rebuild the >>>>> corresponding MV's", but the problem is operators aren't always going to >>>>> know when that data loss occurs. Not everything is as visible as a lost >>>>> quorum of replicas or blown up SSTables. >>>>> >>>>> On Wed, May 14, 2025, at 2:38 PM, Blake Eggleston wrote: >>>>> >>>>> Maybe, I’m not really familiar enough with how “classic” MV repair >>>>> works to say. You can’t mix normal repair and mutation reconciliation in >>>>> the current incarnation of mutation tracking though, so I wouldn’t assume >>>>> it would work with MVs. >>>>> >>>>> On Wed, May 14, 2025, at 11:29 AM, Jon Haddad wrote: >>>>> >>>>> In the case of bitrot / losing an SSTable, wouldn't a normal repair >>>>> (just the MV against the other nodes) resolve the issue? >>>>> >>>>> On Wed, May 14, 2025 at 11:27 AM Blake Eggleston <bl...@ultrablake.com> >>>>> wrote: >>>>> >>>>> >>>>> Mutation tracking is definitely an approach you could take for MVs. >>>>> Mutation reconciliation could be extended to ensure all changes have been >>>>> replicated to the views. When a base table received a mutation w/ an id it >>>>> would generate a view update. If you block marking a given mutation id as >>>>> reconciled until it’s been fully replicated to the base table and its view >>>>> updates have been fully replicated to the views, then all view updates >>>>> will >>>>> eventually be applied as part of the log reconciliation process. 
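Blake's gating rule (block marking a mutation id as reconciled until the base write and its derived view updates are both fully replicated) can be sketched as a small bookkeeping structure. This is a hand-rolled illustration of the invariant only; the class and method names are invented and this is not the CEP-45 design or any real Cassandra API:

```python
class MutationTracker:
    """Toy sketch: a mutation id counts as reconciled only once the base
    write AND its derived view updates are both fully replicated."""

    def __init__(self, base_replicas, view_replicas):
        self.base_needed = base_replicas
        self.view_needed = view_replicas
        self.base_acks = {}   # mutation id -> set of base replicas that acked
        self.view_acks = {}   # mutation id -> set of view replicas that acked

    def ack_base(self, mutation_id, replica):
        self.base_acks.setdefault(mutation_id, set()).add(replica)

    def ack_view(self, mutation_id, replica):
        self.view_acks.setdefault(mutation_id, set()).add(replica)

    def is_reconciled(self, mutation_id):
        # Log reconciliation must not advance past this id until both hold.
        return (len(self.base_acks.get(mutation_id, ())) >= self.base_needed
                and len(self.view_acks.get(mutation_id, ())) >= self.view_needed)

t = MutationTracker(base_replicas=3, view_replicas=3)
for r in ("b1", "b2", "b3"):
    t.ack_base("m1", r)
t.ack_view("m1", "v1")
print(t.is_reconciled("m1"))  # False: view updates not yet fully replicated
t.ack_view("m1", "v2")
t.ack_view("m1", "v3")
print(t.is_reconciled("m1"))  # True: reconciliation can advance past m1
```

The point of the sketch is the ordering guarantee: an id held open until both conditions are met means every view update is eventually applied by the reconciliation process, without a separate MV repair in normal operation.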
>>>>> >>>>> A mutation tracking implementation would also allow you to be more >>>>> flexible with the types of consistency levels you can work with, allowing >>>>> users to do things like use LOCAL_QUORUM without leaving themselves open >>>>> to >>>>> introducing view inconsistencies. >>>>> >>>>> That would more or less eliminate the need for any MV repair in normal >>>>> usage, but wouldn't address how to repair issues caused by bugs or data >>>>> loss, though you may be able to do something with comparing the latest >>>>> mutation ids for the base tables and its view ranges. >>>>> >>>>> On Wed, May 14, 2025, at 10:19 AM, Paulo Motta wrote: >>>>> >>>>> I don't see mutation tracking [1] mentioned in this thread or in the >>>>> CEP-48 description. Not sure this would fit into the scope of this >>>>> initial CEP, but I have a feeling that mutation tracking could be >>>>> potentially helpful to reconcile base tables and views ? >>>>> >>>>> For example, when both base and view updates are acknowledged then >>>>> this could be somehow persisted in the view sstables mutation tracking >>>>> summary[2] or similar metadata ? Then these updates would be skipped >>>>> during >>>>> view repair, considerably reducing the amount of work needed, since only >>>>> un-acknowledged views updates would need to be reconciled. >>>>> >>>>> [1] - >>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking| >>>>> <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking%7C> >>>>> [2] - https://issues.apache.org/jira/browse/CASSANDRA-20336 >>>>> >>>>> On Wed, May 14, 2025 at 12:59 PM Paulo Motta <pauloricard...@gmail.com> >>>>> wrote: >>>>> >>>>> > - The first thing I notice is that we're talking about repairing the >>>>> entire table across the entire cluster all in one go. It's been a *long* >>>>> time since I tried to do a full repair of an entire table without using >>>>> sub-ranges. 
Is anyone here even doing that with clusters of non-trivial >>>>> size? How long does a full repair of a 100 node cluster with 5TB / node >>>>> take even in the best case scenario? >>>>> >>>>> I haven't checked the CEP yet so I may be missing out something but I >>>>> think this effort doesn't need to be conflated with dense node support, to >>>>> make this more approachable. I think prospective users would be OK with >>>>> overprovisioning to make this feasible if needed. We could perhaps have >>>>> size guardrails that limit the maximum table size per node when MVs are >>>>> enabled. Ideally we should make it work for dense nodes if possible, but >>>>> this shouldn't be a reason not to support the feature if it can be made to >>>>> work reasonably with more resources. >>>>> >>>>> I think the main issue with the current MV is about correctness, and >>>>> the ultimate goal of the CEP must be to provide correctness guarantees, >>>>> even if it has an inevitable performance hit. I think that the performance >>>>> of the repair process is definitely an important consideration and it >>>>> would >>>>> be helpful to have some benchmarks to have an idea of how long this repair >>>>> process would take for lightweight and denser tables. >>>>> >>>>> On Wed, May 14, 2025 at 7:28 AM Jon Haddad <j...@rustyrazorblade.com> >>>>> wrote: >>>>> >>>>> I've got several concerns around this repair process. >>>>> >>>>> - The first thing I notice is that we're talking about repairing the >>>>> entire table across the entire cluster all in one go. It's been a *long* >>>>> time since I tried to do a full repair of an entire table without using >>>>> sub-ranges. Is anyone here even doing that with clusters of non trivial >>>>> size? How long does a full repair of a 100 node cluster with 5TB / node >>>>> take even in the best case scenario? 
>>>>> - Even in a scenario where sub-range repair is supported, you'd
>>>>> have to scan *every* sstable on the base table in order to
>>>>> construct a merkle tree, as we don't know in advance which SSTables
>>>>> contain the ranges that the MV will need. That means a subrange
>>>>> repair would have to do a *ton* of IO. Anyone who's mis-configured
>>>>> a sub-range incremental repair to use too many ranges will probably
>>>>> be familiar with how long it can take to anti-compact a bunch of
>>>>> SSTables. With MV sub-range repair, we'd have even more overhead,
>>>>> because we'd have to read in every SSTable, every time. If we do 10
>>>>> subranges, we'll do 10x the IO of a normal repair. I don't think
>>>>> this is practical.
>>>>>
>>>>> - Merkle trees make sense when you're comparing tables with the
>>>>> same partition key, but I don't think they do when you're
>>>>> transforming a base table to a view. When there's a mis-match,
>>>>> what's transferred? We have a range of data in the MV, but now we
>>>>> have to go find that from the base table. That means the merkle
>>>>> tree needs to not just track the hashes and ranges, but the
>>>>> original keys it was transformed from, in order to go find all of
>>>>> the matching partitions in that mis-matched range. Either that or
>>>>> we end up rescanning the entire dataset in order to find the
>>>>> mismatches.
>>>>>
>>>>> Jon
>>>>>
>>>>> On Tue, May 13, 2025 at 10:29 AM Runtian Liu <curly...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> > Looking at the details of the CEP it seems to describe Paxos as
>>>>> > PaxosV1, but PaxosV2 works slightly differently (it can read
>>>>> > during the prepare phase). I assume that supporting Paxos means
>>>>> > supporting both V1 and V2 for materialized views?
>>>>>
>>>>> We are going to support Paxos V2. The CEP is not clear on that; we
>>>>> will update it to clarify.
>>>>>
>>>>> It looks like the online portion is now fairly well understood.
For >>>>> the offline repair part, I see two main concerns: one around the >>>>> scalability of the proposed approach, and another regarding how it handles >>>>> tombstones. >>>>> >>>>> *Scalability:* >>>>> I have added a *section* >>>>> <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-48%3A+First-Class+Materialized+View+Support#CEP48:FirstClassMaterializedViewSupport-MVRepairVSFullRepairwithanExample> >>>>> in the CEP with an example to compare full repair and the proposed MV >>>>> repair, the overall scalability should not be a problem. >>>>> >>>>> Consider a dataset with tokens from 1 to 4 and a cluster of 4 nodes, >>>>> where each node owns one token. The base table uses (pk, ck) as its >>>>> primary >>>>> key, while the materialized view (MV) uses (ck, pk) as its primary key. >>>>> Both tables include a value column v, which allows us to correlate rows >>>>> between them. The dataset consists of 16 records, distributed as follows: >>>>> >>>>> >>>>> *Base table* >>>>> (pk, ck, v) >>>>> (1, 1, 1), (1, 2, 2), (1, 3, 3), (1, 4, 4) // N1 >>>>> (2, 1, 5), (2, 2, 6), (2, 3, 7), (2, 4, 8) // N2 >>>>> (3, 1, 9), (3, 2, 10), (3, 3, 11), (3, 4, 12) // N3 >>>>> (4, 1, 13), (4, 2, 14), (4, 3, 15), (4, 4, 16) // N4 >>>>> >>>>> >>>>> >>>>> *Materialized view* >>>>> (ck, pk, v) >>>>> (1, 1, 1), (1, 2, 5), (1, 3, 9), (1, 4, 13) // N1 >>>>> (2, 1, 2), (2, 2, 6), (2, 3, 10), (2, 4, 14) // N2 >>>>> (3, 1, 3), (3, 2, 7), (3, 3, 11), (3, 4, 15) // N3 >>>>> (4, 1, 4), (4, 2, 8), (4, 3, 12), (4, 4, 16) // N4 >>>>> >>>>> >>>>> The chart below compares one round of full repair with one round of MV >>>>> repair. As shown, both scan the same total number of rows. However, MV >>>>> repair has higher time complexity because its Merkle tree processes each >>>>> row more intensively. To avoid all nodes scanning the entire table >>>>> simultaneously, MV repair should use a snapshot-based approach, similar to >>>>> normal repair with the --sequential option. 
>>>>> The time complexity increase compared to full repair can be found
>>>>> in the "Complexity and Memory Management" section.
>>>>>
>>>>> n: number of rows
>>>>> d: depth of one Merkle tree for MV repair
>>>>> d': depth of one Merkle tree for full repair
>>>>> r: number of split ranges
>>>>>
>>>>> Assuming one leaf node covers the same number of rows,
>>>>> 2^d' = (2^d) * r.
>>>>>
>>>>> We can see that the space complexity is the same, while MV repair
>>>>> has higher time complexity. However, this should not pose a
>>>>> significant issue in production, as the Merkle tree depth and the
>>>>> number of split ranges are typically not large.
>>>>>
>>>>> 1 Round Merkle Tree Building Complexity
>>>>>                      Full Repair    MV Repair
>>>>> Time complexity      O(n)           O(n*d*log(r))
>>>>> Space complexity     O((2^d')*r)    O((2^d)*r^2) = O((2^d')*r)
>>>>>
>>>>> *Tombstone:*
>>>>>
>>>>> The current proposal focuses on rebuilding the MV for a granular
>>>>> token range where a mismatch is detected, rather than rebuilding
>>>>> the entire MV token range. Since the MV is treated as a regular
>>>>> table, standard full or incremental repair processes should still
>>>>> apply to both the base and MV tables to keep their replicas in
>>>>> sync.
>>>>>
>>>>> Regarding tombstones, if we introduce special tombstone types or
>>>>> handling mechanisms for the MV table, we may be able to support
>>>>> tombstone synchronization between the base table and the MV. I plan
>>>>> to spend more time exploring whether we can introduce changes to
>>>>> the base table that enable this synchronization.
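Going back to the 4-node example earlier in this message, the regrouping between base and MV can be checked mechanically, and it also shows why a mismatch in one view range can implicate rows from every base partition. A toy Python sketch, with a dict layout invented purely for illustration:

```python
# Rebuild the 4x4 example: base keyed by (pk, ck), MV keyed by (ck, pk).
base = {}
v = 0
for pk in range(1, 5):
    base[pk] = []
    for ck in range(1, 5):
        v += 1
        base[pk].append((pk, ck, v))

# Derive the MV by regrouping the very same rows by ck.
mv = {}
for rows in base.values():
    for (pk, ck, val) in rows:
        mv.setdefault(ck, []).append((ck, pk, val))

print(mv[1])  # [(1, 1, 1), (1, 2, 5), (1, 3, 9), (1, 4, 13)] -- the N1 MV rows

# Every base partition contributes a row to *every* MV partition, so a
# mismatched view range can require rows from all base partitions/ranges.
fanout = {pk: {ck for (_, ck, _) in base[pk]} for pk in base}
print(fanout[1] == {1, 2, 3, 4})  # True
```

The derived `mv[1]` matches the (1, 1, 1), (1, 2, 5), (1, 3, 9), (1, 4, 13) row listed for N1 in the example, and the fan-out set confirms that each base partition touches all four MV partitions.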
>>>>> On Mon, May 12, 2025 at 11:35 AM Jaydeep Chovatia <
>>>>> chovatia.jayd...@gmail.com> wrote:
>>>>>
>>>>> > Like something doesn't add up here because if it always includes
>>>>> > the base table's primary key columns that means
>>>>>
>>>>> The requirement for materialized views (MVs) to include the base
>>>>> table's primary key appears to be primarily a syntactic constraint
>>>>> specific to Apache Cassandra. For instance, in DynamoDB, the DDL
>>>>> for defining a Global Secondary Index does not mandate inclusion of
>>>>> the base table's primary key. This suggests that the syntax
>>>>> requirement in Cassandra could potentially be relaxed in the future
>>>>> (outside the scope of this CEP). As Benedict noted, the base
>>>>> table's primary key is optional when querying a materialized view.
>>>>>
>>>>> Jaydeep
>>>>>
>>>>> On Mon, May 12, 2025 at 10:45 AM Jon Haddad <j...@rustyrazorblade.com>
>>>>> wrote:
>>>>>
>>>>> > Or compaction hasn’t made a mistake, or cell merge reconciliation
>>>>> > hasn’t made a mistake, or volume bitrot hasn’t caused you to lose
>>>>> > a file.
>>>>> > Repair isn’t just about “have all transaction commits landed”.
>>>>> > It’s “is the data correct N days after it’s written”.
>>>>>
>>>>> Don't forget about restoring from a backup.
>>>>>
>>>>> Is there a way we could do some sort of hybrid compaction +
>>>>> incremental repair? Maybe have the MV verify its view while it's
>>>>> compacting, and when it's done, mark the view's SSTable as
>>>>> repaired? Then the repair process would only need to do an MV to MV
>>>>> repair.
>>>>> >>>>> Jon >>>>> >>>>> >>>>> On Mon, May 12, 2025 at 9:37 AM Benedict Elliott Smith < >>>>> bened...@apache.org> wrote: >>>>> >>>>> Like something doesn't add up here because if it always includes the >>>>> base table's primary key columns that means they could be storage attached >>>>> by just forbidding additional columns and there doesn't seem to be much >>>>> utility in including additional columns in the primary key? >>>>> >>>>> >>>>> You can re-order the keys, and they only need to be a part of the >>>>> primary key not the partition key. I think you can specify an arbitrary >>>>> order to the keys also, so you can change the effective sort order. So, >>>>> the >>>>> basic idea is you stipulate something like PRIMARY KEY ((v1),(ck1,pk1)). >>>>> >>>>> This is basically a global index, with the restriction on single >>>>> columns as keys only because we cannot cheaply read-before-write for >>>>> eventually consistent operations. This restriction can easily be relaxed >>>>> for Paxos and Accord based implementations, which can also safely include >>>>> additional keys. >>>>> >>>>> That said, I am not at all sure why they are called materialised views >>>>> if we don’t support including any other data besides the lookup column and >>>>> the primary key. We should really rename them once they work, both to make >>>>> some sense and to break with the historical baggage. >>>>> >>>>> I think this can be represented as a tombstone which can always be >>>>> fetched from the base table on read or maybe some other arrangement? I >>>>> agree it can't feasibly be represented as an enumeration of the deletions >>>>> at least not synchronously and doing it async has its own problems. >>>>> >>>>> If the base table must be read on read of an index/view, then I think >>>>> this proposal is approximately linearizable for the view as well (though, >>>>> I >>>>> do not at all warrant this statement). 
You still need to propagate this >>>>> eventually so that the views can cleanup. This also makes reads 2RT on >>>>> read, which is rather costly. >>>>> >>>>> On 12 May 2025, at 16:10, Ariel Weisberg <ar...@weisberg.ws> wrote: >>>>> >>>>> Hi, >>>>> >>>>> I think it's worth taking a step back and looking at the current MV >>>>> restrictions which are pretty onerous. >>>>> >>>>> A view must have a primary key and that primary key must conform to >>>>> the following restrictions: >>>>> >>>>> - it must contain all the primary key columns of the base table. >>>>> This ensures that every row of the view correspond to exactly one row >>>>> of >>>>> the base table. >>>>> - it can only contain a single column that is not a primary key >>>>> column in the base table. >>>>> >>>>> At that point what exactly is the value in including anything except >>>>> the original primary key in the MV's primary key columns unless you are >>>>> using an ordered partitioner so you can iterate based on the leading >>>>> primary key columns? >>>>> >>>>> Like something doesn't add up here because if it always includes the >>>>> base table's primary key columns that means they could be storage attached >>>>> by just forbidding additional columns and there doesn't seem to be much >>>>> utility in including additional columns in the primary key? >>>>> >>>>> I'm not that clear on how much better it is to look something up in >>>>> the MV vs just looking at the base table or some non-materialized view of >>>>> it. How exactly are these MVs supposed to be used and what value do they >>>>> provide? >>>>> >>>>> Jeff Jirsa wrote: >>>>> >>>>> There’s 2 things in this proposal that give me a lot of pause. >>>>> >>>>> >>>>> Runtian Liu pointed out that the CEP is sort of divided into two >>>>> parts. The first is the online part which is making reads/writes to MVs >>>>> safer and more reliable using a transaction system. The second is offline >>>>> which is repair. 
>>>>> The story for the online portion is, I think, quite strong and worth considering on its own merits.
>>>>>
>>>>> The offline portion (repair) sounds a little less feasible to run in production, but I also think that MVs without any mechanism for checking their consistency are not viable to run in production. So it's kind of pay-for-what-you-use in terms of the feature?
>>>>>
>>>>> It's definitely worth thinking through whether there is a way to fix one side of this equation so it works better.
>>>>>
>>>>> David Capwell wrote:
>>>>>
>>>>> As far as I can tell, being based off Accord means you don't need to care about repair, as Accord will manage the consistency for you; you can't get out of sync.
>>>>>
>>>>> I think a baseline requirement in C* for something to be in production is to be able to run preview repair and validate that the transaction system, or any other part of Cassandra, hasn't made a mistake. Divergence can have many sources, including Accord.
>>>>>
>>>>> Runtian Liu wrote:
>>>>>
>>>>> For the example David mentioned, LWT cannot support it. Since LWTs operate on a single token, we'll need to restrict base-table updates to one partition (and ideally one row) at a time. A current MV base-table command can delete an entire partition, but doing so might touch hundreds of MV partitions, making consistency guarantees impossible.
>>>>>
>>>>> I think this can be represented as a tombstone which can always be fetched from the base table on read, or maybe some other arrangement? I agree it can't feasibly be represented as an enumeration of the deletions, at least not synchronously, and doing it async has its own problems.
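Runtian's fan-out concern is easy to quantify with a toy count (the numbers below are made up for illustration, not taken from the CEP):

```python
# Sketch of why a single base-partition delete is hard for MVs: deleting
# a wide base partition must remove one MV row per base row, and those
# rows land in as many distinct MV partitions as there are distinct
# values of the view's partition key column. Purely illustrative data.

# 1,000 rows in one base partition; v1 takes 100 distinct values.
base_partition = [(ck, ck % 100) for ck in range(1000)]  # (clustering, v1)

mv_partitions_touched = len({v1 for _ck, v1 in base_partition})
# One partition-level tombstone on the base side becomes per-partition
# deletes scattered across the view side of the cluster.
```

With LWTs confined to a single token, there is no cheap way to make those hundred scattered view deletes atomic with the one base delete, which is the restriction Runtian describes.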
>>>>> Ariel
>>>>>
>>>>> On Fri, May 9, 2025, at 4:03 PM, Jeff Jirsa wrote:
>>>>>
>>>>> On May 9, 2025, at 12:59 PM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>>>>
>>>>> I am a *big* fan of getting repair really working with MVs. It does seem problematic that the number of Merkle trees will be equal to the number of ranges in the cluster, and that repair of MVs would become an all-node operation. How would down nodes be handled, and how many nodes would simultaneously be working to validate a given base table range at once? How many base table ranges could simultaneously be repairing MVs?
>>>>>
>>>>> If a row containing a column that creates an MV partition is deleted, and the MV isn't updated, then how does the Merkle tree approach propagate the deletion to the MV? The CEP says that anti-compaction would remove extra rows, but I am not clear on how that works. When is anti-compaction performed in the repair process, and what is/isn't included in the outputs?
>>>>>
>>>>> I thought about these two points last night after I sent my email.
>>>>>
>>>>> There are two things in this proposal that give me a lot of pause.
>>>>>
>>>>> One is the lack of tombstones/deletions in the Merkle trees, which makes properly dealing with writes/deletes/inconsistency very hard (afaict).
>>>>>
>>>>> The second is the reality that repairing a single partition in the base table may repair all hosts/ranges in the MV table, and vice versa. Basically, scanning either base or MV is effectively scanning the whole cluster (modulo what you can avoid in the clean/dirty repaired sets). This makes me really, really concerned with how it scales, and how likely it is to be able to schedule automatically without blowing up.
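Jeff's first concern (no tombstones in the trees) can be illustrated with a toy range hash standing in for a Merkle-tree leaf. This is a deliberate simplification of real Merkle validation, with made-up row data:

```python
import hashlib

def range_hash(rows):
    """Toy stand-in for one Merkle-tree leaf: a hash over the live rows
    in a token range. Real validation builds a tree of such hashes."""
    h = hashlib.sha256()
    for row in sorted(rows):
        h.update(repr(row).encode())
    return h.hexdigest()

# (v1, id) rows the live base table currently implies for the view,
# versus what the view actually holds after a missed deletion.
base_derived = {(10, 1), (20, 3)}
view_actual = {(10, 1), (10, 2), (20, 3)}  # (10, 2) is stale

stale = view_actual - base_derived
mismatch = range_hash(base_derived) != range_hash(view_actual)
# The hashes diverge, but a hash over live rows alone cannot say whether
# the fix is to add a row on one side or delete a row on the other.
# Without deletions represented in the tree, repair needs a separate
# mechanism (the CEP's anti-compaction) to drop rows like (10, 2).
```

This is also why two view replicas that both hold the stale row would compare as identical: replica-to-replica Merkle comparison of the view never sees the deletion at all.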
>>>>> The Paxos vs. Accord comments so far are interesting in that I think both could be made to work, but I am very concerned about how the Merkle tree comparisons are likely to work with wide partitions leading to massive fanout in ranges.