One last thing. I'm pretty sure building the tree requires the keys be added in token order: https://github.com/apache/cassandra/blob/08946652434edbce38a6395e71d4068898ea13fa/src/java/org/apache/cassandra/repair/Validator.java#L173
Which definitely introduces a bit of a problem, given that the tree would be
constructed from the transformed v1, which is a value unpredictable enough to
be considered random. The only way I can think of to address this would be to
maintain a local index on v1. See my previous email where I mentioned this.

Base Table -> Local Index -> Global Index

Still a really hard problem.

Jon

On Thu, May 15, 2025 at 6:12 PM Jon Haddad <j...@rustyrazorblade.com> wrote:

> There's a lot here that's still confusing to me. Maybe you can help me
> understand it better? Apologies in advance for the text wall :)
>
> I'll use this schema as an example:
>
> ---------
> CREATE TABLE test.t1 (
>     id int PRIMARY KEY,
>     v1 int
> );
>
> create MATERIALIZED VIEW test_mv as
> SELECT v1, id from test.t1 where id is not null and v1 is not null
> primary key (v1, id);
> ---------
>
> We've got (id, v1) in the base table and (v1, id) in the MV.
>
> During the repair, we snapshot, and construct a whole bunch of merkle
> trees. CEP-48 says they will be persisted to disk.
>
> * Do you intend on building all the Merkle trees in parallel?
> * Will there be hundreds of files doing random IO to persist the trees
>   to disk, in addition to the sequential IO from repair?
> * Is the intention of persisting the trees to disk to recover from
>   failure, or just to limit memory usage?
> * Have you calculated the Merkle tree space requirements?
> * When do we build the Merkle trees for the view? Is that happening in
>   parallel with the base table? Do we have the computational complexity
>   of 2 full cluster repairs running simultaneously, or does it take
>   twice as long?
>
> I'm very curious to hear if anyone has run a full cluster repair
> recently on a non-trivial dataset. Every cluster I work with only does
> subrange repair. I can't even recall the last time I did a full repair
> on a large cluster. I may never have, now that I think about it.
> Every time I've done this in the past it's been plagued with issues,
> both in terms of performance and reliability. Subrange repair works
> because it can make progress in 15-30 minute increments.
>
> Anyways - moving on...
>
> You suggest we read the base table and construct the Merkle trees based
> on the transformed rows. Using my schema above, we take the v1 field
> and use token(v1) to build the tree. Assuming that a value for v1
> appears many times throughout the dataset across many partitions, how
> do you intend on calculating its hash? If you look at Validator.rowHash
> [1] and Validator.add, you'll see it's taking an UnfilteredRowIterator
> for an entire partition and calculates the hash based on that. Here's
> the comment:
>
> /**
>  * Called (in order) for every row present in the CF.
>  * Hashes the row, and adds it to the tree being built.
>  *
>  * @param partition Partition to add hash
>  */
> public void add(UnfilteredRowIterator partition)
>
> So it seems to me like you need to have the entire partition
> materialized in memory before adding to the tree. Doing that per value
> v1 without an index is pretty much impossible - we'd have to scan the
> entire dataset once per partition to pull out all the matching v1
> values, or you'd need to materialize the entire dataset into a local
> version of the MV for that range. I don't know how you could do this.
> Do you have a workaround for this planned? Maybe someone that knows the
> Merkle tree code better can chime in.
>
> Maybe there's something else here I'm not aware of - please let me know
> what I'm missing here if I am, it would be great to see this in the doc
> if you have a solution.
>
> For the sake of discussion, let's assume we've moved past this and we
> have our trees for hundreds of ranges built from the base table & built
> for the MV, now we move onto the comparison.
>
> In the doc at this point, we delete the snapshot because we have the
> tree structures and we compare Merkle trees.
> Then we stream mismatched data.
>
> So let's say we find a mismatch in a hash. That indicates that there's
> some range of data where we have an issue. For some token range
> calculated from the v1 field, we have a mismatch, right? What do we do
> with that information?
>
> * Do we tell the node that owned the base table - hey, stream the data
>   from base where token(v1) is in range [X,Y) to me?
> * That means we have to scan through the base again for all rows where
>   token(v1) in [X,Y) range, right? Because without an index on the
>   hashes of v1, we're doing a full table scan and hashing every v1
>   value to find out if it needs to be streamed back to the MV.
> * Are we doing this concurrently on all nodes?
> * Will there be coordination between all nodes in the cluster to ensure
>   you don't have to do multiple scans?
>
> I realize there are a lot of questions here, but unfortunately I'm
> having a hard time seeing how we can work around some of the core
> assumptions around constructing Merkle trees and using them to resolve
> the differences in a way that matches up with what's in the doc. I have
> quite a few more things to discuss, but I'll save them for a follow up
> once all these have been sorted out.
>
> Thanks in advance!
> Jon
>
> [1]
> https://github.com/apache/cassandra/blob/08946652434edbce38a6395e71d4068898ea13fa/src/java/org/apache/cassandra/repair/Validator.java#L209
>
> On Thu, May 15, 2025 at 10:10 AM Runtian Liu <curly...@gmail.com> wrote:
>
>> The previous table compared the complexity of full repair and MV
>> repair when reconciling one dataset with another. In production, we
>> typically use a replication factor of 3 in one datacenter. This means
>> full repair involves 3n rows, while MV repair involves comparing 6n
>> rows (base + MV). Below is an updated comparison table reflecting this
>> scenario.
>> n: number of rows to repair (total rows in the table)
>> d: depth of one Merkle tree for MV repair
>> r: number of split ranges
>> p: data compacted away
>>
>> This comparison focuses on the complexities of one round of full
>> repair with a replication factor of 3 versus repairing a single MV
>> based on one base table with replication factor 3.
>>
>>                                        Full Repair   MV Repair
>> Extra disk used                        0             O(2*p)
>> Data scan complexity                   O(3*n)        O(6*n)
>> Merkle tree building time complexity   O(3*n)        O(6*n*d)
>> Total Merkle tree count                O(3*r)        O(6*r^2)
>> Merkle tree comparison complexity      O(3*n)        O(3*n)
>> Stream time complexity                 O(3*n)        O(3*n)
>>
>> Extra disk used: since we take a snapshot at the beginning of the
>> repair, any disk space that would normally be freed by compaction will
>> remain occupied until the Merkle trees are successfully built and the
>> snapshot is cleared.
>>
>> Data scan complexity: full repair scans n rows from the primary and 2n
>> from replicas. MV repair scans 3n rows from the base table and 3n from
>> the MV.
>>
>> Merkle tree building: in full repair, Merkle tree building is O(1) per
>> row—each hash is added sequentially to the leaf nodes. In MV repair,
>> each hash is inserted from the root, making it O(d) per row. Since d
>> is typically small (less than 20 and often smaller than in full
>> repair), this isn’t a major concern.
>>
>> Total Merkle tree count: MV repair needs to generate more, smaller
>> Merkle trees, but this isn’t a concern as they can be persisted to
>> disk during the repair process.
>>
>> Merkle tree comparison and stream time: assuming one row maps to one
>> leaf node and all rows need to be streamed, both repairs are
>> equivalent.
>>
>> In short: even for production use cases having RF=3 in one data
>> center, we can see that the MV repair consumes temporary disk space
>> and a small, usually negligible amount of extra CPU for tree
>> construction; other costs match full repair.
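To make the "O(1) per row vs O(d) per row" distinction in the table above concrete, here is a minimal toy sketch. It is not Cassandra's MerkleTree class; `ToyMerkleTree`, `token`, and every parameter are invented for illustration. Rows arriving in token order can be folded into the current leaf, while rows keyed by an unpredictable view token must be routed from the root, paying one step per tree level:

```python
import hashlib

def token(key, bits=16):
    # Toy stand-in for a partitioner: hash the key into a 16-bit token space.
    return int.from_bytes(hashlib.md5(str(key).encode()).digest()[:2], "big")

class ToyMerkleTree:
    """Fixed-depth tree over the token space; each leaf holds the XOR of
    the row hashes that fell into its token range."""

    def __init__(self, depth, bits=16):
        self.depth = depth
        self.bits = bits
        self.leaves = [0] * (1 << depth)
        self.route_steps = 0  # per-row work counter, for the comparison below

    def add_in_token_order(self, row_hash, leaf_cursor):
        # Full-repair style: rows arrive sorted by token, so each hash is
        # folded straight into the current leaf -- O(1) per row.
        self.route_steps += 1
        self.leaves[leaf_cursor] ^= row_hash

    def add_from_root(self, key, row_hash):
        # MV-repair style: token(view key) is unpredictable, so each row is
        # routed from the root one level at a time -- O(depth) per row.
        t, idx = token(key, self.bits), 0
        for level in range(self.depth):
            bit = (t >> (self.bits - 1 - level)) & 1
            idx = idx * 2 + bit
            self.route_steps += 1
        self.leaves[idx] ^= row_hash

tree = ToyMerkleTree(depth=10)
for k in range(1000):
    tree.add_from_root(k, hash(("v1", k)) & 0xFFFFFFFF)
print(tree.route_steps)  # 1000 rows * depth 10 = 10000 routing steps
```

With depth 10, the root-routed path does 10 units of work per row where the sequential path would do 1, which is the O(n*d) versus O(n) difference claimed in the table.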
>> Additionally, with the online path proposed in this CEP, we expect
>> mismatches to be rare, which can lower the frequency of running this
>> repair process compared to full repair.
>>
>> On Thu, May 15, 2025 at 9:53 AM Jon Haddad <j...@rustyrazorblade.com>
>> wrote:
>>
>>> > They are not two unordered sets, but rather two sets ordered by
>>> > different keys.
>>>
>>> I think this is a distinction without a difference. Merkle tree repair
>>> works because the ordering of the data is mostly the same across
>>> nodes.
>>>
>>> On Thu, May 15, 2025 at 9:27 AM Runtian Liu <curly...@gmail.com> wrote:
>>>
>>>> > what we're trying to achieve here is comparing two massive
>>>> > unordered sets.
>>>>
>>>> They are not two unordered sets, but rather two sets ordered by
>>>> different keys. This means that when building Merkle trees for the
>>>> base table and the materialized view (MV), we need to use different
>>>> strategies to ensure the trees can be meaningfully compared.
>>>>
>>>> To address scalability concerns for MV repair, I’ve included a
>>>> comparison between one round of full repair and MV repair in the
>>>> table below. This comparison is also added to the CEP.
>>>>
>>>> n: number of rows to repair (total rows in the table)
>>>> d: depth of one Merkle tree for MV repair
>>>> r: number of split ranges
>>>> p: data compacted away
>>>>
>>>> This comparison focuses on the complexities of one round of full
>>>> repair with a replication factor of 2 versus repairing a single MV
>>>> based on one base table replica.
>>>>
>>>>                                        Full Repair   MV Repair
>>>> Extra disk used                        0             O(2*p)
>>>> Data scan complexity                   O(2*n)        O(2*n)
>>>> Merkle tree building time complexity   O(n)          O(n*d)
>>>> Total Merkle tree count                O(2*r)        O(2*r^2)
>>>> Merkle tree comparison complexity      O(n)          O(n)
>>>> Stream time complexity                 O(n)          O(n)
>>>>
>>>> Extra disk used: since we take a snapshot at the beginning of the
>>>> repair, any disk space that would normally be freed by compaction
>>>> will remain occupied until the Merkle trees are successfully built
>>>> and the snapshot is cleared.
>>>>
>>>> Data scan complexity: full repair scans n rows from the primary and
>>>> n from replicas. MV repair scans n rows from the base table primary
>>>> replica only, and n from the MV primary replica only.
>>>>
>>>> Merkle tree building: in full repair, Merkle tree building is O(1)
>>>> per row—each hash is added sequentially to the leaf nodes. In MV
>>>> repair, each hash is inserted from the root, making it O(d) per row.
>>>> Since d is typically small (less than 20 and often smaller than in
>>>> full repair), this isn’t a major concern.
>>>>
>>>> Total Merkle tree count: MV repair needs to generate more, smaller
>>>> Merkle trees, but this isn’t a concern as they can be persisted to
>>>> disk during the repair process.
>>>>
>>>> Merkle tree comparison and stream time: assuming one row maps to one
>>>> leaf node and all rows need to be streamed, both repairs are
>>>> equivalent.
>>>>
>>>> In short: MV repair consumes temporary disk space and a small,
>>>> usually negligible amount of extra CPU for tree construction; other
>>>> costs match full repair.
>>>>
>>>> The core idea behind the proposed MV repair is as follows:
>>>>
>>>> 1. Take a snapshot to “freeze” the current state of both the base
>>>>    table and its MV.
>>>> 2. Gradually scan the data from both tables to build Merkle trees.
>>>> 3. Identify the token ranges where inconsistencies exist.
>>>> 4. Rebuild only the mismatched ranges rather than the entire MV.
>>>>
>>>> With transaction-backed MVs, step 4 should rarely be necessary.
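The four steps above can be sketched end to end on toy data. Everything here is hypothetical: a toy token bucketing function, one XOR'd "leaf" per split range instead of a real tree, and invented helper names. It only shows the shape of snapshot, build, compare, rebuild, not the CEP's actual implementation:

```python
from collections import defaultdict
import hashlib

def row_hash(row):
    # Toy row digest; Cassandra hashes rows quite differently.
    return int(hashlib.sha256(repr(row).encode()).hexdigest(), 16)

def range_of(key, num_ranges):
    # Toy token bucketing: assign a key to one of num_ranges split ranges.
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % num_ranges

def build_leaves(rows, key_fn, num_ranges):
    # Step 2: fold row hashes into one XOR "leaf" per split range.
    leaves = defaultdict(int)
    for row in rows:
        leaves[range_of(key_fn(row), num_ranges)] ^= row_hash(row)
    return leaves

# Step 1: "freeze" both tables (a snapshot; here, plain copies).
base = [(pk, ck, pk * 4 + ck) for pk in range(4) for ck in range(4)]
view = [(ck, pk, v) for (pk, ck, v) in base]   # MV keyed by ck
view.remove((2, 3, 14))                        # simulate a lost view row

# Hash rows in base form on both sides, bucketed by the *view* key (ck),
# so the two sets of leaves are directly comparable.
base_leaves = build_leaves(base, key_fn=lambda r: r[1], num_ranges=8)
view_leaves = build_leaves([(pk, ck, v) for (ck, pk, v) in view],
                           key_fn=lambda r: r[1], num_ranges=8)

# Step 3: identify the token ranges where inconsistencies exist.
mismatched = [rng for rng in range(8) if base_leaves[rng] != view_leaves[rng]]

# Step 4: rebuild only the mismatched ranges from the base table.
to_restream = [r for r in base if range_of(r[1], 8) in mismatched]
print(mismatched, len(to_restream))
```

Note that step 4 still has to rescan the base table to find the rows whose view key falls in a mismatched range, which is exactly the full-table-scan concern raised earlier in the thread.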
>>>> On Thu, May 15, 2025 at 7:54 AM Josh McKenzie <jmcken...@apache.org>
>>>> wrote:
>>>>
>>>>> I think in order to address this, the view should be propagated to
>>>>> the base replicas *after* it's accepted by all or a majority of
>>>>> base replicas. This is where I think mutation tracking could
>>>>> probably help.
>>>>>
>>>>> Yeah, the idea of "don't reflect in the MV until you hit the CL the
>>>>> user requested for the base table". Introduces disjoint risk if you
>>>>> have coordinator death mid-write where replicas got base-data but
>>>>> that 2nd step didn't take place; think that's why Runtian et al.
>>>>> are looking at paxos repair picking up those pieces for you after
>>>>> the fact to get you back into consistency. Mutation tracking and
>>>>> Accord both have similar guarantees in this space.
>>>>>
>>>>> I think this would ensure that as long as there's no data loss or
>>>>> bit-rot, the base and view can be repaired independently. When
>>>>> there is data loss or bit-rot in either the base table or the view,
>>>>> then it is the same as 2i today: rebuild is required.
>>>>>
>>>>> And the repair as proposed in the CEP should resolve the bit-rot
>>>>> and bug-induced data loss cases I think. Certainly has much higher
>>>>> time complexity but the bounding of memory complexity to be
>>>>> comparable with regular repair doesn't strike me as a dealbreaker.
>>>>>
>>>>> On Thu, May 15, 2025, at 10:24 AM, Paulo Motta wrote:
>>>>>
>>>>> > I think requiring a rebuild is a deal breaker for most teams. In
>>>>> > most instances it would be having to also expand the cluster to
>>>>> > handle the additional disk requirements. It turns an
>>>>> > inconsistency problem into a major operational headache that can
>>>>> > take weeks to resolve.
>>>>>
>>>>> Agreed. The rebuild would not be required during normal operations
>>>>> when the cluster is properly maintained (i.e. regular repair) -
>>>>> only in catastrophic situations.
This is also the case for ordinary tables >>>>> currently: if there's data loss, then restoring from a backup is needed. >>>>> This could be a possible alternative to not require a rebuild in this >>>>> extraordinary scenario. >>>>> >>>>> On Thu, May 15, 2025 at 10:14 AM Jon Haddad <j...@rustyrazorblade.com> >>>>> wrote: >>>>> >>>>> I think requiring a rebuild is a deal breaker for most teams. In most >>>>> instances it would be having to also expand the cluster to handle the >>>>> additional disk requirements. It turns an inconsistency problem into a >>>>> major operational headache that can take weeks to resolve. >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> On Thu, May 15, 2025 at 7:02 AM Paulo Motta <pauloricard...@gmail.com> >>>>> wrote: >>>>> >>>>> > There's bi-directional entropy issues with MV's - either orphaned >>>>> view data or missing view data; that's why you kind of need a >>>>> "bi-directional ETL" to make sure the 2 agree with each other. While >>>>> normal >>>>> repair would resolve the "missing data in MV" case, it wouldn't resolve >>>>> the >>>>> "data in MV that's not in base table anymore" case, which afaict all base >>>>> consistency approaches (status quo, PaxosV2, Accord, Mutation Tracking) >>>>> are >>>>> vulnerable to. >>>>> >>>>> I don't think that bi-directional reconciliation should be a >>>>> requirement, when the base table is assumed to be the source of truth as >>>>> stated in the CEP doc. >>>>> >>>>> I think the main issue with the current MV implementation is that each >>>>> view replica is independently replicated by the base replica, before the >>>>> base write is acknowledged. 
>>>>> This creates a correctness issue in the write path, because a view
>>>>> update can be created for a write that was not accepted by the
>>>>> coordinator in the following scenario:
>>>>>
>>>>> N=RF=3
>>>>> CL=ONE
>>>>> - Update U is propagated to view replica V; the coordinator, which
>>>>> is also base replica B, dies before acknowledging the base table
>>>>> write request to the client. Now U exists in V but not in B.
>>>>>
>>>>> I think in order to address this, the view should be propagated to
>>>>> the base replicas *after* it's accepted by all or a majority of
>>>>> base replicas. This is where I think mutation tracking could
>>>>> probably help.
>>>>>
>>>>> I think this would ensure that as long as there's no data loss or
>>>>> bit-rot, the base and view can be repaired independently. When
>>>>> there is data loss or bit-rot in either the base table or the view,
>>>>> then it is the same as 2i today: rebuild is required.
>>>>>
>>>>> > It'd be correct (if operationally disappointing) to be able to
>>>>> > just say "if you have data loss in your base table you need to
>>>>> > rebuild the corresponding MV's", but the problem is operators
>>>>> > aren't always going to know when that data loss occurs. Not
>>>>> > everything is as visible as a lost quorum of replicas or blown up
>>>>> > SSTables.
>>>>>
>>>>> I think there are opportunities to improve rebuild speed, assuming
>>>>> the base table as a source of truth. For example, rebuild only
>>>>> subranges when data-loss is detected.
>>>>>
>>>>> On Thu, May 15, 2025 at 8:07 AM Josh McKenzie <jmcken...@apache.org>
>>>>> wrote:
>>>>>
>>>>> There's bi-directional entropy issues with MV's - either orphaned
>>>>> view data or missing view data; that's why you kind of need a
>>>>> "bi-directional ETL" to make sure the 2 agree with each other.
While normal repair would >>>>> resolve the "missing data in MV" case, it wouldn't resolve the "data in MV >>>>> that's not in base table anymore" case, which afaict all base consistency >>>>> approaches (status quo, PaxosV2, Accord, Mutation Tracking) are vulnerable >>>>> to. >>>>> >>>>> It'd be correct (if operationally disappointing) to be able to just >>>>> say "if you have data loss in your base table you need to rebuild the >>>>> corresponding MV's", but the problem is operators aren't always going to >>>>> know when that data loss occurs. Not everything is as visible as a lost >>>>> quorum of replicas or blown up SSTables. >>>>> >>>>> On Wed, May 14, 2025, at 2:38 PM, Blake Eggleston wrote: >>>>> >>>>> Maybe, I’m not really familiar enough with how “classic” MV repair >>>>> works to say. You can’t mix normal repair and mutation reconciliation in >>>>> the current incarnation of mutation tracking though, so I wouldn’t assume >>>>> it would work with MVs. >>>>> >>>>> On Wed, May 14, 2025, at 11:29 AM, Jon Haddad wrote: >>>>> >>>>> In the case of bitrot / losing an SSTable, wouldn't a normal repair >>>>> (just the MV against the other nodes) resolve the issue? >>>>> >>>>> On Wed, May 14, 2025 at 11:27 AM Blake Eggleston <bl...@ultrablake.com> >>>>> wrote: >>>>> >>>>> >>>>> Mutation tracking is definitely an approach you could take for MVs. >>>>> Mutation reconciliation could be extended to ensure all changes have been >>>>> replicated to the views. When a base table received a mutation w/ an id it >>>>> would generate a view update. If you block marking a given mutation id as >>>>> reconciled until it’s been fully replicated to the base table and its view >>>>> updates have been fully replicated to the views, then all view updates >>>>> will >>>>> eventually be applied as part of the log reconciliation process. 
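Blake's gating rule (block marking a mutation id as reconciled until the base write and its derived view updates are both fully replicated) can be sketched as a small bookkeeping structure. This is a hand-rolled illustration of the invariant only; the class and method names are invented and this is not the CEP-45 design or any real Cassandra API:

```python
class MutationTracker:
    """Toy sketch: a mutation id counts as reconciled only once the base
    write AND its derived view updates are both fully replicated."""

    def __init__(self, base_replicas, view_replicas):
        self.base_needed = base_replicas
        self.view_needed = view_replicas
        self.base_acks = {}   # mutation id -> set of base replicas that acked
        self.view_acks = {}   # mutation id -> set of view replicas that acked

    def ack_base(self, mutation_id, replica):
        self.base_acks.setdefault(mutation_id, set()).add(replica)

    def ack_view(self, mutation_id, replica):
        self.view_acks.setdefault(mutation_id, set()).add(replica)

    def is_reconciled(self, mutation_id):
        # Log reconciliation must not advance past this id until both hold.
        return (len(self.base_acks.get(mutation_id, ())) >= self.base_needed
                and len(self.view_acks.get(mutation_id, ())) >= self.view_needed)

t = MutationTracker(base_replicas=3, view_replicas=3)
for r in ("b1", "b2", "b3"):
    t.ack_base("m1", r)
t.ack_view("m1", "v1")
print(t.is_reconciled("m1"))  # False: view updates not yet fully replicated
t.ack_view("m1", "v2")
t.ack_view("m1", "v3")
print(t.is_reconciled("m1"))  # True: reconciliation can advance past m1
```

The point of the sketch is the ordering guarantee: an id held open until both conditions are met means every view update is eventually applied by the reconciliation process, without a separate MV repair in normal operation.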
>>>>> >>>>> A mutation tracking implementation would also allow you to be more >>>>> flexible with the types of consistency levels you can work with, allowing >>>>> users to do things like use LOCAL_QUORUM without leaving themselves open >>>>> to >>>>> introducing view inconsistencies. >>>>> >>>>> That would more or less eliminate the need for any MV repair in normal >>>>> usage, but wouldn't address how to repair issues caused by bugs or data >>>>> loss, though you may be able to do something with comparing the latest >>>>> mutation ids for the base tables and its view ranges. >>>>> >>>>> On Wed, May 14, 2025, at 10:19 AM, Paulo Motta wrote: >>>>> >>>>> I don't see mutation tracking [1] mentioned in this thread or in the >>>>> CEP-48 description. Not sure this would fit into the scope of this >>>>> initial CEP, but I have a feeling that mutation tracking could be >>>>> potentially helpful to reconcile base tables and views ? >>>>> >>>>> For example, when both base and view updates are acknowledged then >>>>> this could be somehow persisted in the view sstables mutation tracking >>>>> summary[2] or similar metadata ? Then these updates would be skipped >>>>> during >>>>> view repair, considerably reducing the amount of work needed, since only >>>>> un-acknowledged views updates would need to be reconciled. >>>>> >>>>> [1] - >>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking| >>>>> <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking%7C> >>>>> [2] - https://issues.apache.org/jira/browse/CASSANDRA-20336 >>>>> >>>>> On Wed, May 14, 2025 at 12:59 PM Paulo Motta <pauloricard...@gmail.com> >>>>> wrote: >>>>> >>>>> > - The first thing I notice is that we're talking about repairing the >>>>> entire table across the entire cluster all in one go. It's been a *long* >>>>> time since I tried to do a full repair of an entire table without using >>>>> sub-ranges. 
Is anyone here even doing that with clusters of non-trivial >>>>> size? How long does a full repair of a 100 node cluster with 5TB / node >>>>> take even in the best case scenario? >>>>> >>>>> I haven't checked the CEP yet so I may be missing out something but I >>>>> think this effort doesn't need to be conflated with dense node support, to >>>>> make this more approachable. I think prospective users would be OK with >>>>> overprovisioning to make this feasible if needed. We could perhaps have >>>>> size guardrails that limit the maximum table size per node when MVs are >>>>> enabled. Ideally we should make it work for dense nodes if possible, but >>>>> this shouldn't be a reason not to support the feature if it can be made to >>>>> work reasonably with more resources. >>>>> >>>>> I think the main issue with the current MV is about correctness, and >>>>> the ultimate goal of the CEP must be to provide correctness guarantees, >>>>> even if it has an inevitable performance hit. I think that the performance >>>>> of the repair process is definitely an important consideration and it >>>>> would >>>>> be helpful to have some benchmarks to have an idea of how long this repair >>>>> process would take for lightweight and denser tables. >>>>> >>>>> On Wed, May 14, 2025 at 7:28 AM Jon Haddad <j...@rustyrazorblade.com> >>>>> wrote: >>>>> >>>>> I've got several concerns around this repair process. >>>>> >>>>> - The first thing I notice is that we're talking about repairing the >>>>> entire table across the entire cluster all in one go. It's been a *long* >>>>> time since I tried to do a full repair of an entire table without using >>>>> sub-ranges. Is anyone here even doing that with clusters of non trivial >>>>> size? How long does a full repair of a 100 node cluster with 5TB / node >>>>> take even in the best case scenario? 
>>>>> - Even in a scenario where sub-range repair is supported, you'd
>>>>> have to scan *every* sstable on the base table in order to
>>>>> construct a merkle tree, as we don't know in advance which SSTables
>>>>> contain the ranges that the MV will need. That means a subrange
>>>>> repair would have to do a *ton* of IO. Anyone who's mis-configured
>>>>> a sub-range incremental repair to use too many ranges will probably
>>>>> be familiar with how long it can take to anti-compact a bunch of
>>>>> SSTables. With MV sub-range repair, we'd have even more overhead,
>>>>> because we'd have to read in every SSTable, every time. If we do 10
>>>>> subranges, we'll do 10x the IO of a normal repair. I don't think
>>>>> this is practical.
>>>>>
>>>>> - Merkle trees make sense when you're comparing tables with the
>>>>> same partition key, but I don't think they do when you're
>>>>> transforming a base table to a view. When there's a mis-match,
>>>>> what's transferred? We have a range of data in the MV, but now we
>>>>> have to go find that from the base table. That means the merkle
>>>>> tree needs to not just track the hashes and ranges, but the
>>>>> original keys it was transformed from, in order to go find all of
>>>>> the matching partitions in that mis-matched range. Either that or
>>>>> we end up rescanning the entire dataset in order to find the
>>>>> mismatches.
>>>>>
>>>>> Jon
>>>>>
>>>>> On Tue, May 13, 2025 at 10:29 AM Runtian Liu <curly...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> > Looking at the details of the CEP it seems to describe Paxos as
>>>>> > PaxosV1, but PaxosV2 works slightly differently (it can read
>>>>> > during the prepare phase). I assume that supporting Paxos means
>>>>> > supporting both V1 and V2 for materialized views?
>>>>>
>>>>> We are going to support Paxos V2. The CEP is not clear on that; we
>>>>> will update it to clarify.
>>>>>
>>>>> It looks like the online portion is now fairly well understood.
For >>>>> the offline repair part, I see two main concerns: one around the >>>>> scalability of the proposed approach, and another regarding how it handles >>>>> tombstones. >>>>> >>>>> *Scalability:* >>>>> I have added a *section* >>>>> <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-48%3A+First-Class+Materialized+View+Support#CEP48:FirstClassMaterializedViewSupport-MVRepairVSFullRepairwithanExample> >>>>> in the CEP with an example to compare full repair and the proposed MV >>>>> repair, the overall scalability should not be a problem. >>>>> >>>>> Consider a dataset with tokens from 1 to 4 and a cluster of 4 nodes, >>>>> where each node owns one token. The base table uses (pk, ck) as its >>>>> primary >>>>> key, while the materialized view (MV) uses (ck, pk) as its primary key. >>>>> Both tables include a value column v, which allows us to correlate rows >>>>> between them. The dataset consists of 16 records, distributed as follows: >>>>> >>>>> >>>>> *Base table* >>>>> (pk, ck, v) >>>>> (1, 1, 1), (1, 2, 2), (1, 3, 3), (1, 4, 4) // N1 >>>>> (2, 1, 5), (2, 2, 6), (2, 3, 7), (2, 4, 8) // N2 >>>>> (3, 1, 9), (3, 2, 10), (3, 3, 11), (3, 4, 12) // N3 >>>>> (4, 1, 13), (4, 2, 14), (4, 3, 15), (4, 4, 16) // N4 >>>>> >>>>> >>>>> >>>>> *Materialized view* >>>>> (ck, pk, v) >>>>> (1, 1, 1), (1, 2, 5), (1, 3, 9), (1, 4, 13) // N1 >>>>> (2, 1, 2), (2, 2, 6), (2, 3, 10), (2, 4, 14) // N2 >>>>> (3, 1, 3), (3, 2, 7), (3, 3, 11), (3, 4, 15) // N3 >>>>> (4, 1, 4), (4, 2, 8), (4, 3, 12), (4, 4, 16) // N4 >>>>> >>>>> >>>>> The chart below compares one round of full repair with one round of MV >>>>> repair. As shown, both scan the same total number of rows. However, MV >>>>> repair has higher time complexity because its Merkle tree processes each >>>>> row more intensively. To avoid all nodes scanning the entire table >>>>> simultaneously, MV repair should use a snapshot-based approach, similar to >>>>> normal repair with the --sequential option. 
>>>>> The time complexity increase compared to full repair can be found
>>>>> in the "Complexity and Memory Management" section.
>>>>>
>>>>> n: number of rows
>>>>> d: depth of one Merkle tree for MV repair
>>>>> d': depth of one Merkle tree for full repair
>>>>> r: number of split ranges
>>>>>
>>>>> Assuming one leaf node covers the same number of rows,
>>>>> 2^d' = (2^d) * r.
>>>>>
>>>>> We can see that the space complexity is the same, while MV repair
>>>>> has higher time complexity. However, this should not pose a
>>>>> significant issue in production, as the Merkle tree depth and the
>>>>> number of split ranges are typically not large.
>>>>>
>>>>> 1 Round Merkle Tree Building Complexity
>>>>>                      Full Repair    MV Repair
>>>>> Time complexity      O(n)           O(n*d*log(r))
>>>>> Space complexity     O((2^d')*r)    O((2^d)*r^2) = O((2^d')*r)
>>>>>
>>>>> *Tombstone:*
>>>>>
>>>>> The current proposal focuses on rebuilding the MV for a granular
>>>>> token range where a mismatch is detected, rather than rebuilding
>>>>> the entire MV token range. Since the MV is treated as a regular
>>>>> table, standard full or incremental repair processes should still
>>>>> apply to both the base and MV tables to keep their replicas in
>>>>> sync.
>>>>>
>>>>> Regarding tombstones, if we introduce special tombstone types or
>>>>> handling mechanisms for the MV table, we may be able to support
>>>>> tombstone synchronization between the base table and the MV. I plan
>>>>> to spend more time exploring whether we can introduce changes to
>>>>> the base table that enable this synchronization.
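Going back to the 4-node example earlier in this message, the regrouping between base and MV can be checked mechanically, and it also shows why a mismatch in one view range can implicate rows from every base partition. A toy Python sketch, with a dict layout invented purely for illustration:

```python
# Rebuild the 4x4 example: base keyed by (pk, ck), MV keyed by (ck, pk).
base = {}
v = 0
for pk in range(1, 5):
    base[pk] = []
    for ck in range(1, 5):
        v += 1
        base[pk].append((pk, ck, v))

# Derive the MV by regrouping the very same rows by ck.
mv = {}
for rows in base.values():
    for (pk, ck, val) in rows:
        mv.setdefault(ck, []).append((ck, pk, val))

print(mv[1])  # [(1, 1, 1), (1, 2, 5), (1, 3, 9), (1, 4, 13)] -- the N1 MV rows

# Every base partition contributes a row to *every* MV partition, so a
# mismatched view range can require rows from all base partitions/ranges.
fanout = {pk: {ck for (_, ck, _) in base[pk]} for pk in base}
print(fanout[1] == {1, 2, 3, 4})  # True
```

The derived `mv[1]` matches the (1, 1, 1), (1, 2, 5), (1, 3, 9), (1, 4, 13) row listed for N1 in the example, and the fan-out set confirms that each base partition touches all four MV partitions.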
>>>>> On Mon, May 12, 2025 at 11:35 AM Jaydeep Chovatia <
>>>>> chovatia.jayd...@gmail.com> wrote:
>>>>>
>>>>> > Like something doesn't add up here because if it always includes
>>>>> > the base table's primary key columns that means
>>>>>
>>>>> The requirement for materialized views (MVs) to include the base
>>>>> table's primary key appears to be primarily a syntactic constraint
>>>>> specific to Apache Cassandra. For instance, in DynamoDB, the DDL
>>>>> for defining a Global Secondary Index does not mandate inclusion of
>>>>> the base table's primary key. This suggests that the syntax
>>>>> requirement in Cassandra could potentially be relaxed in the future
>>>>> (outside the scope of this CEP). As Benedict noted, the base
>>>>> table's primary key is optional when querying a materialized view.
>>>>>
>>>>> Jaydeep
>>>>>
>>>>> On Mon, May 12, 2025 at 10:45 AM Jon Haddad <j...@rustyrazorblade.com>
>>>>> wrote:
>>>>>
>>>>> > Or compaction hasn’t made a mistake, or cell merge reconciliation
>>>>> > hasn’t made a mistake, or volume bitrot hasn’t caused you to lose
>>>>> > a file.
>>>>> > Repair isn’t just about “have all transaction commits landed”.
>>>>> > It’s “is the data correct N days after it’s written”.
>>>>>
>>>>> Don't forget about restoring from a backup.
>>>>>
>>>>> Is there a way we could do some sort of hybrid compaction +
>>>>> incremental repair? Maybe have the MV verify its view while it's
>>>>> compacting, and when it's done, mark the view's SSTable as
>>>>> repaired? Then the repair process would only need to do an MV to MV
>>>>> repair.
>>>>> >>>>> Jon >>>>> >>>>> >>>>> On Mon, May 12, 2025 at 9:37 AM Benedict Elliott Smith < >>>>> bened...@apache.org> wrote: >>>>> >>>>> Like something doesn't add up here because if it always includes the >>>>> base table's primary key columns that means they could be storage attached >>>>> by just forbidding additional columns and there doesn't seem to be much >>>>> utility in including additional columns in the primary key? >>>>> >>>>> >>>>> You can re-order the keys, and they only need to be a part of the >>>>> primary key not the partition key. I think you can specify an arbitrary >>>>> order to the keys also, so you can change the effective sort order. So, >>>>> the >>>>> basic idea is you stipulate something like PRIMARY KEY ((v1),(ck1,pk1)). >>>>> >>>>> This is basically a global index, with the restriction on single >>>>> columns as keys only because we cannot cheaply read-before-write for >>>>> eventually consistent operations. This restriction can easily be relaxed >>>>> for Paxos and Accord based implementations, which can also safely include >>>>> additional keys. >>>>> >>>>> That said, I am not at all sure why they are called materialised views >>>>> if we don’t support including any other data besides the lookup column and >>>>> the primary key. We should really rename them once they work, both to make >>>>> some sense and to break with the historical baggage. >>>>> >>>>> I think this can be represented as a tombstone which can always be >>>>> fetched from the base table on read or maybe some other arrangement? I >>>>> agree it can't feasibly be represented as an enumeration of the deletions >>>>> at least not synchronously and doing it async has its own problems. >>>>> >>>>> If the base table must be read on read of an index/view, then I think >>>>> this proposal is approximately linearizable for the view as well (though, >>>>> I >>>>> do not at all warrant this statement). 
You still need to propagate this >>>>> eventually so that the views can cleanup. This also makes reads 2RT on >>>>> read, which is rather costly. >>>>> >>>>> On 12 May 2025, at 16:10, Ariel Weisberg <ar...@weisberg.ws> wrote: >>>>> >>>>> Hi, >>>>> >>>>> I think it's worth taking a step back and looking at the current MV >>>>> restrictions which are pretty onerous. >>>>> >>>>> A view must have a primary key and that primary key must conform to >>>>> the following restrictions: >>>>> >>>>> - it must contain all the primary key columns of the base table. >>>>> This ensures that every row of the view correspond to exactly one row >>>>> of >>>>> the base table. >>>>> - it can only contain a single column that is not a primary key >>>>> column in the base table. >>>>> >>>>> At that point what exactly is the value in including anything except >>>>> the original primary key in the MV's primary key columns unless you are >>>>> using an ordered partitioner so you can iterate based on the leading >>>>> primary key columns? >>>>> >>>>> Like something doesn't add up here because if it always includes the >>>>> base table's primary key columns that means they could be storage attached >>>>> by just forbidding additional columns and there doesn't seem to be much >>>>> utility in including additional columns in the primary key? >>>>> >>>>> I'm not that clear on how much better it is to look something up in >>>>> the MV vs just looking at the base table or some non-materialized view of >>>>> it. How exactly are these MVs supposed to be used and what value do they >>>>> provide? >>>>> >>>>> Jeff Jirsa wrote: >>>>> >>>>> There’s 2 things in this proposal that give me a lot of pause. >>>>> >>>>> >>>>> Runtian Liu pointed out that the CEP is sort of divided into two >>>>> parts. The first is the online part which is making reads/writes to MVs >>>>> safer and more reliable using a transaction system. The second is offline >>>>> which is repair. 
>>>>> The story for the online portion is, I think, quite strong and worth considering on its own merits.
>>>>>
>>>>> The offline portion (repair) sounds a little less feasible to run in production, but I also think that MVs without any mechanism for checking their consistency are not viable to run in production. So it's kind of pay-for-what-you-use in terms of the feature?
>>>>>
>>>>> It's definitely worth thinking through whether there is a way to fix one side of this equation so it works better.
>>>>>
>>>>> David Capwell wrote:
>>>>>
>>>>> As far as I can tell, being based off Accord means you don't need to care about repair, as Accord will manage the consistency for you; you can't get out of sync.
>>>>>
>>>>> I think a baseline requirement in C* for something to be in production is to be able to run preview repair and validate that the transaction system, or any other part of Cassandra, hasn't made a mistake. Divergence can have many sources, including Accord.
>>>>>
>>>>> Runtian Liu wrote:
>>>>>
>>>>> For the example David mentioned, LWT cannot support it. Since LWTs operate on a single token, we'll need to restrict base-table updates to one partition (and ideally one row) at a time. A current MV base-table command can delete an entire partition, but doing so might touch hundreds of MV partitions, making consistency guarantees impossible.
>>>>>
>>>>> I think this can be represented as a tombstone which can always be fetched from the base table on read, or maybe some other arrangement? I agree it can't feasibly be represented as an enumeration of the deletions, at least not synchronously, and doing it async has its own problems.
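Runtian's fan-out concern is easy to quantify with a toy count (the numbers below are made up for illustration, not taken from the CEP):

```python
# Sketch of why a single base-partition delete is hard for MVs: deleting
# a wide base partition must remove one MV row per base row, and those
# rows land in as many distinct MV partitions as there are distinct
# values of the view's partition key column. Purely illustrative data.

# 1,000 rows in one base partition; v1 takes 100 distinct values.
base_partition = [(ck, ck % 100) for ck in range(1000)]  # (clustering, v1)

mv_partitions_touched = len({v1 for _ck, v1 in base_partition})
# One partition-level tombstone on the base side becomes per-partition
# deletes scattered across the view side of the cluster.
```

With LWTs confined to a single token, there is no cheap way to make those hundred scattered view deletes atomic with the one base delete, which is the restriction Runtian describes.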
>>>>> Ariel
>>>>>
>>>>> On Fri, May 9, 2025, at 4:03 PM, Jeff Jirsa wrote:
>>>>>
>>>>> On May 9, 2025, at 12:59 PM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>>>>
>>>>> I am a *big* fan of getting repair really working with MVs. It does seem problematic that the number of Merkle trees will be equal to the number of ranges in the cluster, and that repair of MVs would become an all-node operation. How would down nodes be handled, and how many nodes would simultaneously be working to validate a given base table range at once? How many base table ranges could simultaneously be repairing MVs?
>>>>>
>>>>> If a row containing a column that creates an MV partition is deleted, and the MV isn't updated, then how does the Merkle tree approach propagate the deletion to the MV? The CEP says that anti-compaction would remove extra rows, but I am not clear on how that works. When is anti-compaction performed in the repair process, and what is/isn't included in the outputs?
>>>>>
>>>>> I thought about these two points last night after I sent my email.
>>>>>
>>>>> There are two things in this proposal that give me a lot of pause.
>>>>>
>>>>> One is the lack of tombstones/deletions in the Merkle trees, which makes properly dealing with writes/deletes/inconsistency very hard (afaict).
>>>>>
>>>>> The second is the reality that repairing a single partition in the base table may repair all hosts/ranges in the MV table, and vice versa. Basically, scanning either base or MV is effectively scanning the whole cluster (modulo what you can avoid in the clean/dirty repaired sets). This makes me really, really concerned with how it scales, and how likely it is to be able to schedule automatically without blowing up.
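Jeff's first concern (no tombstones in the trees) can be illustrated with a toy range hash standing in for a Merkle-tree leaf. This is a deliberate simplification of real Merkle validation, with made-up row data:

```python
import hashlib

def range_hash(rows):
    """Toy stand-in for one Merkle-tree leaf: a hash over the live rows
    in a token range. Real validation builds a tree of such hashes."""
    h = hashlib.sha256()
    for row in sorted(rows):
        h.update(repr(row).encode())
    return h.hexdigest()

# (v1, id) rows the live base table currently implies for the view,
# versus what the view actually holds after a missed deletion.
base_derived = {(10, 1), (20, 3)}
view_actual = {(10, 1), (10, 2), (20, 3)}  # (10, 2) is stale

stale = view_actual - base_derived
mismatch = range_hash(base_derived) != range_hash(view_actual)
# The hashes diverge, but a hash over live rows alone cannot say whether
# the fix is to add a row on one side or delete a row on the other.
# Without deletions represented in the tree, repair needs a separate
# mechanism (the CEP's anti-compaction) to drop rows like (10, 2).
```

This is also why two view replicas that both hold the stale row would compare as identical: replica-to-replica Merkle comparison of the view never sees the deletion at all.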
>>>>> The Paxos vs. Accord comments so far are interesting in that I think both could be made to work, but I am very concerned about how the Merkle tree comparisons are likely to work with wide partitions leading to massive fanout in ranges.