Re: [DISCUSS] Tooling to repair MV through a Spark job

Yifan Cai Fri, 06 Dec 2024 15:58:13 -0800

Oh, I just noticed that James already mentioned it.

On Fri, Dec 6, 2024 at 3:51 PM Yifan Cai <yc25c...@gmail.com> wrote:


> I would like to highlight an existing tool for "many things beyond the
> MV work, such as counting rows, etc."
>
> The Apache Cassandra Analytics project (
> http://github.com/apache/cassandra-analytics/) could be a great resource
> for this type of task. It reads directly from the SSTables in the Spark
> executors, which avoids sending CQL queries that could stress the cluster
> or interfere with production traffic.
>
> - Yifan
>
> On Fri, Dec 6, 2024 at 8:27 AM Jaydeep Chovatia <
> chovatia.jayd...@gmail.com> wrote:
>
>> Hi,
>>
>> *NOTE: *This email does not promote using Cassandra's Materialized View
>> (MV) but assists those stuck with it for various reasons.
>>
>> The primary issue with MV is that once it goes out of sync with the base
>> table, no tooling is available to remediate it. This Spark job aims to fill
>> this gap by logically comparing the MV with the base table and identifying
>> inconsistencies. The job primarily does the following:
>>
>>    - Scans the base table (A) and the MV (B), and does an {A}-{B} analysis
>>    - Categorizes each record into one of four categories: a) Consistent,
>>    b) Inconsistent, c) MissingInMV, d) MissingInBaseTable
>>    - Provides a detailed view of mismatches, such as the primary key, all
>>    the non-primary key fields, and mismatched columns.
>>    - Dumps the detailed information to an output folder path provided to
>>    the job (one can extend the interface to dump the records to some object
>>    store as well)
>>    - Optionally, the job fixes the MV inconsistencies.
>>    - Offers rich configuration (throttling, actionable output, capability
>>    to specify the time range for the records, etc.) to run the job at scale
>>    in a production environment
>>
>> Design doc: link
>> <https://docs.google.com/document/d/14mo_3TlKmaL3mC_Vs69k1n923CoJmVFvEFvuPAAHk4I/edit?usp=sharing>
>> The Git Repository: link
>> <https://github.com/jaydeepkumar1984/cassandra-mv-repair-spark-job>
>>
>> *Motivation*
>>
>>    1. This email's primary objective is to share with the community that
>>    something like this is available for MV (in a private repository), which
>>    may be helpful in emergencies to folks stuck with MV in production.
>>    2. If we, as a community, want to officially foster tooling using
>>    Spark because it can be helpful to do many things beyond the MV work, such
>>    as counting rows, etc., then I am happy to drive the efforts.
>>
>> Please let me know what you think.
>>
>> Jaydeep
>>
>
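
The categorization step described in the quoted message (Consistent,
Inconsistent, MissingInMV, MissingInBaseTable) can be sketched in plain
Python as follows. This is only an illustration under assumptions, not the
code from the linked repository: the `compare` function, the `Category`
enum, and the in-memory dicts standing in for the base-table and MV scans
are all hypothetical names, and the sketch assumes every MV column should
equal the same-named base-table column.

```python
from enum import Enum


class Category(Enum):
    # The four outcomes named in the quoted message.
    CONSISTENT = "Consistent"
    INCONSISTENT = "Inconsistent"
    MISSING_IN_MV = "MissingInMV"
    MISSING_IN_BASE_TABLE = "MissingInBaseTable"


def compare(base_rows, mv_rows):
    """Classify every primary key seen in either input.

    base_rows and mv_rows map a primary-key tuple to a dict of
    non-primary-key columns (hypothetical stand-ins for the two scans).
    Returns {primary_key: (Category, sorted list of mismatched columns)}.
    """
    report = {}
    # Union of keys, so rows present on only one side are still reported.
    for key in base_rows.keys() | mv_rows.keys():
        base = base_rows.get(key)
        mv = mv_rows.get(key)
        if mv is None:
            report[key] = (Category.MISSING_IN_MV, [])
        elif base is None:
            report[key] = (Category.MISSING_IN_BASE_TABLE, [])
        else:
            # Assumption: compare only the columns the MV actually holds,
            # since an MV may select a subset of the base table's columns.
            mismatched = sorted(c for c in mv if base.get(c) != mv[c])
            category = (
                Category.INCONSISTENT if mismatched else Category.CONSISTENT
            )
            report[key] = (category, mismatched)
    return report
```

In a real Spark job the same classification would more likely be expressed
as a full outer join of the two datasets on the primary key; the dict-based
version above only shows the per-record decision logic.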
