[DISCUSS] Tooling to repair MV through a Spark job

Jaydeep Chovatia Fri, 06 Dec 2024 08:27:07 -0800

Hi,

*NOTE:* This email does not promote using Cassandra's Materialized View
(MV) but assists those stuck with it for various reasons.


The primary issue with MV is that once it goes out of sync with the base
table, no tooling is available to remediate it. This Spark job aims to fill
this gap by logically comparing the MV with the base table and identifying
inconsistencies. The job primarily does the following:

   - Scans the base table (A) and the MV (B), and performs an {A}-{B}
   analysis
   - Categorizes each record into one of four categories: a) Consistent, b)
   Inconsistent, c) MissingInMV, d) MissingInBaseTable
   - Provides a detailed view of mismatches, such as the primary key, all
   the non-primary key fields, and the mismatched columns
   - Dumps the detailed information to an output folder path provided to
   the job (one can extend the interface to dump the records to some object
   store as well)
   - Optionally fixes the MV inconsistencies
   - Offers rich configuration (throttling, actionable output, the ability
   to specify the time range for the records, etc.) to run the job at scale
   in a production environment
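
To make the four categories concrete, here is a minimal sketch of the
comparison logic. It is not code from the job itself: plain Python
dictionaries stand in for the Spark datasets, both inputs are assumed to be
keyed by the base table's primary key, and the names Status and classify are
hypothetical.

```python
from enum import Enum


class Status(Enum):
    CONSISTENT = "Consistent"
    INCONSISTENT = "Inconsistent"
    MISSING_IN_MV = "MissingInMV"
    MISSING_IN_BASE_TABLE = "MissingInBaseTable"


def classify(base_rows, mv_rows):
    """Compare two {primary_key: {column: value}} mappings.

    Returns {primary_key: (Status, sorted list of mismatched column names)}.
    """
    report = {}
    # The union of keys plays the role of a full outer join on the primary key.
    for key in base_rows.keys() | mv_rows.keys():
        base, mv = base_rows.get(key), mv_rows.get(key)
        if mv is None:
            report[key] = (Status.MISSING_IN_MV, [])
        elif base is None:
            report[key] = (Status.MISSING_IN_BASE_TABLE, [])
        else:
            # Columns whose values differ, or that exist on one side only.
            diff = sorted(
                c for c in base.keys() | mv.keys() if base.get(c) != mv.get(c)
            )
            status = Status.INCONSISTENT if diff else Status.CONSISTENT
            report[key] = (status, diff)
    return report


# Hypothetical sample data: key 2 differs in "city", key 3 is absent from
# the MV, and key 4 exists only in the MV.
base = {
    1: {"name": "a", "city": "x"},
    2: {"name": "b", "city": "y"},
    3: {"name": "c", "city": "z"},
}
mv = {
    1: {"name": "a", "city": "x"},
    2: {"name": "b", "city": "q"},
    4: {"name": "d", "city": "w"},
}
report = classify(base, mv)
```

In a distributed setting the same idea would presumably be expressed as a
full outer join of the two scans on the key columns, with the per-record
classification applied to each joined pair.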

Design doc: link
<https://docs.google.com/document/d/14mo_3TlKmaL3mC_Vs69k1n923CoJmVFvEFvuPAAHk4I/edit?usp=sharing>
The Git Repository: link
<https://github.com/jaydeepkumar1984/cassandra-mv-repair-spark-job>

*Motivation*

   1. This email's primary objective is to share with the community that
   something like this is available for MV (in a private repository), which
   may be helpful in emergencies to folks stuck with MV in production.
   2. If we, as a community, want to officially foster Spark-based tooling,
   which can be helpful for many things beyond the MV work, such as
   counting rows, then I am happy to drive the effort.

Please let me know what you think.

Jaydeep
