Hi,

*NOTE:* This email does not promote using Cassandra's Materialized View (MV) but assists those stuck with it for various reasons.

The primary issue with MV is that once it goes out of sync with the base table, no tooling is available to remediate it. This Spark job aims to fill this gap by logically comparing the MV with the base table and identifying inconsistencies. The job primarily does the following:

- Scans the base table (A) and the MV (B), and performs an {A}-{B} analysis.
- Categorizes each record into one of four areas: a) Consistent, b) Inconsistent, c) MissingInMV, d) MissingInBaseTable.
- Provides a detailed view of mismatches, such as the primary key, all the non-primary key fields, and the mismatched columns.
- Dumps the detailed information to an output folder path provided to the job (one can extend the interface to dump the records to some object store as well).
- Optionally fixes the MV inconsistencies.
- Offers rich configuration (throttling, actionable output, the capability to specify the time range for the records, etc.) to run the job at scale in a production environment.

Design doc: <https://docs.google.com/document/d/14mo_3TlKmaL3mC_Vs69k1n923CoJmVFvEFvuPAAHk4I/edit?usp=sharing>
The Git Repository: <https://github.com/jaydeepkumar1984/cassandra-mv-repair-spark-job>

*Motivation*

1. This email's primary objective is to share with the community that something like this is available for MV (in a private repository), which may be helpful in emergencies to folks stuck with MV in production.
2. If we, as a community, want to officially foster Spark-based tooling, which can be helpful for many things beyond the MV work (such as counting rows), then I am happy to drive the effort.

Please let me know what you think.

Jaydeep
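
P.S. To make the categorization step concrete, here is a minimal plain-Python sketch of the four-way comparison described above. It is not taken from the repository and does not use Spark. The names `Status` and `compare` are hypothetical, and it assumes both inputs are already keyed by the same primary key and that each base row carries every column the view selects.

```python
from enum import Enum


class Status(Enum):
    # The four areas named in the email.
    CONSISTENT = "Consistent"
    INCONSISTENT = "Inconsistent"
    MISSING_IN_MV = "MissingInMV"
    MISSING_IN_BASE_TABLE = "MissingInBaseTable"


def compare(base_rows, mv_rows):
    """Classify every primary key seen in either input (hypothetical helper).

    Both arguments map a primary key to a dict of non-primary-key columns.
    Returns {primary_key: (Status, sorted list of mismatched column names)}.
    """
    report = {}
    for key in base_rows.keys() | mv_rows.keys():
        base, mv = base_rows.get(key), mv_rows.get(key)
        if mv is None:
            report[key] = (Status.MISSING_IN_MV, [])
        elif base is None:
            report[key] = (Status.MISSING_IN_BASE_TABLE, [])
        else:
            # Assumption: only the columns present in the view row are compared.
            mismatched = sorted(c for c in mv if base.get(c) != mv[c])
            status = Status.INCONSISTENT if mismatched else Status.CONSISTENT
            report[key] = (status, mismatched)
    return report


# Toy data: key 2 differs in "city", key 3 exists only in the base table,
# key 4 exists only in the view.
base = {1: {"name": "a", "city": "x"}, 2: {"name": "b", "city": "y"}, 3: {"name": "c", "city": "z"}}
mv = {1: {"name": "a", "city": "x"}, 2: {"name": "b", "city": "q"}, 4: {"name": "d", "city": "w"}}
report = compare(base, mv)
```

In a distributed job the same classification would presumably come out of a full outer join on the primary key, with the per-column comparison applied to each joined pair.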