Oh, I just noticed that James already mentioned it.

On Fri, Dec 6, 2024 at 3:51 PM Yifan Cai <yc25c...@gmail.com> wrote:
> I would like to highlight an existing tool for "many things beyond the
> MV work, such as counting rows, etc."
>
> The Apache Cassandra Analytics project
> (http://github.com/apache/cassandra-analytics/) could be a great resource
> for this type of task. It reads directly from the SSTables in the Spark
> executors, which avoids sending CQL queries that could stress the cluster
> or interfere with the production traffic.
>
> - Yifan
>
> On Fri, Dec 6, 2024 at 8:27 AM Jaydeep Chovatia <
> chovatia.jayd...@gmail.com> wrote:
>
>> Hi,
>>
>> *NOTE:* This email does not promote using Cassandra's Materialized View
>> (MV) but assists those stuck with it for various reasons.
>>
>> The primary issue with MV is that once it goes out of sync with the base
>> table, no tooling is available to remediate it. This Spark job aims to fill
>> this gap by logically comparing the MV with the base table and identifying
>> inconsistencies. The job primarily does the following:
>>
>> - Scans the base table (A) and the MV (B), and performs an {A}-{B}
>> analysis
>> - Categorizes each record into one of four categories: a) Consistent,
>> b) Inconsistent, c) MissingInMV, d) MissingInBaseTable
>> - Provides a detailed view of mismatches, such as the primary key, all
>> the non-primary key fields, and mismatched columns.
>> - Dumps the detailed information to an output folder path provided to
>> the job (one can extend the interface to dump the records to some object
>> store as well)
>> - Optionally, fixes the MV inconsistencies.
>> - Offers rich configuration (throttling, actionable output, capability to
>> specify the time range for the records, etc.) to run the job at scale in a
>> production environment
>>
>> Design doc: link
>> <https://docs.google.com/document/d/14mo_3TlKmaL3mC_Vs69k1n923CoJmVFvEFvuPAAHk4I/edit?usp=sharing>
>> The Git Repository: link
>> <https://github.com/jaydeepkumar1984/cassandra-mv-repair-spark-job>
>>
>> *Motivation*
>>
>> 1. This email's primary objective is to share with the community that
>> something like this is available for MV (in a private repository), which
>> may be helpful in emergencies to folks stuck with MV in production.
>> 2. If we, as a community, want to officially foster tooling using
>> Spark because it can be helpful to do many things beyond the MV work, such
>> as counting rows, etc., then I am happy to drive the efforts.
>>
>> Please let me know what you think.
>>
>> Jaydeep
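
For anyone trying to picture the {A}-{B} comparison in Jaydeep's list, here is a minimal sketch in plain Python. It is not taken from the linked repository and does not use Spark. The function name `classify`, the dict-of-dicts row representation, and the sample data are all assumptions made for illustration. Only the four category names come from the email. The sketch compares just the columns present in the MV row, on the assumption that the view may select a subset of the base table's columns.

```python
from collections import Counter

# Category names as listed in the email; everything else here is hypothetical.
CONSISTENT = "Consistent"
INCONSISTENT = "Inconsistent"
MISSING_IN_MV = "MissingInMV"
MISSING_IN_BASE = "MissingInBaseTable"


def classify(base_rows, mv_rows):
    """Compare two {primary_key: {column: value}} mappings.

    Returns {primary_key: (category, sorted list of mismatched columns)}.
    """
    report = {}
    # Visit every primary key seen on either side.
    for key in base_rows.keys() | mv_rows.keys():
        base, mv = base_rows.get(key), mv_rows.get(key)
        if mv is None:
            report[key] = (MISSING_IN_MV, [])
        elif base is None:
            report[key] = (MISSING_IN_BASE, [])
        else:
            # Assumption: only columns stored in the MV row are compared.
            diff = sorted(c for c in mv if base.get(c) != mv[c])
            report[key] = (INCONSISTENT if diff else CONSISTENT, diff)
    return report


# Made-up sample data keyed by primary key.
base = {
    1: {"name": "a", "city": "x"},
    2: {"name": "b", "city": "y"},
    3: {"name": "c", "city": "z"},
}
mv = {
    1: {"name": "a", "city": "x"},
    2: {"name": "b", "city": "q"},
    4: {"name": "d", "city": "w"},
}

report = classify(base, mv)
summary = Counter(category for category, _ in report.values())

for key in sorted(report):
    print(key, *report[key])
```

In a real job the two mappings would be distributed datasets joined on the primary key (a full outer join gives the same four cases), and the per-key result would be written to the output path rather than held in memory.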