Thank you for the write-up and the efforts on CASSANDRA-16222.  It sounds like 
you've been using this for some time.  I understand from the rejected 
alternatives that the Spark Cassandra Connector was slower because it goes 
through the read and write path for C* rather than this backdoor mechanism.  In 
your experience using this, under what circumstances have you found that this 
tool is not a good fit for analytics - complex predicates, for example?  The 
challenge with the Spark Cassandra Connector, and previously the Hadoop 
integration, is that they had to do full table scans even to retrieve small 
amounts of data.  It sounds like this is similar in that it has to do a full 
table scan, but with the advantage of being faster and placing less load on the 
cluster.  In other words, I'm asking whether this has been a full replacement 
for the Spark Cassandra Connector, or whether there are cases in your work 
where SCC is a better fit.

Also, to Benjamin's point in the comments on the CEP itself, how coupled is 
this to Cassandra internals?  Will there be higher-level APIs, or will it call 
internal storage classes directly?

Thanks!

Jeremy


> On Mar 23, 2023, at 12:33 PM, Doug Rohrer <droh...@apple.com> wrote:
> 
> Hi everyone,
> 
> Wiki: 
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-28%3A+Reading+and+Writing+Cassandra+Data+with+Spark+Bulk+Analytics
> 
> We’d like to propose this CEP for adoption by the community.
> 
> It is common for teams using Cassandra to find themselves looking for a way 
> to interact with large amounts of data for analytics workloads. However, 
> Cassandra’s standard APIs aren’t well suited to large-scale data 
> egress/ingest, as the native read/write paths weren’t designed for bulk 
> analytics.
> 
> We’re proposing this CEP for this exact purpose. It enables the 
> implementation of custom Spark (or similar) applications that can either read 
> or write large amounts of Cassandra data at line rates, by accessing the 
> persistent storage of nodes in the cluster via the Cassandra Sidecar.
> 
> This CEP proposes new APIs in the Cassandra Sidecar and a companion library 
> that integrates deeply with Apache Spark, allowing users to bulk import or 
> export data from a running Cassandra cluster with minimal to no impact on 
> read/write traffic.
> 
> We will shortly publish a branch with code that will accompany this CEP to 
> help readers understand it better.
> 
> As a reminder, please keep the discussion here on the dev list vs. in the 
> wiki, as we’ve found it easier to manage via email.
> 
> Sincerely,
> 
> Doug Rohrer & James Berragan