Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

Dinesh Joshi Mon, 01 May 2023 12:50:32 -0700
Does anybody have any questions that we could answer about this proposal?

> On Apr 27, 2023, at 1:24 PM, Francisco Guerrero <[email protected]> 
> wrote:
> 
> Hi folks,
> 
> We have updated the Confluence page with the source code for CEP-28.
> There are two repositories with contributions. One is the patch [1]
> for Cassandra Sidecar with the bulk APIs that enable the Cassandra
> Spark Analytics library. The second is a new repository [2] with
> contributions to the Cassandra Spark Analytics code.
> 
> We also have a README markdown file that you can follow to give the
> code a try:
> 
> https://github.com/frankgh/cassandra-analytics/blob/trunk/cassandra-analytics-core-example/README.md
> 
> Best,
> - Francisco
> 
> [1] Apache Cassandra Sidecar bulk APIs source code: 
> https://github.com/frankgh/cassandra-sidecar/tree/CEP-28-bulk-apis
> [2] Apache Cassandra Spark Analytics source code: 
> https://github.com/frankgh/cassandra-analytics
> 
> 
> On 2023/04/05 15:18:07 Doug Rohrer wrote:
> > Sorry for the delay in responding here - yes, we can add some diagrams
> > to the CEP - I’ll try to get that done by end-of-week.
> >
> > Thanks,
> >
> > Doug
> >
> > > On Mar 28, 2023, at 1:14 PM, J. D. Jordan <[email protected]> wrote:
> > >
> > > Maybe some data flow diagrams could be added to the CEP showing some
> > > example operations for read/write?
> > >
> > >> On Mar 28, 2023, at 11:35 AM, Yifan Cai <[email protected]> wrote:
> > >>
> > >> A lot of great discussions!
> > >>
> > >> On the sidecar front, especially what role the sidecar plays in terms
> > >> of this CEP, I feel there might be some confusion. Once the code is
> > >> published, we should have clarity.
> > >> Sidecar does not read SSTables nor do any coordination for analytics
> > >> queries. It is local to the companion Cassandra instance. For bulk
> > >> read, it takes snapshots and streams SSTables to Spark workers to
> > >> read. For bulk write, it imports the SSTables uploaded from Spark
> > >> workers. All commands are existing JMX/nodetool functionalities from
> > >> Cassandra. Sidecar adds the HTTP interface to them. It might be an
> > >> oversimplified description. The complex computation is performed in
> > >> Spark clusters only.
> > >>
> > >> In the long run, Cassandra might evolve into a database that does
> > >> both OLTP and OLAP. (Not what this thread aims for)
> > >> At the current stage, Spark is very well suited for analytic purposes.
> > >>
> > >> On Tue, Mar 28, 2023 at 9:06 AM Benedict <[email protected]> wrote:
> > >>> I disagree with the first claim, as the process has all the
> > >>> information it chooses to utilise about which resources it’s using
> > >>> and what it’s using those resources for.
> > >>>
> > >>> The inability to isolate GC domains is something we cannot address,
> > >>> but also probably not a problem if we were doing everything with
> > >>> memory management as well as we could be.
> > >>>
> > >>> But, not worth derailing this thread for. Today we do very little
> > >>> well on this front within the process, and a separate process is
> > >>> well justified given the state of play.
> > >>>
> > >>>> On 28 Mar 2023, at 16:38, Derek Chen-Becker <[email protected]> wrote:
> > >>>>
> > >>>> On Tue, Mar 28, 2023 at 9:03 AM Joseph Lynch <[email protected]> wrote:
> > >>>> ...
> > >>>>
> > >>>>> I think we might be underselling how valuable JVM isolation is,
> > >>>>> especially for analytics queries that are going to pass the entire
> > >>>>> dataset through heap somewhat constantly.
> > >>>>
> > >>>> Big +1 here. The JVM simply does not have significant granularity
> > >>>> of control for resource utilization, but this is explicitly a
> > >>>> feature of separate processes. Add in being able to separate GC
> > >>>> domains and you can avoid a lot of noisy neighbor in-VM behavior
> > >>>> for the disparate workloads.
> > >>>>
> > >>>> Cheers,
> > >>>>
> > >>>> Derek
> > >>>>
> > >>>> --
> > >>>> +---------------------------------------------------------------+
> > >>>> | Derek Chen-Becker                                             |
> > >>>> | GPG Key available at https://keybase.io/dchenbecker and       |
> > >>>> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
> > >>>> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7 7F42 AFC5 AFEE 96E4 6ACC   |
> > >>>> +---------------------------------------------------------------+
> > >>>>
> >
> -- 
> Francisco Guerrero