Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

Dinesh Joshi Tue, 02 May 2023 12:31:14 -0700

It is line rate, i.e. network bound. We have a patch out in Vert.x that should
let it use the zero-copy path, but that patch is not a strict prerequisite.


On 2023/05/02 15:39:02 Sebastian Estevez wrote:
> Hi folks,
> 
> Great stuff, thanks for sharing.
> 
> The performance numbers I've seen so far are for the sidecar streaming
> sstables (seems like this is just network bound?). What kind of perf are
> you seeing at the Spark executors (at the per task level)?
> 
> --Seb
> 
> On Mon, May 1, 2023 at 3:50 PM Dinesh Joshi <[email protected]> wrote:
> 
> > Does anybody have any questions that we could answer about this proposal?
> >
> > On Apr 27, 2023, at 1:24 PM, Francisco Guerrero <[email protected]>
> > wrote:
> >
> > Hi folks,
> >
> > We have updated the Confluence page with the source code for CEP-28.
> > There are two repositories with contributions. One is the patch [1]
> > for Cassandra Sidecar with the bulk APIs that enable the Cassandra
> > Spark Analytics library. The second is a new repository [2] with
> > contributions to the Cassandra Spark Analytics code.
> >
> > We also have a README markdown file that you can follow to give the
> > code a try:
> >
> >
> > https://github.com/frankgh/cassandra-analytics/blob/trunk/cassandra-analytics-core-example/README.md
> >
> > Best,
> > - Francisco
> >
> > [1] Apache Cassandra Sidecar bulk APIs source code:
> > https://github.com/frankgh/cassandra-sidecar/tree/CEP-28-bulk-apis
> > [2] Apache Cassandra Spark Analytics source code:
> > https://github.com/frankgh/cassandra-analytics
> >
> >
> > On 2023/04/05 15:18:07 Doug Rohrer wrote:
> > > Sorry for the delay in responding here - yes, we can add some diagrams
> > > to the CEP - I’ll try to get that done by end-of-week.
> > >
> > > Thanks,
> > >
> > > Doug
> > >
> > > > On Mar 28, 2023, at 1:14 PM, J. D. Jordan <[email protected]> wrote:
> > > >
> > > > Maybe some data flow diagrams could be added to the cep showing some
> > > > example operations for read/write?
> > > >
> > > >> On Mar 28, 2023, at 11:35 AM, Yifan Cai <[email protected]> wrote:
> > > >>
> > > >> A lot of great discussions!
> > > >>
> > > >> On the sidecar front, especially what the role sidecar plays in
> > > >> terms of this CEP, I feel there might be some confusion. Once the
> > > >> code is published, we should have clarity.
> > > >> Sidecar does not read sstables nor do any coordination for
> > > >> analytics queries. It is local to the companion Cassandra instance.
> > > >> For bulk read, it takes snapshots and streams sstables to spark
> > > >> workers to read. For bulk write, it imports the sstables uploaded
> > > >> from spark workers. All commands are existing jmx/nodetool
> > > >> functionalities from Cassandra. Sidecar adds the http interface to
> > > >> them. It might be an over simplified description. The complex
> > > >> computation is performed in spark clusters only.
> > > >>
> > > >> In the long run, Cassandra might evolve into a database that does
> > > >> both OLTP and OLAP. (Not what this thread aims for)
> > > >> At the current stage, Spark is very suited for analytic purposes.
> > > >>
> > > >> On Tue, Mar 28, 2023 at 9:06 AM Benedict <[email protected]> wrote:
> > > >>> I disagree with the first claim, as the process has all the
> > > >>> information it chooses to utilise about which resources it’s using
> > > >>> and what it’s using those resources for.
> > > >>>
> > > >>> The inability to isolate GC domains is something we cannot address,
> > > >>> but also probably not a problem if we were doing everything with
> > > >>> memory management as well as we could be.
> > > >>>
> > > >>> But, not worth detailing this thread for. Today we do very little
> > > >>> well on this front within the process, and a separate process is
> > > >>> well justified given the state of play.
> > > >>>
> > > >>>> On 28 Mar 2023, at 16:38, Derek Chen-Becker <[email protected]> wrote:
> > > >>>>
> > > >>>> On Tue, Mar 28, 2023 at 9:03 AM Joseph Lynch <[email protected]> wrote:
> > > >>>> ...
> > > >>>>
> > > >>>>> I think we might be underselling how valuable JVM isolation is,
> > > >>>>> especially for analytics queries that are going to pass the entire
> > > >>>>> dataset through heap somewhat constantly.
> > > >>>>
> > > >>>> Big +1 here. The JVM simply does not have significant granularity
> > > >>>> of control for resource utilization, but this is explicitly a
> > > >>>> feature of separate processes. Add in being able to separate GC
> > > >>>> domains and you can avoid a lot of noisy neighbor in-VM behavior
> > > >>>> for the disparate workloads.
> > > >>>>
> > > >>>> Cheers,
> > > >>>>
> > > >>>> Derek
> > > >>>>
> > > >>>> --
> > > >>>> +---------------------------------------------------------------+
> > > >>>> | Derek Chen-Becker                                             |
> > > >>>> | GPG Key available at https://keybase.io/dchenbecker and       |
> > > >>>> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
> > > >>>> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7 7F42 AFC5 AFEE 96E4 6ACC   |
> > > >>>> +---------------------------------------------------------------+
> >
> > --
> > Francisco Guerrero
> >
> >
> >
> 
> -- 
> All the best,
> 
> Sebastián
>
