Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

Sebastian Estevez Tue, 02 May 2023 12:52:49 -0700

Hey Dinesh,

Yeah, it makes sense that the sstable streaming is network bound, since it's
mostly just moving files.

Do you have any performance stats on the sstable parsing side inside Spark?

--Seb

On Tue, May 2, 2023 at 3:31 PM Dinesh Joshi <[email protected]> wrote:

> It is line rate / network bound. We have a patch out in vert.x that should
> use the zero copy path for it. But it's not a strict prereq for it.
>
> On 2023/05/02 15:39:02 Sebastian Estevez wrote:
> > Hi folks,
> >
> > Great stuff, thanks for sharing.
> >
> > The performance numbers I've seen so far are for the sidecar streaming
> > sstables (seems like this is just network bound?). What kind of perf are
> > you seeing at the Spark executors (at the per task level)?
> >
> > --Seb
> >
> > On Mon, May 1, 2023 at 3:50 PM Dinesh Joshi <[email protected]> wrote:
> >
> > > Does anybody have any questions that we could answer about this
> > > proposal?
> > >
> > > On Apr 27, 2023, at 1:24 PM, Francisco Guerrero <[email protected]>
> > > wrote:
> > >
> > > Hi folks,
> > >
> > > We have updated the confluence page with the source code for CEP-28.
> > > There are two repositories with contributions. One is the patch [1]
> > > for Cassandra Sidecar with the bulk APIs that enable the Cassandra
> > > Spark Analytics library. The second is a new repository [2] with
> > > contributions to the Cassandra Spark Analytics code.
> > >
> > > We also have a README markdown file that you can follow to give the
> > > code a try:
> > >
> > > https://github.com/frankgh/cassandra-analytics/blob/trunk/cassandra-analytics-core-example/README.md
> > >
> > > Best,
> > > - Francisco
> > >
> > > [1] Apache Cassandra Sidecar bulk APIs source code:
> > > https://github.com/frankgh/cassandra-sidecar/tree/CEP-28-bulk-apis
> > > [2] Apache Cassandra Spark Analytics source code:
> > > https://github.com/frankgh/cassandra-analytics
> > >
> > >
> > > On 2023/04/05 15:18:07 Doug Rohrer wrote:
> > > > Sorry for the delay in responding here - yes, we can add some diagrams
> > > > to the CEP - I’ll try to get that done by end-of-week.
> > > >
> > > > Thanks,
> > > >
> > > > Doug
> > > >
> > > > > On Mar 28, 2023, at 1:14 PM, J. D. Jordan <[email protected]> wrote:
> > > > >
> > > > > Maybe some data flow diagrams could be added to the cep showing some
> > > > > example operations for read/write?
> > > > >
> > > > >> On Mar 28, 2023, at 11:35 AM, Yifan Cai <[email protected]> wrote:
> > > > >>
> > > > >> A lot of great discussions!
> > > > >>
> > > > >> On the sidecar front, especially what the role sidecar plays in terms
> > > > >> of this CEP, I feel there might be some confusion. Once the code is
> > > > >> published, we should have clarity.
> > > > >> Sidecar does not read sstables nor do any coordination for analytics
> > > > >> queries. It is local to the companion Cassandra instance. For bulk
> > > > >> read, it takes snapshots and streams sstables to spark workers to
> > > > >> read. For bulk write, it imports the sstables uploaded from spark
> > > > >> workers. All commands are existing jmx/nodetool functionalities from
> > > > >> Cassandra. Sidecar adds the http interface to them. It might be an
> > > > >> over simplified description. The complex computation is performed in
> > > > >> spark clusters only.
> > > > >>
> > > > >> In the long run, Cassandra might evolve into a database that does
> > > > >> both OLTP and OLAP. (Not what this thread aims for)
> > > > >> At the current stage, Spark is very suited for analytic purposes.
> > > > >>
> > > > >> On Tue, Mar 28, 2023 at 9:06 AM Benedict <[email protected]> wrote:
> > > > >>> I disagree with the first claim, as the process has all the
> > > > >>> information it chooses to utilise about which resources it’s using
> > > > >>> and what it’s using those resources for.
> > > > >>>
> > > > >>> The inability to isolate GC domains is something we cannot address,
> > > > >>> but also probably not a problem if we were doing everything with
> > > > >>> memory management as well as we could be.
> > > > >>>
> > > > >>> But, not worth detailing this thread for. Today we do very little
> > > > >>> well on this front within the process, and a separate process is
> > > > >>> well justified given the state of play.
> > > > >>>
> > > > >>>> On 28 Mar 2023, at 16:38, Derek Chen-Becker <[email protected]>
> > > > >>>> wrote:
> > > > >>>>
> > > > >>>> On Tue, Mar 28, 2023 at 9:03 AM Joseph Lynch <[email protected]>
> > > > >>>> wrote:
> > > > >>>> ...
> > > > >>>>
> > > > >>>>> I think we might be underselling how valuable JVM isolation is,
> > > > >>>>> especially for analytics queries that are going to pass the entire
> > > > >>>>> dataset through heap somewhat constantly.
> > > > >>>>
> > > > >>>> Big +1 here. The JVM simply does not have significant granularity
> > > > >>>> of control for resource utilization, but this is explicitly a
> > > > >>>> feature of separate processes. Add in being able to separate GC
> > > > >>>> domains and you can avoid a lot of noisy neighbor in-VM behavior
> > > > >>>> for the disparate workloads.
> > > > >>>>
> > > > >>>> Cheers,
> > > > >>>>
> > > > >>>> Derek
> > > > >>>>
> > > > >>>> --
> > > > >>>> +---------------------------------------------------------------+
> > > > >>>> | Derek Chen-Becker                                             |
> > > > >>>> | GPG Key available at https://keybase.io/dchenbecker and       |
> > > > >>>> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
> > > > >>>> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7 7F42 AFC5 AFEE 96E4 6ACC   |
> > > > >>>> +---------------------------------------------------------------+
> > > >
> > > --
> > > Francisco Guerrero
> > >
> > >
> > >
> >
> > --
> > All the best,
> >
> > Sebastián
> >
>


-- 
All the best,

Sebastián
