Yes, the sidecar streaming runs at line rate, i.e. it is network bound. We have a patch out in Vert.x that should let it use the zero-copy path, but that patch is not a strict prerequisite.
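For anyone unfamiliar with what a zero-copy path means here, a minimal sketch follows. It is not the Vert.x patch and not Sidecar code; it only illustrates the JDK's FileChannel.transferTo, which lets the operating system move file bytes to the target channel directly where the platform supports it, instead of copying them through the JVM heap. The class and method names (ZeroCopySketch, transferFile) are hypothetical.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public final class ZeroCopySketch {
    private ZeroCopySketch() {}

    // Hypothetical helper for illustration only; not part of Sidecar or Vert.x.
    // Sends the whole file at `source` to `target` and returns the bytes sent.
    static long transferFile(Path source, WritableByteChannel target) throws IOException {
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ)) {
            long size = in.size();
            long position = 0;
            // transferTo may send fewer bytes than requested, so loop until done.
            while (position < size) {
                long sent = in.transferTo(position, size - position, target);
                if (sent <= 0) {
                    break; // no progress (e.g. file truncated underneath us)
                }
                position += sent;
            }
            return position;
        }
    }
}
```

With a SocketChannel as the target, the payload never has to be read into a Java byte array, which is why a path like this matters when the workload is already network bound.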
On 2023/05/02 15:39:02 Sebastian Estevez wrote:
> Hi folks,
>
> Great stuff thanks for sharing.
>
> The performance numbers I've seen so far are for the sidecar streaming
> sstables (seems like this is just network bound?). What kind of perf are
> you seeing at the Spark executors (at the per task level)?
>
> --Seb
>
> On Mon, May 1, 2023 at 3:50 PM Dinesh Joshi <djo...@apache.org> wrote:
>
> > Does anybody have any questions that we could answer about this proposal?
> >
> > On Apr 27, 2023, at 1:24 PM, Francisco Guerrero <frank.guerr...@gmail.com>
> > wrote:
> >
> > Hi folks,
> >
> > We have updated the confluence page with the source code for CEP-28.
> > There are two repositories with contributions. One is the patch [1]
> > for Cassandra Sidecar with the bulk APIs that enable the Cassandra
> > Spark Analytics library. The second is a new repository [2] with
> > contributions to the Cassandra Spark Analytics code
> >
> > We also have a README markdown file that you can follow to give the
> > code a try:
> >
> > https://github.com/frankgh/cassandra-analytics/blob/trunk/cassandra-analytics-core-example/README.md
> >
> > Best,
> > - Francisco
> >
> > [1] Apache Cassandra Sidecar bulk APIs source code:
> > https://github.com/frankgh/cassandra-sidecar/tree/CEP-28-bulk-apis
> > [2] Apache Cassandra Spark Analytics source code:
> > https://github.com/frankgh/cassandra-analytics
> >
> >
> > On 2023/04/05 15:18:07 Doug Rohrer wrote:
> > > Sorry for the delay in responding here - yes, we can add some diagrams
> > > to the CEP - I’ll try to get that done by end-of-week.
> > >
> > > Thanks,
> > >
> > > Doug
> > >
> > > > On Mar 28, 2023, at 1:14 PM, J. D. Jordan <jeremiah.jor...@gmail.com> wrote:
> > > >
> > > > Maybe some data flow diagrams could be added to the cep showing some
> > > > example operations for read/write?
> > > >
> > > >> On Mar 28, 2023, at 11:35 AM, Yifan Cai <yc25c...@gmail.com> wrote:
> > > >>
> > > >> A lot of great discussions!
> > > >>
> > > >> On the sidecar front, especially what the role sidecar plays in terms
> > > >> of this CEP, I feel there might be some confusion. Once the code is
> > > >> published, we should have clarity.
> > > >> Sidecar does not read sstables nor do any coordination for analytics
> > > >> queries. It is local to the companion Cassandra instance. For bulk
> > > >> read, it takes snapshots and streams sstables to spark workers to
> > > >> read. For bulk write, it imports the sstables uploaded from spark
> > > >> workers. All commands are existing jmx/nodetool functionalities from
> > > >> Cassandra. Sidecar adds the http interface to them. It might be an
> > > >> over simplified description. The complex computation is performed in
> > > >> spark clusters only.
> > > >>
> > > >> In the long run, Cassandra might evolve into a database that does
> > > >> both OLTP and OLAP. (Not what this thread aims for)
> > > >> At the current stage, Spark is very suited for analytic purposes.
> > > >>
> > > >> On Tue, Mar 28, 2023 at 9:06 AM Benedict <bened...@apache.org> wrote:
> > > >>> I disagree with the first claim, as the process has all the
> > > >>> information it chooses to utilise about which resources it’s using
> > > >>> and what it’s using those resources for.
> > > >>>
> > > >>> The inability to isolate GC domains is something we cannot address,
> > > >>> but also probably not a problem if we were doing everything with
> > > >>> memory management as well as we could be.
> > > >>>
> > > >>> But, not worth detailing this thread for. Today we do very little
> > > >>> well on this front within the process, and a separate process is
> > > >>> well justified given the state of play.
> > > >>>
> > > >>>> On 28 Mar 2023, at 16:38, Derek Chen-Becker <de...@chen-becker.org> wrote:
> > > >>>>
> > > >>>> On Tue, Mar 28, 2023 at 9:03 AM Joseph Lynch <joe.e.ly...@gmail.com> wrote:
> > > >>>> ...
> > > >>>>
> > > >>>>> I think we might be underselling how valuable JVM isolation is,
> > > >>>>> especially for analytics queries that are going to pass the entire
> > > >>>>> dataset through heap somewhat constantly.
> > > >>>>
> > > >>>> Big +1 here. The JVM simply does not have significant granularity
> > > >>>> of control for resource utilization, but this is explicitly a
> > > >>>> feature of separate processes. Add in being able to separate GC
> > > >>>> domains and you can avoid a lot of noisy neighbor in-VM behavior
> > > >>>> for the disparate workloads.
> > > >>>>
> > > >>>> Cheers,
> > > >>>>
> > > >>>> Derek
> > > >>>>
> > > >>>> --
> > > >>>> +---------------------------------------------------------------+
> > > >>>> | Derek Chen-Becker                                             |
> > > >>>> | GPG Key available at https://keybase.io/dchenbecker and       |
> > > >>>> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
> > > >>>> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7 7F42 AFC5 AFEE 96E4 6ACC   |
> > > >>>> +---------------------------------------------------------------+
> >
> > --
> > Francisco Guerrero
>
> --
> All the best,
>
> Sebastián