Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

Doug Rohrer Wed, 05 Apr 2023 08:18:36 -0700

Sorry for the delay in responding here - yes, we can add some diagrams to the 
CEP - I’ll try to get that done by end-of-week.


Thanks,

Doug

> On Mar 28, 2023, at 1:14 PM, J. D. Jordan <jeremiah.jor...@gmail.com> wrote:
> 
> Maybe some data flow diagrams could be added to the CEP showing some example 
> operations for read/write?
> 
>> On Mar 28, 2023, at 11:35 AM, Yifan Cai <yc25c...@gmail.com> wrote:
>> 
>> 
>> A lot of great discussions! 
>> 
>> On the sidecar front, especially the role the sidecar plays in this CEP, 
>> I feel there might be some confusion. Once the code is published, 
>> we should have clarity.
>> Sidecar does not read sstables, nor does it do any coordination for 
>> analytics queries. It is local to its companion Cassandra instance. For 
>> bulk read, it takes snapshots and streams sstables to Spark workers to read. 
>> For bulk write, it imports the sstables uploaded from Spark workers. All 
>> commands are existing JMX/nodetool functionality from Cassandra; Sidecar 
>> adds an HTTP interface to them. This may be an oversimplified description. 
>> The complex computation is performed in Spark clusters only.
>> 
>> In the long run, Cassandra might evolve into a database that does both OLTP 
>> and OLAP. (Not what this thread aims for) 
>> At the current stage, Spark is well suited for analytic purposes. 
>> 
>> On Tue, Mar 28, 2023 at 9:06 AM Benedict <bened...@apache.org> wrote:
>>> I disagree with the first claim, as the process has all the information it 
>>> chooses to utilise about which resources it’s using and what it’s using 
>>> those resources for.
>>> 
>>> The inability to isolate GC domains is something we cannot address, but 
>>> also probably not a problem if we were doing everything with memory 
>>> management as well as we could be.
>>> 
>>> But, not worth derailing this thread for. Today we do very little well on 
>>> this front within the process, and a separate process is well justified 
>>> given the state of play.
>>> 
>>>> On 28 Mar 2023, at 16:38, Derek Chen-Becker <de...@chen-becker.org> wrote:
>>>> 
>>>> 
>>>> 
>>>> On Tue, Mar 28, 2023 at 9:03 AM Joseph Lynch <joe.e.ly...@gmail.com> wrote:
>>>> ...
>>>> 
>>>>> I think we might be underselling how valuable JVM isolation is,
>>>>> especially for analytics queries that are going to pass the entire
>>>>> dataset through heap somewhat constantly. 
>>>> 
>>>> Big +1 here. The JVM simply does not have significant granularity of 
>>>> control for resource utilization, but this is explicitly a feature of 
>>>> separate processes. Add in being able to separate GC domains and you can 
>>>> avoid a lot of noisy neighbor in-VM behavior for the disparate workloads.
>>>> 
>>>> Cheers,
>>>> 
>>>> Derek
>>>> 
>>>> 
>>>> -- 
>>>> +---------------------------------------------------------------+
>>>> | Derek Chen-Becker                                             |
>>>> | GPG Key available at https://keybase.io/dchenbecker and       |
>>>> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
>>>> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
>>>> +---------------------------------------------------------------+
>>>>
