DataFrames are available from the latest release, 1.3, I believe – in 1.2 (our 
case at the moment) I guess the options are more limited 

 

PS: agreed that DStreams are just an abstraction for a sequence / stream of 
(ordinary) RDDs – when I say “DStreams” I mean the DStream OO API in Spark, not 
that DStreams are some different type of RDD

 

From: Tathagata Das [mailto:t...@databricks.com] 
Sent: Wednesday, April 15, 2015 11:11 PM
To: Evo Eftimov
Cc: user
Subject: Re: RAM management during cogroup and join

 

Well, DStream joins are nothing but RDD joins at their core. However, more 
optimizations are available to you with DataFrames and Spark SQL joins. With the 
schema, there is greater scope for optimizing the joins. So converting the 
streaming RDDs and the batch RDDs to DataFrames, and then applying the joins, 
may improve performance.
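
For illustration, a minimal sketch of that conversion in Scala (assumes the 
Spark 1.3 DataFrame API; the Event case class and the batchRDD / stream names 
are hypothetical placeholders):

import org.apache.spark.sql.SQLContext

case class Event(key: String, value: Int)

// Assumed inputs: sc is the SparkContext, batchRDD: RDD[(String, Int)],
// stream: DStream[(String, Int)]
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Convert the static batch RDD to a DataFrame once, up front
val batchDF = batchRDD.map { case (k, v) => Event(k, v) }.toDF()

// Convert each streaming micro-batch to a DataFrame; with a schema,
// Catalyst has more scope to optimize the join
stream.foreachRDD { rdd =>
  val streamDF = rdd.map { case (k, v) => Event(k, v) }.toDF()
  val joined = streamDF.join(batchDF, streamDF("key") === batchDF("key"))
  joined.count()
}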

 

TD

 

On Wed, Apr 15, 2015 at 1:50 PM, Evo Eftimov <evo.efti...@isecc.com> wrote:

Thank you Sir, and one final confirmation/clarification - are all forms of 
joins in the Spark API for DStream RDDs based on cogroup in terms of their 
internal implementation?

 

From: Tathagata Das [mailto:t...@databricks.com] 
Sent: Wednesday, April 15, 2015 9:48 PM
To: Evo Eftimov
Cc: user
Subject: Re: RAM management during cogroup and join

 

Agreed. 

 

On Wed, Apr 15, 2015 at 1:29 PM, Evo Eftimov <evo.efti...@isecc.com> wrote:

That has been done, Sir, and it represents a further optimization – the objective 
here was to confirm whether cogroup always results in the previously described 
“greedy” explosion of the number of elements included in, and RAM allocated for, 
the result RDD 

 

The optimizations mentioned still don’t change the total number of elements 
included in the result RDD and the RAM allocated for it – right? 

 

From: Tathagata Das [mailto:t...@databricks.com] 
Sent: Wednesday, April 15, 2015 9:25 PM
To: Evo Eftimov
Cc: user
Subject: Re: RAM management during cogroup and join

 

Significant optimizations can be made by doing the joining/cogroup in a smart 
way. If you have to join streaming RDDs with the same batch RDD, then you can 
first partition the batch RDD using a partitioner and cache it, and then use 
the same partitioner on the streaming RDDs. That makes sure the large batch RDD 
is not shuffled repeatedly for the cogroup; only the small streaming RDDs are 
partitioned.
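
A minimal sketch of that pattern (the partition count of 100 and the 
batchRDD / stream names are hypothetical):

import org.apache.spark.HashPartitioner

// Partition the large batch RDD once with an explicit partitioner and cache it
val partitioner = new HashPartitioner(100)
val partitionedBatch = batchRDD.partitionBy(partitioner).cache()

// Because partitionedBatch already carries this partitioner, the join
// (a cogroup under the hood) does not reshuffle it on every micro-batch;
// only the small per-batch streaming RDDs get shuffled
stream.foreachRDD { rdd =>
  val joined = rdd.join(partitionedBatch, partitioner)
  joined.count()
}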

 

HTH

 

TD

 

On Wed, Apr 15, 2015 at 1:11 PM, Evo Eftimov <evo.efti...@isecc.com> wrote:

There are indications that joins in Spark are implemented with / based on the
cogroup function/primitive/transform. So let me focus first on cogroup - it
returns a result which is an RDD consisting of essentially ALL elements of the
cogrouped RDDs. Said another way - for every key present in any of the cogrouped
RDDs, the result contains at least one element from at least one of them.

That would mean that when smaller, streaming RDDs (e.g. the RDDs of a
JavaPairDStream) keep getting joined with a much larger batch RDD, RAM is
allocated for repeated instances of the result (cogrouped) RDD - essentially
the large batch RDD and then some ...
Obviously the RAM gets returned when the DStream RDDs are discarded, and they
are on a regular basis, but that still looks like an unnecessary spike in RAM
consumption

I have two questions:

1. Is there any way to control the cogroup process more "precisely", e.g. tell
it to include in the cogrouped RDD only the keys for which there is at least one
element from EACH of the cogrouped RDDs? Based on the current cogroup API this
does not seem possible (though a filter-based workaround is sketched after
question 2)


2. If cogroup really is such a sledgehammer, and the joins are based on cogroup,
then even though they present a prettier picture in terms of the end result
visible to the end user, does that mean that under the hood the same atrocious
RAM consumption is still going on?
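
To make the two questions concrete, a small self-contained sketch (the rddA /
rddB data are made up): question 1 can at least be approximated by filtering
the cogroup result, and on question 2, join in Spark's PairRDDFunctions is
indeed built on cogroup, roughly as paraphrased below.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("cogroup-sketch").setMaster("local[2]"))

val rddA = sc.parallelize(Seq(("a", 1), ("b", 2)))
val rddB = sc.parallelize(Seq(("a", 10.0), ("c", 30.0)))

// Question 1: approximate a "strict" cogroup by keeping only keys that have
// at least one element in EACH of the cogrouped RDDs
val strict = rddA.cogroup(rddB).filter {
  case (_, (vs, ws)) => vs.nonEmpty && ws.nonEmpty
}

// Question 2: join is implemented on top of cogroup (paraphrased from the
// Spark source) - the per-key cross product of values is flat-mapped out, so
// the cogroup underneath still groups all values per key in memory first
val joinLikeJoin = rddA.cogroup(rddB).flatMapValues { case (vs, ws) =>
  for (v <- vs.iterator; w <- ws.iterator) yield (v, w)
}

println(strict.collect().toSeq)       // only key "a" survives
println(joinLikeJoin.collect().toSeq) // (a,(1,10.0)) - same as rddA.join(rddB)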




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/RAM-management-during-cogroup-and-join-tp22505.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
