Agreed.

On Wed, Apr 15, 2015 at 1:29 PM, Evo Eftimov <evo.efti...@isecc.com> wrote:
> That has been done Sir and represents further optimizations – the objective here was to confirm whether cogroup always results in the previously described "greedy" explosion of the number of elements included and RAM allocated for the result RDD.
>
> The optimizations mentioned still don't change the total number of elements included in the result RDD and the RAM allocated – right?
>
> *From:* Tathagata Das [mailto:t...@databricks.com]
> *Sent:* Wednesday, April 15, 2015 9:25 PM
> *To:* Evo Eftimov
> *Cc:* user
> *Subject:* Re: RAM management during cogroup and join
>
> Significant optimizations can be made by doing the joining/cogroup in a smart way. If you have to join streaming RDDs with the same batch RDD, then you can first partition the batch RDD using a partitioner and cache it, and then use the same partitioner on the streaming RDDs. That would make sure that the large batch RDD is not partitioned repeatedly for the cogroup; only the small streaming RDDs are partitioned.
>
> HTH
>
> TD
>
> On Wed, Apr 15, 2015 at 1:11 PM, Evo Eftimov <evo.efti...@isecc.com> wrote:
>
> There are indications that joins in Spark are implemented with / based on the cogroup function/primitive/transform. So let me focus first on cogroup - it returns a result which is an RDD consisting of essentially ALL elements of the cogrouped RDDs. Said another way - for every key present in any of the cogrouped RDDs, the result contains at least one element from at least one of them.
>
> That would mean that when smaller, streaming RDDs (e.g. JavaPairDStream RDDs) keep getting joined with a much larger batch RDD, RAM gets allocated for repeated instances of the result (cogrouped) RDD - essentially the large batch RDD and then some ... Obviously the RAM will get returned when the DStream RDDs get discarded, and they do on a regular basis, but that still seems an unnecessary spike in RAM consumption.
>
> I have two questions:
>
> 1. Is there any way to control the cogroup process more "precisely", e.g. tell it to include in the cogrouped RDD only elements where there is at least one element from EACH of the cogrouped RDDs for a given key? Based on the current cogroup API this does not seem possible.
>
> 2. If cogroup really is such a sledgehammer, and the joins are based on cogroup, then even though they present a prettier picture in terms of the end result visible to the end user, does that mean that under the hood the same atrocious RAM consumption is still going on?
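For concreteness, a minimal sketch of the partition-and-cache pattern described above, using Spark's Scala API. The names (batchRdd, stream), the key/value types, and the partition count are illustrative assumptions, not taken from the thread:

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.DStream

def joinWithStaticBatch(batchRdd: RDD[(String, Long)],
                        stream: DStream[(String, Int)])
    : DStream[(String, (Int, Long))] = {
  val partitioner = new HashPartitioner(100) // illustrative partition count

  // Partition the large batch RDD once and cache it, so it is not
  // re-shuffled for every micro-batch that joins against it.
  val partitionedBatch =
    batchRdd.partitionBy(partitioner).persist(StorageLevel.MEMORY_AND_DISK)

  // Apply the SAME partitioner to each (small) streaming RDD; only the
  // streaming side is shuffled per batch, the cached batch side stays put.
  stream.transform { streamRdd =>
    streamRdd.partitionBy(partitioner).join(partitionedBatch)
  }
}

Because both sides share the partitioner, each micro-batch shuffles only the small streaming RDD, so the per-batch RAM spike is bounded by the streaming side plus the join output rather than by a freshly re-partitioned copy of the batch RDD.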
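On question 1: while the cogroup API itself offers no such option, the "at least one element from EACH side per key" behaviour can be recovered by filtering the cogroup result, which is effectively what an inner join does. A sketch, again with illustrative concrete types:

import org.apache.spark.rdd.RDD

// Keep only keys that have at least one element from EACH of the two
// cogrouped RDDs, i.e. inner-join semantics. Note the filter runs AFTER
// the shuffle has grouped the values, so it trims the downstream RDD but
// does not avoid the peak RAM of materializing the full cogroup result -
// which is the concern raised in question 2.
def strictCogroup(left: RDD[(String, Int)], right: RDD[(String, Long)])
    : RDD[(String, (Iterable[Int], Iterable[Long]))] =
  left.cogroup(right).filter { case (_, (ls, rs)) => ls.nonEmpty && rs.nonEmpty }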