Agreed.

On Wed, Apr 15, 2015 at 1:29 PM, Evo Eftimov <evo.efti...@isecc.com> wrote:
> That has been done Sir and represents further optimizations – the objective here was to confirm whether cogroup always results in the previously described "greedy" explosion of the number of elements included and RAM allocated for the result RDD.
>
> The optimizations mentioned still don't change the total number of elements included in the result RDD and the RAM allocated – right?
>
> *From:* Tathagata Das [mailto:t...@databricks.com]
> *Sent:* Wednesday, April 15, 2015 9:25 PM
> *To:* Evo Eftimov
> *Cc:* user
> *Subject:* Re: RAM management during cogroup and join
>
> Significant optimizations can be made by doing the joining/cogroup in a smart way. If you have to join streaming RDDs with the same batch RDD, then you can first partition the batch RDD using a partitioner and cache it, and then use the same partitioner on the streaming RDDs. That would make sure that the large batch RDD is not partitioned repeatedly for the cogroup; only the small streaming RDDs are partitioned.
>
> HTH
>
> TD
>
> On Wed, Apr 15, 2015 at 1:11 PM, Evo Eftimov <evo.efti...@isecc.com> wrote:
>
> There are indications that joins in Spark are implemented with / based on the cogroup function/primitive/transform. So let me focus first on cogroup - it returns a result which is an RDD consisting of essentially ALL elements of the cogrouped RDDs. Said another way - for every key present in any of the cogrouped RDDs, the result contains at least one element from at least one of them.
>
> That would mean that when smaller, streaming RDDs (e.g. JavaPairDStream RDDs) keep getting joined with a much larger batch RDD, RAM gets allocated for repeated instances of the result (cogrouped) RDD - essentially the large batch RDD and then some ... Obviously the RAM will get returned when the DStream RDDs get discarded, and they do on a regular basis, but that still seems an unnecessary spike in RAM consumption.
>
> I have two questions:
>
> 1. Is there any way to control the cogroup process more "precisely", e.g. tell it to include in the cogrouped RDD only elements where there is at least one element from EACH of the cogrouped RDDs for a given key? Based on the current cogroup API this does not seem possible.
>
> 2. If cogroup really is such a sledgehammer, and the joins are based on cogroup, then even though they present a prettier picture in terms of the end result visible to the end user, does that mean that under the hood the same atrocious RAM consumption is still going on?
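For concreteness, a minimal sketch of the partition-and-cache pattern described above, using Spark's Scala API. The names (batchRdd, stream), the key/value types, and the partition count are illustrative assumptions, not taken from the thread:

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.DStream

def joinWithStaticBatch(batchRdd: RDD[(String, Long)],
                        stream: DStream[(String, Int)])
    : DStream[(String, (Int, Long))] = {
  val partitioner = new HashPartitioner(100) // illustrative partition count

  // Partition the large batch RDD once and cache it, so it is not
  // re-shuffled for every micro-batch that joins against it.
  val partitionedBatch =
    batchRdd.partitionBy(partitioner).persist(StorageLevel.MEMORY_AND_DISK)

  // Apply the SAME partitioner to each (small) streaming RDD; only the
  // streaming side is shuffled per batch, the cached batch side stays put.
  stream.transform { streamRdd =>
    streamRdd.partitionBy(partitioner).join(partitionedBatch)
  }
}

Because both sides share the partitioner, each micro-batch shuffles only the small streaming RDD, so the per-batch RAM spike is bounded by the streaming side plus the join output rather than by a freshly re-partitioned copy of the batch RDD.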
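On question 1: while the cogroup API itself offers no such option, the "at least one element from EACH side per key" behaviour can be recovered by filtering the cogroup result, which is effectively what an inner join does. A sketch, again with illustrative concrete types:

import org.apache.spark.rdd.RDD

// Keep only keys that have at least one element from EACH of the two
// cogrouped RDDs, i.e. inner-join semantics. Note the filter runs AFTER
// the shuffle has grouped the values, so it trims the downstream RDD but
// does not avoid the peak RAM of materializing the full cogroup result -
// which is the concern raised in question 2.
def strictCogroup(left: RDD[(String, Int)], right: RDD[(String, Long)])
    : RDD[(String, (Iterable[Int], Iterable[Long]))] =
  left.cogroup(right).filter { case (_, (ls, rs)) => ls.nonEmpty && rs.nonEmpty }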