RE: RAM management during cogroup and join

2015-04-15 Thread Evo Eftimov
that DStreams are some sort of different type of RDDs From: Tathagata Das [mailto:t...@databricks.com] Sent: Wednesday, April 15, 2015 11:11 PM To: Evo Eftimov Cc: user Subject: Re: RAM management during cogroup and join Well, DStream joins are nothing but RDD joins at its core. However

Re: RAM management during cogroup and join

2015-04-15 Thread Tathagata Das
> *From:* Tathagata Das [mailto:t...@databricks.com] > *Sent:* Wednesday, April 15, 2015 9:48 PM > > *To:* Evo Eftimov > *Cc:* user > *Subject:* Re: RAM management during cogroup and join > > > > Agreed. > > > > On Wed, Apr 15, 2015 at 1:29 PM, Evo Eftimov

RE: RAM management during cogroup and join

2015-04-15 Thread Evo Eftimov
Subject: Re: RAM management during cogroup and join Agreed. On Wed, Apr 15, 2015 at 1:29 PM, Evo Eftimov wrote: That has been done Sir and represents further optimizations – the objective here was to confirm whether cogroup always results in the previously described “greedy” explosion of

Re: RAM management during cogroup and join

2015-04-15 Thread Tathagata Das
5 9:25 PM > *To:* Evo Eftimov > *Cc:* user > *Subject:* Re: RAM management during cogroup and join > > > > Significant optimizations can be made by doing the joining/cogroup in a > smart way. If you have to join streaming RDDs with the same batch RDD, then > you can first par

RE: RAM management during cogroup and join

2015-04-15 Thread Evo Eftimov
change the total number of elements included in the result RDD and RAM allocated – right? From: Tathagata Das [mailto:t...@databricks.com] Sent: Wednesday, April 15, 2015 9:25 PM To: Evo Eftimov Cc: user Subject: Re: RAM management during cogroup and join Significant optimizations can be made

Re: RAM management during cogroup and join

2015-04-15 Thread Tathagata Das
Significant optimizations can be made by doing the joining/cogroup in a smart way. If you have to join streaming RDDs with the same batch RDD, then you can first partition the batch RDD using a partitioner and cache it, and then use the same partitioner on the streaming RDDs. That would make sure t
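
To make the co-partitioning idea above concrete, here is a minimal pure-Python sketch. It is not Spark code and does not use any Spark API: `hash_partition`, `cogroup_copartitioned`, and `NUM_PARTITIONS` are hypothetical names made up for this illustration, and the "partitioner" is assumed to be a plain `hash(key) % n`. It only shows why using the same partitioner on both sides lets partition i of one dataset be matched against partition i of the other, with no need to look anywhere else.

```python
from collections import defaultdict

NUM_PARTITIONS = 4  # assumed value, chosen only for the illustration


def hash_partition(pairs, num_partitions):
    """Place each (key, value) pair in bucket hash(key) % num_partitions."""
    buckets = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[hash(key) % num_partitions][key].append(value)
    return buckets


def cogroup_copartitioned(left, right):
    """Cogroup two datasets that were split by the same partitioner.

    Because both sides use the same partitioner, every key in partition i
    of `left` can only appear in partition i of `right`.
    """
    assert len(left) == len(right), "both sides must use the same partitioner"
    result = {}
    for left_part, right_part in zip(left, right):
        for key in left_part.keys() | right_part.keys():
            result[key] = (left_part.get(key, []), right_part.get(key, []))
    return result


# The "batch" side is partitioned once and kept around (the analogue of caching).
batch = hash_partition([(1, "a"), (2, "b"), (5, "c")], NUM_PARTITIONS)

# Each incoming "streaming" batch is partitioned the same way before the cogroup.
micro_batch = hash_partition([(1, "x"), (5, "y"), (5, "z"), (7, "w")], NUM_PARTITIONS)

grouped = cogroup_copartitioned(batch, micro_batch)
for key in sorted(grouped):
    print(key, grouped[key])
```

In this sketch, changing `NUM_PARTITIONS` changes only where each key is stored, not which elements end up in `grouped`.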