Re: Which is more efficient : first join three RDDs and then do filtering or vice versa?

shahab Fri, 13 Mar 2015 05:11:28 -0700

Thanks, it makes sense.

On Thursday, March 12, 2015, Daniel Siegmann <[email protected]>
wrote:


> Join causes a shuffle (sending data across the network). I expect it will
> be better to filter before you join, so you reduce the amount of data which
> is sent across the network.
>
> Note this would be true for *any* transformation which causes a shuffle.
> It would not be true if you're combining RDDs with union, since that
> doesn't cause a shuffle.
>
> On Thu, Mar 12, 2015 at 11:04 AM, shahab <[email protected]> wrote:
>
>> Hi,
>>
>> This question has probably been answered on the mailing list before,
>> but I couldn't find it. Sorry for posting it again.
>>
>> I need to join and filter three different RDDs, and I wonder which
>> of the following alternatives is more efficient:
>> 1- first join all three RDDs and then filter the resulting joined
>> RDD, or
>> 2- filter each individual RDD and then join the resulting RDDs
>>
>>
>> Or is there probably no difference, due to lazy evaluation and
>> Spark's under-the-hood optimisation?
>>
>> best,
>> /Shahab
>>
>
>
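
To make the point in the quoted reply concrete, here is a rough sketch in plain Python. It is a toy model, not Spark code: the helper shuffle_join, the predicate keep, and the data are all made up for illustration, and the "shuffled" counter just counts every record fed into a join as a stand-in for data sent across the network. It also assumes the filter can be evaluated on each dataset separately (here it only looks at the key).

```python
from collections import defaultdict


def shuffle_join(left, right, stats):
    """Inner join of two lists of (key, value) pairs.

    Every input record is counted as "shuffled", a crude stand-in for
    the records a join would have to move across the network.
    """
    stats["shuffled"] += len(left) + len(right)
    by_key = defaultdict(list)
    for k, v in right:
        by_key[k].append(v)
    return [(k, (lv, rv)) for k, lv in left for rv in by_key.get(k, [])]


def keep(key):
    # Hypothetical filter condition: keep only even keys.
    return key % 2 == 0


# Made-up data: three datasets keyed by the integers 0..999.
a = [(k, "a%d" % k) for k in range(1000)]
b = [(k, "b%d" % k) for k in range(1000)]
c = [(k, "c%d" % k) for k in range(1000)]

# Alternative 1: join all three datasets, then filter the result.
join_first = {"shuffled": 0}
joined = shuffle_join(shuffle_join(a, b, join_first), c, join_first)
result_1 = [rec for rec in joined if keep(rec[0])]

# Alternative 2: filter each dataset, then join the filtered datasets.
filter_first = {"shuffled": 0}
fa, fb, fc = ([rec for rec in d if keep(rec[0])] for d in (a, b, c))
result_2 = shuffle_join(shuffle_join(fa, fb, filter_first), fc, filter_first)

print("same result:", result_1 == result_2)
print("records into joins, join first:  ", join_first["shuffled"])
print("records into joins, filter first:", filter_first["shuffled"])
```

Both alternatives produce the same joined records, but in this toy model the filter-first version feeds half as many records into the joins. How large the saving is in a real job depends on how selective the filter is.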
