Thanks, it makes sense.

On Thursday, March 12, 2015, Daniel Siegmann <[email protected]> wrote:
> Join causes a shuffle (sending data across the network). I expect it will
> be better to filter before you join, so you reduce the amount of data which
> is sent across the network.
>
> Note this would be true for *any* transformation which causes a shuffle.
> It would not be true if you're combining RDDs with union, since that
> doesn't cause a shuffle.
>
> On Thu, Mar 12, 2015 at 11:04 AM, shahab <[email protected]> wrote:
>
>> Hi,
>>
>> Probably this question has already been answered sometime on the mailing
>> list, but I couldn't find it. Sorry for posting this again.
>>
>> I need to join and apply filtering on three different RDDs. I just
>> wonder which of the following alternatives is more efficient:
>> 1- first join all three RDDs and then do filtering on the resulting joined
>> RDD, or
>> 2- apply filtering on each individual RDD and then join the resulting RDDs
>>
>> Or probably there is no difference due to lazy evaluation and Spark's
>> underlying optimisation?
>>
>> best,
>> /Shahab
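
To make the point about filtering before the join concrete, here is a minimal sketch in plain Python. It is not Spark code: `toy_join`, the three sample datasets, and the `keep` predicate are all hypothetical, and the "shuffle cost" is simply assumed to be the number of input records each join receives. Under those assumptions it compares the two alternatives from the original question.

```python
from collections import defaultdict

def toy_join(left, right):
    """Inner-join two lists of (key, value) pairs.

    Returns (joined, shuffled), where `shuffled` is a toy cost model:
    we assume every input record of a join crosses the network once.
    """
    shuffled = len(left) + len(right)
    by_key = defaultdict(list)
    for k, v in right:
        by_key[k].append(v)
    joined = [(k, (lv, rv)) for k, lv in left for rv in by_key.get(k, ())]
    return joined, shuffled

# Hypothetical stand-ins for the three RDDs: (key, value) pairs.
a = [(k, k % 10) for k in range(1000)]
b = [(k, k % 7) for k in range(1000)]
c = [(k, k % 3) for k in range(1000)]

def keep(v):
    # Hypothetical filter: keep records whose value is 0.
    return v == 0

# Alternative 1: join all three datasets, then filter the joined result.
ab, cost_ab = toy_join(a, b)
abc, cost_abc = toy_join(ab, c)
result_late = [(k, ((va, vb), vc)) for k, ((va, vb), vc) in abc
               if keep(va) and keep(vb) and keep(vc)]
shuffled_late = cost_ab + cost_abc

# Alternative 2: filter each dataset first, then join the smaller results.
fa = [(k, v) for k, v in a if keep(v)]
fb = [(k, v) for k, v in b if keep(v)]
fc = [(k, v) for k, v in c if keep(v)]
fab, cost_fab = toy_join(fa, fb)
result_early, cost_fabc = toy_join(fab, fc)
shuffled_early = cost_fab + cost_fabc

print("same result:", sorted(result_late) == sorted(result_early))
print("records shuffled, filter after join: ", shuffled_late)
print("records shuffled, filter before join:", shuffled_early)
```

Both alternatives produce the same records here because the filter only looks at each input's own value, so it can be applied on either side of the join. The difference is in how many records each join has to move, which is the cost the reply above is pointing at. Real shuffle cost in Spark depends on partitioning and record size, so treat the counts as illustrative only.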
