Yes, it would. You can key both RDDs by the join column and partition them with the same partitioner (say a HashPartitioner); the join then gets faster because records with equal keys land in the same partition, so Spark has far less to shuffle.
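A rough sketch of what I mean (the data and partition count here are made up; adjust them to your job):

import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}

val sc = new SparkContext(new SparkConf().setAppName("copartitioned-join"))

// Key both sides by the join key.
val left  = sc.parallelize(Seq((1, "a"), (2, "b")))
val right = sc.parallelize(Seq((1, "x"), (2, "y")))

// Pre-partition (and cache) one side; its layout is then reused.
val p = new HashPartitioner(200)
val leftPart = left.partitionBy(p).persist()

// join() sees leftPart's known partitioner, so only `right` gets
// shuffled to match it; leftPart itself stays in place.
val joined = leftPart.join(right)
joined.collect().foreach(println)

Caching the pre-partitioned side matters: without persist(), the partitioning work can be redone on every action.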
Thanks
Best Regards

On Sun, Feb 1, 2015 at 5:13 PM, Sunita Arvind <sunitarv...@gmail.com> wrote:
> Hi All,
>
> We are joining large tables using Spark SQL and running into shuffle
> issues. We have explored multiple options - using coalesce to reduce the
> number of partitions, tuning various parameters like the disk buffer,
> reducing data in chunks, etc. - all of which seem to help, btw. What I
> would like to know is: would using a pair RDD instead of a regular RDD be
> one of the solutions? Will it make the join more efficient, since Spark
> can shuffle better when it knows the key? Logically speaking, I think it
> should help, but I haven't found any evidence of this on the internet,
> including in the Spark SQL documentation.
>
> It is a lot of effort for us to try this approach and weigh the
> performance, as we need to register the output as tables before we can
> proceed to use them. Hence, I would appreciate input from the community
> before proceeding.
>
> Regards,
> Sunita Koppar