When you read both 'a' and 'b', can you try repartitioning both by column 'id'? If you .cache() and .count() each one, the count forces the shuffle, which pushes the records that will be joined onto the same executors.
So:

a = spark.read.parquet('path_to_table_a').repartition('id').cache()
a.count()
b = spark.read.parquet('path_to_table_b').repartition('id').cache()
b.count()

And then join (a minimal sketch of the full flow follows the quoted message below).

> On Aug 8, 2016, at 8:17 PM, Ashic Mahtab <as...@live.com> wrote:
>
> Hello,
> We have two parquet inputs of the following form:
>
> a: id:String, Name:String (1.5TB)
> b: id:String, Number:Int (1.3GB)
>
> We need to join these two to get (id, Number, Name). We've tried two
> approaches:
>
> a.join(b, Seq("id"), "right_outer")
>
> where a and b are dataframes. We also tried taking the rdds, mapping them to
> pair rdds with id as the key, and then joining. What we're seeing is that
> temp file usage is increasing on the join stage, and filling up our disks,
> causing the job to crash. Is there a way to join these two data sets without
> well...crashing?
>
> Note, the ids are unique, and there's a one to one mapping between the two
> datasets.
>
> Any help would be appreciated.
>
> -Ashic.
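For reference, here is a minimal sketch of the full flow in PySpark. The input paths and the "right_outer" join come from the thread itself; the SparkSession setup, the select of output columns, and the output path are placeholder assumptions, not something the thread specifies.

from pyspark.sql import SparkSession

# Placeholder session setup; the reply above assumes `spark` already exists.
spark = SparkSession.builder.appName("repartitioned-join").getOrCreate()

# Repartition both inputs by the join key so rows sharing an id land in
# the same partitions, then cache and count to materialize the shuffle.
a = spark.read.parquet('path_to_table_a').repartition('id').cache()
a.count()
b = spark.read.parquet('path_to_table_b').repartition('id').cache()
b.count()

# 'right_outer' mirrors the original question; since the ids map one to
# one between the datasets, an inner join would return the same rows.
joined = a.join(b, 'id', 'right_outer').select('id', 'Number', 'Name')

joined.write.parquet('path_to_output')  # placeholder output path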