Try the query below, with the small table on the left-hand side of the join, and see if it makes a difference:

val result = sqlContext.sql("select big.f1, big.f2 from small inner join big on big.s=small.s and big.d=small.d")
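For context, here is a minimal end-to-end sketch of that suggestion (the table names, Parquet paths and join columns are the ones from your original message; sqlContext is assumed to be a Spark 1.4 SQLContext):

// Load both Parquet datasets and register them under the names used in the SQL query
val big = sqlContext.parquetFile("hdfs://big")
big.registerTempTable("big")
val small = sqlContext.parquetFile("hdfs://small")
small.registerTempTable("small")

// Same join, but with the small table listed first
val result = sqlContext.sql(
  "select big.f1, big.f2 from small inner join big on big.s = small.s and big.d = small.d")
result.show()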
On Wed, Jun 24, 2015 at 11:35 AM, Ulanov, Alexander <[email protected]> wrote:

> Hi,
>
> I am trying an inner join of two tables on two fields (string and double). One
> table is 2B rows, the second is 500K. They are stored in HDFS in Parquet.
> Spark v 1.4.
>
> val big = sqlContext.parquetFile("hdfs://big")
> big.registerTempTable("big")
> val small = sqlContext.parquetFile("hdfs://small")
> small.registerTempTable("small")
> val result = sqlContext.sql("select big.f1, big.f2 from big inner join
> small on big.s=small.s and big.d=small.d")
>
> This query fails in the middle because one of the workers runs out of disk
> space, with a reported shuffle of 1.8TB, which is the maximum size of my Spark
> working dirs (7 worker nodes in total). This is surprising, because the
> "big" table takes 2TB of disk space (unreplicated) and "small" about 5GB, and I
> would expect the optimizer to shuffle the small table. How can I force
> Spark to shuffle the small table? I tried writing "small inner join big";
> however, it also fails with 1.8TB of shuffle.
>
> Best regards, Alexander
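As a side note on diagnosing which side of the join gets shuffled: a small sketch, assuming the temp tables registered above; DataFrame.explain is available in Spark 1.4:

// Print the physical plan before running the full query, so it is visible
// up front how each side of the join is exchanged/shuffled.
val plan = sqlContext.sql(
  "select big.f1, big.f2 from small inner join big on big.s = small.s and big.d = small.d")
plan.explain(true)  // extended = true also prints the logical and optimized plans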
