As part of fairly complex processing, I am executing a self join query
using HiveContext against a Hive table to find the latest Transaction,
oldest Transaction etc: for a given set of Attributes. I am still using
v1.3.1 and so Window functions are not an option. The simplified query
looks like below.
val df = hiveContext.sql("""SELECT TAB1.KEY1 ,TAB1.KEY2
,MAX(CASE WHEN (TAB1.FLD10 = TAB2.min_FLD10) THEN TAB1.FLD11
ELSE -9999999 END) AS NEW_FLD
FROM TAB1
INNER JOIN
( SELECT KEY1 ,KEY2 ,
MIN(FLD10) AS min_FLD10
FROM TAB1
WHERE partition_key >= '2015-01-01' and partition_key < '2015-07-01'
GROUP BY KEY1 ,KEY2 ) TAB2
ON TAB1.KEY1= TAB2.KEY1AND TAB1.KEY2= TAB2.KEY1
WHERE partition_key >= '2015-01-01' and partition_key < '2015-07-01'
GROUP BY TAB1.KEY1, TAB1.KEY2""")
I see that ~18,000 HDFS blocks are read TWICE and then the Shuffle happens
. Is there a way to avoid reading the same blocks TWICE during the Map
Stage? Is there a way to try to avoid Shuffle? Thank You.