Hello,

I found that the dependency graph among task stages is incorrect in the skew join optimized plan.

In particular, the conditional task in the optimized plan has no dependency on the child tasks of the common join task from the original plan. The conditional task contains the map join task, which does carry all of these dependencies, but when the map join task is filtered out, the dependencies are dropped along with it.
Hence, all the other task stages of the query are skipped.
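
A toy model may make the problem concrete. The sketch below is not Hive code: the Task and ConditionalTask classes, the stage names, and the depth-first run() function are all hypothetical simplifications, assumed only for illustration. It shows that when the downstream stage hangs off the map join task alone, filtering out the map join leaves the downstream stage unreachable.

```python
# Toy model of a task graph. All names here are hypothetical and do not
# correspond to Hive's actual Task / ConditionalTask API.
class Task:
    def __init__(self, name):
        self.name = name
        self.children = []  # tasks scheduled after this one


class ConditionalTask(Task):
    def __init__(self, name, candidates):
        super().__init__(name)
        self.candidates = candidates  # tasks that may be selected at runtime


def run(task, selected, executed):
    """Run `task`, then any selected candidates, then its children."""
    if task.name in executed:
        return
    executed.append(task.name)
    if isinstance(task, ConditionalTask):
        for candidate in task.candidates:
            if candidate.name in selected:
                run(candidate, selected, executed)
    for child in task.children:
        run(child, selected, executed)


def build_plan():
    common_join = Task("common_join")
    map_join = Task("skew_map_join")
    downstream = Task("downstream_stage")
    conditional = ConditionalTask("conditional", [map_join])
    common_join.children.append(conditional)
    # The downstream stage depends on the map join only, as described above.
    map_join.children.append(downstream)
    return common_join


def executed_stages(skew_detected):
    selected = {"skew_map_join"} if skew_detected else set()
    executed = []
    run(build_plan(), selected, executed)
    return executed


print(executed_stages(skew_detected=True))
print(executed_stages(skew_detected=False))
```

With skew detected, all four stages run. Without skew, the traversal stops at the conditional task and the downstream stage never runs, which matches the silent skipping described above.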

The bug resides in "ql/optimizer/physical/GenMRSkewJoinProcessor.java", in the processSkewJoin() function, immediately after the ConditionalTask is created and its dependencies are set.

I have fixed the issue for now by adding dependencies between the ConditionalTask and all the child tasks of the common join task of the original plan.

From the original design, it appears that only tasks included in the ConditionalTask are allowed to have dependencies, so I am wondering what the correct alternative implementation would be. Maybe adding a "nop" task inside the ConditionalTask (in addition to the map join task), so that the dependencies are maintained when the map join task is filtered out?
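
The two options can be compared in a toy model. Again, this is not Hive code: the classes, the stage names, the "direct" and "nop" labels, and the simplified depth-first run() are hypothetical and assumed only for illustration. In the "direct" variant the conditional task itself points at the downstream stage; in the "nop" variant an extra candidate carrying the same dependency is selected whenever the map join is filtered out.

```python
# Toy model comparing two possible fixes. All names are hypothetical and
# do not correspond to Hive's actual Task / ConditionalTask API.
class Task:
    def __init__(self, name):
        self.name = name
        self.children = []  # tasks scheduled after this one


class ConditionalTask(Task):
    def __init__(self, name, candidates):
        super().__init__(name)
        self.candidates = candidates  # tasks that may be selected at runtime


def run(task, selected, executed):
    """Run `task`, then any selected candidates, then its children."""
    if task.name in executed:
        return  # a stage reachable through two paths runs only once
    executed.append(task.name)
    if isinstance(task, ConditionalTask):
        for candidate in task.candidates:
            if candidate.name in selected:
                run(candidate, selected, executed)
    for child in task.children:
        run(child, selected, executed)


def build_plan(fix):
    common_join = Task("common_join")
    map_join = Task("skew_map_join")
    downstream = Task("downstream_stage")
    conditional = ConditionalTask("conditional", [map_join])
    common_join.children.append(conditional)
    map_join.children.append(downstream)
    if fix == "direct":
        # Current workaround: the conditional task depends on the
        # downstream stage directly.
        conditional.children.append(downstream)
    elif fix == "nop":
        # Proposed alternative: a do-nothing candidate that carries the
        # same dependency as the map join.
        nop = Task("nop")
        nop.children.append(downstream)
        conditional.candidates.append(nop)
    return common_join


def executed_stages(fix, skew_detected):
    # Assumption: exactly one candidate is chosen at runtime, and the nop
    # task is chosen when no skew is detected.
    selected = {"skew_map_join"} if skew_detected else {"nop"}
    executed = []
    run(build_plan(fix), selected, executed)
    return executed


for fix in ("direct", "nop"):
    for skew in (True, False):
        print(fix, skew, executed_stages(fix, skew))
```

In this model both variants reach the downstream stage whether or not skew is detected. The "nop" variant keeps every dependency inside the ConditionalTask, at the cost of one extra stage that does nothing.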

Thanks,
Adrian



On 11/15/2013 10:20 PM, Adrian Popescu wrote:

2. In my experiments I also evaluate skewed joins. I enable skew joins through "hive.optimize.skewjoin" and run the same TPC-H query 5. The skew join is not actually triggered, as the number of rows with the same key is less than "hive.skewjoin.key". Hence, the map join corresponding to the skewed join is filtered out at runtime, but unfortunately all the other stages are filtered out as well, so no result is generated. If I disable the skew join optimization, the query runs with common joins only and returns the correct result.

I believe this is a bug that occurs when the skew join optimization is enabled but not triggered. Has anyone experienced the same problem with skew joins on queries with multiple map-reduce joins? I attach the explain plan. Essentially, only stages 6 and 22 are executed. Everything else is skipped silently, with no output generated and no error in "hive.log". Similar behaviour is observed for other TPC-H queries.

Many thanks,
Adrian




--
Adrian
