Re: Hash table in map join - Hive

Gopal Vijayaraghavan Wed, 06 Jul 2016 12:53:14 -0700

> I tried running the shuffle hash join with auto reducer parallelism
> again. But, it didn't seem to take effect. With merge join and auto
> reduce parallelism on, number of reducers drops from 1009 to 337, but
> didn't see that change in case of shuffle hash join. Should I be doing
> something more?


In your runtime, can you cross-check the YARN logs for

"Defer scheduling tasks"

and 

"Reduce auto parallelism for vertex"

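If it helps, here is a minimal sketch for pulling those two messages out of
logs that have already been saved to local files (for example, the output of
"yarn logs -applicationId <application id>" redirected to a file). Only the
two quoted strings come from the messages above; the script, its function
name and the file arguments are illustrative assumptions, not part of Hive
or Tez.

```python
import sys

# The two log messages to look for; plain substring matching is an assumption
# about how they appear in the aggregated logs.
MARKERS = ("Defer scheduling tasks", "Reduce auto parallelism for vertex")


def find_marker_lines(lines):
    """Return a dict mapping each marker to the log lines that contain it."""
    hits = {marker: [] for marker in MARKERS}
    for line in lines:
        for marker in MARKERS:
            if marker in line:
                hits[marker].append(line.rstrip("\n"))
    return hits


if __name__ == "__main__":
    # Each command-line argument is a path to a saved log file (hypothetical usage).
    for path in sys.argv[1:]:
        with open(path, errors="replace") as log_file:
            for marker, matched in find_marker_lines(log_file).items():
                print(f"{path}: {marker!r}: {len(matched)} line(s)")
                for line in matched:
                    print("    " + line)
```

If "Defer scheduling tasks" shows up for the reducer vertex but "Reduce auto
parallelism for vertex" never does, that would be consistent with the no-reduction
scenario described further down.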

The kick-off points for the two join algorithms are slightly different.

The merge-join uses stats from both sides (since until all tasks on both
sides are complete, no join ops can be performed).

The dynamic hash-join, on the other hand, uses stats from only the hashtable
side, since it can start producing rows immediately after the hashtable side
is complete and at least 1 task from the big-table side has completed.

There's a scenario where the small table is too small even when it is
entirely complete, at which point the system just goes ahead with the reduce
tasks without any reduction (which would be faster than waiting for the
big table to hit slow-start before doing that).

Cheers,
Gopal
