Hi, I got a chance to re-run this query today and it does auto-reduce the new CUSTOM_EDGE join as well.
However, it does it too early, before it has enough information about both sides of the join.

TPC-DS Query79 at 1 TB scale generates a CUSTOM_EDGE between the ms alias and the customer table. They are 13.7 million and 12 million rows each, so the implementation is not causing any trouble right now, because the two sides are roughly equal.

This might be a feature I disable for now, since the ShuffleVertexManager has no idea which side is destined for the hashtable build side and might squeeze the auto-reducer tighter than it should, causing a possible OOM.

Plus, the size that the system sees is actually the shuffle size, so it is post-compression and varies with the compressibility of the data (for instance, clickstreams compress really well due to the large number of common strings). A pre-requisite for that last problem would be to fix <https://issues.apache.org/jira/browse/TEZ-2962>, which tracks pre-compression sizes.

I'll have to think a bit more about the other part of the problem.

Cheers,
Gopal

On 7/6/16, 12:52 PM, "Gopal Vijayaraghavan" <go...@hortonworks.com on behalf of gop...@apache.org> wrote:

>
>> I tried running the shuffle hash join with auto reducer parallelism
>> again. But it didn't seem to take effect. With merge join and auto
>> reduce parallelism on, the number of reducers drops from 1009 to 337,
>> but I didn't see that change in the case of the shuffle hash join.
>> Should I be doing something more?
>
> In your runtime, can you cross-check the YARN logs for
>
> "Defer scheduling tasks"
>
> and
>
> "Reduce auto parallelism for vertex"
>
> The kick-off point for the two join algorithms is slightly different.
>
> The merge-join uses stats from both sides (since until all tasks on both
> sides are complete, no join ops can be performed).
>
> The dynamic hash-join uses stats from only the hashtable side to do
> this, since it can start producing rows immediately after the hashtable
> side is complete & at least 1 task from the big-table side has completed.
>
> There's a scenario where the small table is too small even when it is
> entirely complete, at which point the system just goes ahead with the
> reduce tasks without any reduction (which would be faster than waiting
> for the big table to hit slow-start before doing that).
>
> Cheers,
> Gopal
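P.S. For anyone following along, the reduction logic being discussed is roughly the following. This is a hedged sketch, not the actual Tez ShuffleVertexManager code; the function names and the 256 MB per-task target are made-up illustrations. It shows why a post-compression shuffle size can squeeze the reducer count harder than the raw data warrants:

```python
import math

def auto_reduce_parallelism(observed_shuffle_bytes, configured_tasks,
                            desired_task_input_bytes=256 * 1024 * 1024):
    """Pick a reduced task count so each reducer gets roughly
    desired_task_input_bytes of input. Only ever shrinks parallelism."""
    total = sum(observed_shuffle_bytes)
    desired = max(1, math.ceil(total / desired_task_input_bytes))
    return min(configured_tasks, desired)

# The catch from the mail above: the observed sizes are post-compression.
# Highly compressible data (e.g. clickstreams) reports a small shuffle
# size, so the auto-reducer squeezes tighter than the uncompressed bytes
# warrant -- risky when one side is destined for a hashtable build.
compressed = [64 * 1024 * 1024] * 16  # 1 GiB observed after compression
print(auto_reduce_parallelism(compressed, configured_tasks=1009))  # -> 4
```

With a 10x compression ratio, those 16 partitions are really ~10 GiB uncompressed, so the "right" answer would be closer to 40 tasks than 4; that gap is what tracking pre-compression sizes (TEZ-2962) would close.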