Hi Gopal,

Thanks a lot for the update.
I had only one question still hanging: when is the OOM error actually
thrown? With hive.mapjoin.hybridgrace.hashtable set to true, spilling
should be possible, so an OOM error should not occur. Is it thrown when
not even one of the 16 partitions' hash tables fits in memory?

But increasing the partitions to 100 also did not solve the problem.
(This is in the case of a 3G container size and a 5G small-table size. I
have given a high value for hive.auto.convert.join.noconditionaltask.size
so that the broadcast hash join path is picked. I know this is not
advisable, but I am still trying to force it. When I give 100 partitions,
only a 500MB hash table has to fit in memory at a time, but it still
fails with a memory error. A sketch of the settings is at the end of
this mail, below the quoted thread.)

Thanks,
Lalitha

On Thu, Jul 14, 2016 at 11:16 PM, Gopal Vijayaraghavan
<gop...@apache.org> wrote:

> Hi,
>
> I got a chance to re-run this query today, and it does auto-reduce the
> new CUSTOM_EDGE join as well.
>
> However, it does it too early, before it has gathered enough
> information about both sides of the join.
>
> TPC-DS Query79 at 1TB scale generates a CUSTOM_EDGE between the ms
> alias and customer tables.
>
> They're 13.7 million & 12 million rows each - the implementation is
> not causing any trouble right now because they're roughly equal.
>
> This is a feature I might disable for now, since the
> ShuffleVertexManager has no idea which side is destined for the
> hashtable build side & might squeeze the auto-reducer tighter than it
> should, causing a possible OOM.
>
> Plus, the size that the system gets is actually the shuffle size, so
> it is post-compression and varies with the data's compressibility (for
> instance, clickstreams compress really well due to the large number of
> common strings).
>
> A prerequisite for fixing the last problem would be
> <https://issues.apache.org/jira/browse/TEZ-2962>, which tracks
> pre-compression sizes.
>
> I'll have to think a bit more about the other part of the problem.
>
> Cheers,
> Gopal
>
> On 7/6/16, 12:52 PM, "Gopal Vijayaraghavan" <go...@hortonworks.com on
> behalf of gop...@apache.org> wrote:
>
> >> I tried running the shuffle hash join with auto reducer parallelism
> >> again, but it didn't seem to take effect. With merge join and auto
> >> reducer parallelism on, the number of reducers drops from 1009 to
> >> 337, but I didn't see that change in the case of the shuffle hash
> >> join. Should I be doing something more?
> >
> > In your runtime, can you cross-check the YARN logs for
> >
> > "Defer scheduling tasks"
> >
> > and
> >
> > "Reduce auto parallelism for vertex"
> >
> > The kick-off points for the two join algorithms are slightly
> > different.
> >
> > The merge join uses stats from both sides, since no join ops can be
> > performed until all tasks on both sides are complete.
> >
> > The dynamic hash join uses stats from only the hashtable side, since
> > it can start producing rows immediately after the hashtable side is
> > complete & at least one task from the big-table side has completed.
> >
> > There's a scenario where the small table is too small even when it
> > is entirely complete, at which point the system just goes ahead with
> > the reduce tasks without any reduction (which would be faster than
> > waiting for the big table to hit slow-start before doing that).
> >
> > Cheers,
> > Gopal
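
P.S. For reference, here is a minimal sketch of the settings I am
describing above, assuming Hive on Tez. The 3G container and the 100
partitions are the values from this thread; the noconditionaltask
threshold is illustrative, since I only said "a high value" above.

    SET hive.execution.engine=tez;
    SET hive.tez.container.size=3072;   -- ~3G container (value in MB)

    -- Force the broadcast hash join path even though the small table
    -- (~5G) exceeds the container: set the threshold above the table
    -- size. 6000000000 is illustrative; any value above ~5G would do.
    SET hive.auto.convert.join=true;
    SET hive.auto.convert.join.noconditionaltask=true;
    SET hive.auto.convert.join.noconditionaltask.size=6000000000;

    -- Hybrid grace hash join: let partitions spill to disk, and raise
    -- the partition count from the default 16 to 100.
    SET hive.mapjoin.hybridgrace.hashtable=true;
    SET hive.mapjoin.hybridgrace.minnumpartitions=100;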
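
And for the auto-reduce discussion in the quoted thread, the knobs I
understand to be in play - again only a sketch, assuming the "shuffle
hash join" there is the dynamically partitioned hash join behind the
CUSTOM_EDGE, with the stock partition factors spelled out:

    -- Dynamically partitioned hash join (the CUSTOM_EDGE case).
    SET hive.optimize.dynamic.partition.hashjoin=true;

    -- Auto reducer parallelism: the ShuffleVertexManager samples
    -- source output sizes at runtime and may lower the reducer count.
    SET hive.tez.auto.reducer.parallelism=true;
    SET hive.tez.min.partition.factor=0.25;   -- defaults, for clarity
    SET hive.tez.max.partition.factor=2.0;

As the quoted mail notes, the sizes the vertex manager sees are
post-compression shuffle sizes, so until TEZ-2962 these factors operate
on compressed data.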