Hi Gopal,

I saw the log files and the hash table information in them. Thanks.
Also, I enforced shuffle hash join (the options I am toggling are listed in the P.S. below). I had a couple of questions around it:

1. In the query plan it still says Map Join Operator (I would have expected it to be named as a reduce-side operator).

2. The edges in this query plan were named custom_simple_edge: is this what indicates that the sorting of mapper inputs is bypassed?

3. "hive.tez.auto.reducer.parallelism=true" is not taking effect for shuffle hash join. With the same input tables, merge join (shuffle sort merge join) took 1009 reducers with auto reducer parallelism turned off and 337 reducers with it turned on. With shuffle hash join, the count does not change from 1009 to 337. Is there something else I need to do to get this optimization working in this case?

I had a few general questions too:

1. What does hive.tez.auto.reducer.parallelism do -- does it only reduce the number of reducers based on the actual size of the mapper output, or does it do more? I ask because, as mentioned above, in the sort merge join case, if I manually set the number of reduce tasks to 337 (using the mapred.reduce.tasks parameter), the execution time does not improve as much as when auto reducer parallelism picks 337 by itself.

2. I did not understand the intuition behind setting hive.mapjoin.hybridgrace.hashtable=false (as mentioned in your previous reply). Does the hybrid grace hashtable mean the Hybrid Hybrid Grace Hash Join implementation described at <https://cwiki.apache.org/confluence/display/Hive/Hybrid+Hybrid+Grace+Hash+Join%2C+v1.0>? If it is set to true, the hash table is created with multiple partitions; if it is set to false, is the hash table created as a single hash table? Isn't the true case better, since it can handle the case where the hash table does not fit in memory, and the lookups are against smaller partitions? I ran both cases (with the grace hashtable set to true and to false) and did not see any difference in execution time -- maybe because my input size was fairly small in that case.

3. In general, for a map join in cluster mode, are these the actual steps followed in Hive/Tez?

   a. *Hash table generation*: Partitioned hash tables of the small table are created across multiple containers. Each container deals with a part of the small table and builds the hash table for that part in 16 partitions. If any partition cannot fit in memory, it is spilled to disk (with only the disk file and no match file, since no matching with the big table happens at this stage).

   b. *Broadcast of hash table*: All the partitions of all the parts of the small table, including the ones spilled to disk, are serialized and sent to all the containers of the second map (the one that scans the big table).

   c. *Join operator*: The big table is scanned in each of those mappers against the entire hash table of the small table, and the join result is produced.

   Where does the rebuilding of the spilled hash table partitions happen? Is it during the second map phase, where the join with the bigger table happens?

Apologies for the long list of questions, but knowing this would be very helpful to me.

Thanks in advance,
Ross
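P.S. For reference, these are roughly the options I am toggling for the shuffle hash join runs. This is reconstructed from my session, so please read it as a sketch of the setup rather than the exact script; the hive.execution.engine and hive.auto.convert.join lines are just the Tez + join-conversion baseline I keep on, and the commented-out line was only used for the manual reducer-count experiment:

set hive.execution.engine=tez;
set hive.auto.convert.join=true;
set hive.optimize.dynamic.partition.hashjoin=true;
set hive.vectorized.execution.reduce.enabled=true;
set hive.vectorized.execution.mapjoin.native.fast.hashtable.enabled=true;
set hive.mapjoin.hybridgrace.hashtable=false;   -- flipped to true for the comparison run
set hive.tez.auto.reducer.parallelism=true;     -- set to false for the 1009-reducer baseline
-- set mapred.reduce.tasks=337;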
On Mon, Jun 27, 2016 at 7:25 PM, Gopal Vijayaraghavan <gop...@apache.org> wrote:

> > 1. OOM condition -- I get the following error when I force a map join
> > in hive/tez with low container size and heap size:
> > "java.lang.OutOfMemoryError: Java heap space". I was wondering what is
> > the condition which leads to this error.
>
> You are not modifying the noconditionaltask size to match the Xmx at all.
>
> hive.auto.convert.join.noconditionaltask.size=(Xmx - io.sort.mb)/3.0;
>
> > 2. Shuffle Hash Join -- I am using hive 2.0.1. What is the way to force
> > this join implementation? Is there any documentation regarding the same?
>
> <https://issues.apache.org/jira/browse/HIVE-10673>
>
> For full-fledged speed-mode, do
>
> set hive.vectorized.execution.reduce.enabled=true;
> set hive.optimize.dynamic.partition.hashjoin=true;
> set hive.vectorized.execution.mapjoin.native.fast.hashtable.enabled=true;
> set hive.mapjoin.hybridgrace.hashtable=false;
>
> > 3. Hash table size: I use "--hiveconf hive.root.logger=INFO,console"
> > for seeing logs. I do not see the hash table size in these logs.
>
> No, the hashtables are no longer built on the gateway nodes - that used
> to be a single point of failure when 20-25 users are connected via the
> same box.
>
> The hashtable logs are on the task side (in this case, I would guess Map
> 2's logs would have it). The output is from a log line which looks like
>
> yarn logs -applicationId <app-id> | grep Map.*metrics
>
> > Map 1  3  0  0  37.11  65,710  1,039  15,000,000  15,000,000
>
> So you have 15 million keys going into a single hashtable? The broadcast
> output rows are fed into the hashtable on the other side.
>
> The map-join sort of runs out of steam after about ~4 million entries - I
> would guess for your scenario setting the noconditional size to 8388608
> (~8 MB) might trigger the good path.
>
> Cheers,
> Gopal
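One more sanity check on the noconditionaltask sizing formula quoted above, using illustrative numbers rather than my actual configuration: with, say, Xmx = 4096 MB and io.sort.mb = 512, the formula gives (4096 - 512) / 3 ≈ 1194.7 MB, so the setting would be roughly

set hive.auto.convert.join.noconditionaltask.size=1252698795;

taking the value in bytes, as in your 8388608 (~8 MB) example. Please correct me if I have the arithmetic or the units wrong.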