> select small.* from small s left join large l on s.id <http://s.id/> = >l.id <http://l.id/> where l.id <http://l.id/> is null; ... > We simply want to load the 81K rows in to RAM, then for each row in >large, check the small hash table and if it the row in small is not in >large, then add it to large.
That seems like a fair description of the problem with the approach, but it is fairly hard to do left outers in a distributed fashion with correctness. Imagine I load partition #1 from the big table + the small table into a task - how would I know partiton #2 doesn't have a key from the small hashtable. So the system cannot output any rows in the naive version of this problem until it compares the small table to all parts of the big table. That I think was Sergey's point & I don't know of an easy way around that yet. > The business case is loading only new rows into a large fact table. The >new rows are the ones that are small in number. That however is much easier - because that's better written as. insert into large select * from small where id NOT IN (select id from large); FYI, we want people to stop writing these sort of queries once the MERGE syntax lands in HIVE - https://issues.apache.org/jira/browse/HIVE-10924 The trick is to rewrite the NOT IN condition as two set operations - inner join + remainder left outer. Find all ids which already exist in bigtable with an inner map-join. Remove all those ids from the small table & insert the remainder. explain rewrite select remainder.* from small remainder where id not in (select id from small, large where small.id = large.id); That query, you will find will run much faster than the query you're currently using. If you can, please send the "explain rewrite" & I can probably fine-tune this approach further. If you're using Tez, you might want to try out the custom vertex managers as well set hive.vectorized.execution.mapjoin.minmax.enabled=true; set hive.vectorized.execution.mapjoin.native.fast.hashtable.enabled=true; set hive.optimize.dynamic.partition.hashjoin=true; Those params should allow you to scale up a map-join to ~100x the available RAM (aimed at LLAP memory utilization). Cheers, Gopal