Hi Gopal/All, Yep, I absolutely understand the limitation of what we are trying to do. We will try the settings you suggested.
Thanks, Steve On Tue, Sep 22, 2015 at 1:44 PM, Gopal Vijayaraghavan <gop...@apache.org> wrote: > > > select small.* from small s left join large l on s.id <http://s.id/> = > >l.id <http://l.id/> where l.id <http://l.id/> is null; > ... > > We simply want to load the 81K rows in to RAM, then for each row in > >large, check the small hash table and if it the row in small is not in > >large, then add it to large. > > That seems like a fair description of the problem with the approach, but > it is fairly hard to do left outers in a distributed fashion with > correctness. > > Imagine I load partition #1 from the big table + the small table into a > task - how would I know partiton #2 doesn't have a key from the small > hashtable. So the system cannot output any rows in the naive version of > this problem until it compares the small table to all parts of the big > table. > > That I think was Sergey's point & I don't know of an easy way around that > yet. > > > The business case is loading only new rows into a large fact table. The > >new rows are the ones that are small in number. > > That however is much easier - because that's better written as. > > insert into large select * from small where id NOT IN (select id from > large); > > FYI, we want people to stop writing these sort of queries once the MERGE > syntax lands in HIVE - https://issues.apache.org/jira/browse/HIVE-10924 > > > The trick is to rewrite the NOT IN condition as two set operations - inner > join + remainder left outer. > > Find all ids which already exist in bigtable with an inner map-join. > Remove all those ids from the small table & insert the remainder. > > explain rewrite select remainder.* from small remainder where id not in > (select id from small, large where small.id = large.id); > > That query, you will find will run much faster than the query you're > currently using. > > If you can, please send the "explain rewrite" & I can probably fine-tune > this approach further. > > If you're using Tez, you might want to try out the custom vertex managers > as well > > set hive.vectorized.execution.mapjoin.minmax.enabled=true; > > set hive.vectorized.execution.mapjoin.native.fast.hashtable.enabled=true; > set hive.optimize.dynamic.partition.hashjoin=true; > > Those params should allow you to scale up a map-join to ~100x the > available RAM (aimed at LLAP memory utilization). > > Cheers, > Gopal > > >