> select small.* from small s left join large l on s.id <http://s.id/> =
>l.id <http://l.id/> where l.id <http://l.id/> is null;
...
> We simply want to load the 81K rows in to RAM, then for each row in
>large, check the small hash table and if it the row in small is not in
>large, then add it to large.

That seems like a fair description of the problem with the approach, but
it is fairly hard to do left outers in a distributed fashion with
correctness.

Imagine I load partition #1 from the big table + the small table into a
task - how would I know partiton #2 doesn't have a key from the small
hashtable. So the system cannot output any rows in the naive version of
this problem until it compares the small table to all parts of the big
table.

That I think was Sergey's point & I don't know of an easy way around that
yet.

> The business case is loading only new rows into a large fact table.  The
>new rows are the ones that are small in number.

That however is much easier - because that's better written as.

insert into large select * from small where id NOT IN (select id from
large);

FYI, we want people to stop writing these sort of queries once the MERGE
syntax lands in HIVE - https://issues.apache.org/jira/browse/HIVE-10924


The trick is to rewrite the NOT IN condition as two set operations - inner
join + remainder left outer.

Find all ids which already exist in bigtable with an inner map-join.
Remove all those ids from the small table & insert the remainder.

explain rewrite select remainder.* from small remainder where id not in
(select id from small, large where small.id = large.id);

That query, you will find will run much faster than the query you're
currently using. 

If you can, please send the "explain rewrite" & I can probably fine-tune
this approach further.

If you're using Tez, you might want to try out the custom vertex managers
as well

set hive.vectorized.execution.mapjoin.minmax.enabled=true;

set hive.vectorized.execution.mapjoin.native.fast.hashtable.enabled=true;
set hive.optimize.dynamic.partition.hashjoin=true;

Those params should allow you to scale up a map-join to ~100x the
available RAM (aimed at LLAP memory utilization).
 
Cheers,
Gopal


Reply via email to