Hi Gopal/All,

Yep, I absolutely understand the limitation of what we are trying to do.
We will try the settings you suggested.

Thanks,

Steve

On Tue, Sep 22, 2015 at 1:44 PM, Gopal Vijayaraghavan <gop...@apache.org>
wrote:

>
> > select small.* from small s left join large l on s.id <http://s.id/> =
> >l.id <http://l.id/> where l.id <http://l.id/> is null;
> ...
> > We simply want to load the 81K rows in to RAM, then for each row in
> >large, check the small hash table and if it the row in small is not in
> >large, then add it to large.
>
> That seems like a fair description of the problem with the approach, but
> it is fairly hard to do left outers in a distributed fashion with
> correctness.
>
> Imagine I load partition #1 from the big table + the small table into a
> task - how would I know partiton #2 doesn't have a key from the small
> hashtable. So the system cannot output any rows in the naive version of
> this problem until it compares the small table to all parts of the big
> table.
>
> That I think was Sergey's point & I don't know of an easy way around that
> yet.
>
> > The business case is loading only new rows into a large fact table.  The
> >new rows are the ones that are small in number.
>
> That however is much easier - because that's better written as.
>
> insert into large select * from small where id NOT IN (select id from
> large);
>
> FYI, we want people to stop writing these sort of queries once the MERGE
> syntax lands in HIVE - https://issues.apache.org/jira/browse/HIVE-10924
>
>
> The trick is to rewrite the NOT IN condition as two set operations - inner
> join + remainder left outer.
>
> Find all ids which already exist in bigtable with an inner map-join.
> Remove all those ids from the small table & insert the remainder.
>
> explain rewrite select remainder.* from small remainder where id not in
> (select id from small, large where small.id = large.id);
>
> That query, you will find will run much faster than the query you're
> currently using.
>
> If you can, please send the "explain rewrite" & I can probably fine-tune
> this approach further.
>
> If you're using Tez, you might want to try out the custom vertex managers
> as well
>
> set hive.vectorized.execution.mapjoin.minmax.enabled=true;
>
> set hive.vectorized.execution.mapjoin.native.fast.hashtable.enabled=true;
> set hive.optimize.dynamic.partition.hashjoin=true;
>
> Those params should allow you to scale up a map-join to ~100x the
> available RAM (aimed at LLAP memory utilization).
>
> Cheers,
> Gopal
>
>
>

Reply via email to