Re: Tuning Triangle Joins on Hive

Firas Abuzaid Wed, 06 Aug 2014 16:43:16 -0700

Thanks, Gopal! That helps a lot!

--Firas



On Wed, Aug 6, 2014 at 2:03 PM, Gopal V <gop...@apache.org> wrote:

> On 7/31/14, 12:28 PM, Firas Abuzaid wrote:
>
>  We're running various "triangle" join queries on Hive 0.9.0, and we're
>> wondering if we can get any better performance. Here's the query we're
>> running:
>>
>> SELECT count(*)
>> FROM table r1 JOIN table r2 ON (r1.dst = r2.src) JOIN table r3 ON (r2.dst
>> =
>> r3.src AND r3.dst = r1.src)
>> WHERE r1.src < r2.src AND r2.src < r3.src;
>>
>
>
> This is an interesting use-case for a JOIN clause.
>
> The main overhead is that the JOIN criteria is materializing JOINs onto
> HDFS.
>
> Neither hive indexes, nor ORC indexes will help you here, unfortunately.
>
> With the latest hive-13, I tried rewriting this to get a performance boost
> with a CTE (which will be streamed through in tez‚ but MR will still write
> it as SequenceFiles to HDFS).
>
> Roughly, you need to remove the implicit filter of r1.src < r2.src &&
> r2.src < r3.src into an independent sub query.
>
> explain
> with r2_tmp as (
> select r2.src as r2_src, r3.dst as r3_dst from r2 join r3 on (r2.dst =
> r3.src) where r2.src < r3.src
> )
> select count(1) from r1 JOIN r2_tmp on (r1.dst = r2_src and r1.src =
> r3_dst)
> where (r1.src < r2_src)
>
> Vectorization+ORC will massively reduce the GC overhead for this as well -
> but for all these you need a modern hive version.
>
> Data organization can give a bigger performance boost as well, because of
> how ORC will do run-length packing of the same valued columns. I'd just
> sort/bucket on src for all of them.
>
> You can see a similar organization for easier big table JOINs here -
> https://github.com/t3rmin4t0r/nyc-taxi-bigdata/blob/master/ddl/orc.sql#L26
>
> HTH.
>
> Cheers,
> Gopal
>
>

Re: Tuning Triangle Joins on Hive

Reply via email to