Re: Tuning Triangle Joins on Hive

Gopal V Wed, 06 Aug 2014 14:03:34 -0700

On 7/31/14, 12:28 PM, Firas Abuzaid wrote:

We're running various "triangle" join queries on Hive 0.9.0, and we're
wondering if we can get any better performance. Here's the query we're
running:


SELECT count(*)
FROM table r1 JOIN table r2 ON (r1.dst = r2.src) JOIN table r3 ON (r2.dst =
r3.src AND r3.dst = r1.src)
WHERE r1.src < r2.src AND r2.src < r3.src;



This is an interesting use-case for a JOIN clause.

The main overhead is that the JOIN criteria is materializing JOINs ontoHDFS.


Neither hive indexes, nor ORC indexes will help you here, unfortunately.

With the latest hive-13, I tried rewriting this to get a performanceboost with a CTE (which will be streamed through in tez‚ but MR willstill write it as SequenceFiles to HDFS).

Roughly, you need to remove the implicit filter of r1.src < r2.src &&r2.src < r3.src into an independent sub query.


explain
with r2_tmp as (

select r2.src as r2_src, r3.dst as r3_dst from r2 join r3 on (r2.dst =r3.src) where r2.src < r3.src

)
select count(1) from r1 JOIN r2_tmp on (r1.dst = r2_src and r1.src = r3_dst)
where (r1.src < r2_src)

Vectorization+ORC will massively reduce the GC overhead for this as well- but for all these you need a modern hive version.

Data organization can give a bigger performance boost as well, becauseof how ORC will do run-length packing of the same valued columns. I'djust sort/bucket on src for all of them.

You can see a similar organization for easier big table JOINs here -https://github.com/t3rmin4t0r/nyc-taxi-bigdata/blob/master/ddl/orc.sql#L26


HTH.

Cheers,
Gopal

Re: Tuning Triangle Joins on Hive

Reply via email to