Re: Tuning Triangle Joins on Hive

Devopam Mittra Thu, 31 Jul 2014 20:41:29 -0700

Please try the following approach and let me know if you are not getting
better performance:

1. Ensure indexes are present on dst , rsc columns in the respective tables.
2. Create a subset first taking r2 and r2 (i.e.: r3.src > r2.src) in a
physical table, and then create index on its new src column as well
3. Join this to r1

If this approach works well, then try out the WITH SELECT ... using the
same approach , just no physical intermediate table will be created.

Hope it helps..

regards
Dev

On Fri, Aug 1, 2014 at 12:58 AM, Firas Abuzaid <fabuz...@stanford.edu>
wrote:

> Hi,
>
> We're running various "triangle" join queries on Hive 0.9.0, and we're
> wondering if we can get any better performance. Here's the query we're
> running:
>
> SELECT count(*)
> FROM table r1 JOIN table r2 ON (r1.dst = r2.src) JOIN table r3 ON (r2.dst
> = r3.src AND r3.dst = r1.src)
> WHERE r1.src < r2.src AND r2.src < r3.src;
>
> We're currently passing the following tuning parameters as well:
>
> set mapred.map.tasks=120;
> set mapred.reduce.tasks=120;
> set mapred.tasktracker.map.tasks.maximum=8;
> set mapred.tasktracker.reduce.tasks.maximum=8;
> set mapred.child.java.opts=-Xmx5120m;
>
> The dataset we're using has 5 million nodes and 70 million edges, and most
> of our time is spent on garbage collection. We have about 30 machines in
> our cluster, and each machine has 45GB of RAM. Any thoughts on how we can
> improve performance? Thanks in advance!
>

-- 
Devopam Mittra
Life and Relations are not binary

Re: Tuning Triangle Joins on Hive

Reply via email to