Tuning Triangle Joins on Hive

Firas Abuzaid Thu, 31 Jul 2014 12:29:39 -0700

Hi,

We're running various "triangle" join queries on Hive 0.9.0, and we're
wondering whether we can improve their performance. Here's the query
we're running:


SELECT count(*)
FROM table r1
JOIN table r2 ON (r1.dst = r2.src)
JOIN table r3 ON (r2.dst = r3.src AND r3.dst = r1.src)
WHERE r1.src < r2.src AND r2.src < r3.src;
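
For reference, here is a minimal Python sketch of what the query counts
as written: directed cycles a -> b -> c -> a with a < b < c. The function
name and the in-memory edge list are assumptions made for illustration
only and are not part of our Hive setup. Duplicate edges are kept on
purpose, to mirror the row-level behavior of the joins.

```python
from collections import defaultdict


def count_triangles(edges):
    """Count row triples (r1, r2, r3) that satisfy the query's conditions.

    edges is an iterable of (src, dst) pairs, one pair per table row.
    """
    edges = list(edges)

    # Index rows by src so each join step is a lookup: src -> list of dst.
    out = defaultdict(list)
    for src, dst in edges:
        out[src].append(dst)

    total = 0
    for a, b in edges:              # r1 = (a, b)
        if not a < b:               # r1.src < r2.src, with r2.src = r1.dst
            continue
        for c in out.get(b, []):    # r2 = (b, c), joined on r1.dst = r2.src
            if not b < c:           # r2.src < r3.src, with r3.src = r2.dst
                continue
            # r3 = (c, a): r2.dst = r3.src AND r3.dst = r1.src
            total += out.get(c, []).count(a)
    return total
```

On a small sample of the edge table, this can serve as a sanity check
against the count that Hive returns.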

We're currently passing the following tuning parameters as well:

set mapred.map.tasks=120;
set mapred.reduce.tasks=120;
set mapred.tasktracker.map.tasks.maximum=8;
set mapred.tasktracker.reduce.tasks.maximum=8;
set mapred.child.java.opts=-Xmx5120m;
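
As a back-of-the-envelope check on these settings, the sketch below
computes the worst-case JVM heap demand per machine. It assumes that
every map and reduce slot is occupied at the same time, that each task
JVM grows to its full -Xmx, and that the slot maxima take effect as
written. The helper name is hypothetical, and the 45 GB figure is the
per-machine RAM of our cluster.

```python
def worst_case_heap_gb(map_slots, reduce_slots, heap_mb_per_task):
    """Heap demand in GB if all slots run a task at the full -Xmx."""
    return (map_slots + reduce_slots) * heap_mb_per_task / 1024


# Values taken from the settings above.
demand_gb = worst_case_heap_gb(map_slots=8, reduce_slots=8,
                               heap_mb_per_task=5120)   # 80.0
ram_gb_per_node = 45

# True means the configured heaps could exceed physical memory on a node.
oversubscribed = demand_gb > ram_gb_per_node
```

Under those assumptions the configured heaps add up to 80 GB per node,
which is more than the 45 GB available.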

The dataset we're using has 5 million nodes and 70 million edges, and most
of our time is spent on garbage collection. We have about 30 machines in
our cluster, and each machine has 45GB of RAM. Any thoughts on how we can
improve performance? Thanks in advance!
