Please try the following approach and let me know if you are not getting better performance:
1. Ensure indexes are present on dst , rsc columns in the respective tables. 2. Create a subset first taking r2 and r2 (i.e.: r3.src > r2.src) in a physical table, and then create index on its new src column as well 3. Join this to r1 If this approach works well, then try out the WITH SELECT ... using the same approach , just no physical intermediate table will be created. Hope it helps.. regards Dev On Fri, Aug 1, 2014 at 12:58 AM, Firas Abuzaid <fabuz...@stanford.edu> wrote: > Hi, > > We're running various "triangle" join queries on Hive 0.9.0, and we're > wondering if we can get any better performance. Here's the query we're > running: > > SELECT count(*) > FROM table r1 JOIN table r2 ON (r1.dst = r2.src) JOIN table r3 ON (r2.dst > = r3.src AND r3.dst = r1.src) > WHERE r1.src < r2.src AND r2.src < r3.src; > > We're currently passing the following tuning parameters as well: > > set mapred.map.tasks=120; > set mapred.reduce.tasks=120; > set mapred.tasktracker.map.tasks.maximum=8; > set mapred.tasktracker.reduce.tasks.maximum=8; > set mapred.child.java.opts=-Xmx5120m; > > The dataset we're using has 5 million nodes and 70 million edges, and most > of our time is spent on garbage collection. We have about 30 machines in > our cluster, and each machine has 45GB of RAM. Any thoughts on how we can > improve performance? Thanks in advance! > -- Devopam Mittra Life and Relations are not binary