How does indexes work in hive? I thought file formats like ORC have indexes in each block. But not a separate index that can help query performance. Thanks, Chandra
On Fri, Aug 1, 2014 at 9:10 AM, Devopam Mittra <devo...@gmail.com> wrote: > Please try the following approach and let me know if you are not getting > better performance: > > 1. Ensure indexes are present on dst , rsc columns in the respective > tables. > 2. Create a subset first taking r2 and r2 (i.e.: r3.src > r2.src) in a > physical table, and then create index on its new src column as well > 3. Join this to r1 > > If this approach works well, then try out the WITH SELECT ... using the > same approach , just no physical intermediate table will be created. > > Hope it helps.. > > regards > Dev > > > > > On Fri, Aug 1, 2014 at 12:58 AM, Firas Abuzaid <fabuz...@stanford.edu> > wrote: > >> Hi, >> >> We're running various "triangle" join queries on Hive 0.9.0, and we're >> wondering if we can get any better performance. Here's the query we're >> running: >> >> SELECT count(*) >> FROM table r1 JOIN table r2 ON (r1.dst = r2.src) JOIN table r3 ON (r2.dst >> = r3.src AND r3.dst = r1.src) >> WHERE r1.src < r2.src AND r2.src < r3.src; >> >> We're currently passing the following tuning parameters as well: >> >> set mapred.map.tasks=120; >> set mapred.reduce.tasks=120; >> set mapred.tasktracker.map.tasks.maximum=8; >> set mapred.tasktracker.reduce.tasks.maximum=8; >> set mapred.child.java.opts=-Xmx5120m; >> >> The dataset we're using has 5 million nodes and 70 million edges, and >> most of our time is spent on garbage collection. We have about 30 machines >> in our cluster, and each machine has 45GB of RAM. Any thoughts on how we >> can improve performance? Thanks in advance! >> > > > > -- > Devopam Mittra > Life and Relations are not binary >