Thanks, that's very helpful!

On Sat, Aug 2, 2014 at 12:47 PM, Lefty Leverenz <[email protected]> wrote:

> How do indexes work in Hive?
>>
>
> See the Indexes design doc
> <https://cwiki.apache.org/confluence/display/Hive/IndexDev> in the Hive
> wiki, although it hasn't been updated.
>
> -- Lefty
>
>
> On Sat, Aug 2, 2014 at 2:07 AM, chandra Reddy Bogala <
> [email protected]> wrote:
>
>> How do indexes work in Hive? I thought file formats like ORC have
>> indexes in each block, but not a separate index that can help query
>> performance.
>> Thanks,
>> Chandra
>>
>>
>> On Fri, Aug 1, 2014 at 9:10 AM, Devopam Mittra <[email protected]> wrote:
>>
>>> Please try the following approach and let me know if you are not getting
>>> better performance:
>>>
>>> 1. Ensure indexes are present on the dst and src columns in the
>>> respective tables.
>>> 2. Create a subset first by joining r2 and r3 (i.e., r3.src > r2.src) in a
>>> physical table, and then create an index on its new src column as well.
>>> 3. Join this to r1.
>>>
>>> If this approach works well, then try out the WITH SELECT ... using the
>>> same approach, just without creating a physical intermediate table.
>>>
>>> Hope it helps.
>>>
>>> regards
>>> Dev
>>>
>>>
>>>
>>>
>>> On Fri, Aug 1, 2014 at 12:58 AM, Firas Abuzaid <[email protected]>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> We're running various "triangle" join queries on Hive 0.9.0, and we're
>>>> wondering if we can get any better performance. Here's the query we're
>>>> running:
>>>>
>>>> SELECT count(*)
>>>> FROM table r1 JOIN table r2 ON (r1.dst = r2.src) JOIN table r3 ON
>>>> (r2.dst = r3.src AND r3.dst = r1.src)
>>>> WHERE r1.src < r2.src AND r2.src < r3.src;
>>>>
>>>> We're currently passing the following tuning parameters as well:
>>>>
>>>> set mapred.map.tasks=120;
>>>> set mapred.reduce.tasks=120;
>>>> set mapred.tasktracker.map.tasks.maximum=8;
>>>> set mapred.tasktracker.reduce.tasks.maximum=8;
>>>> set mapred.child.java.opts=-Xmx5120m;
>>>>
>>>> The dataset we're using has 5 million nodes and 70 million edges, and
>>>> most of our time is spent on garbage collection. We have about 30 machines
>>>> in our cluster, and each machine has 45GB of RAM. Any thoughts on how we
>>>> can improve performance? Thanks in advance!
>>>>
>>>
>>>
>>>
>>> --
>>> Devopam Mittra
>>> Life and Relations are not binary
>>>
>>
>>
>
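
For anyone who wants to check the rewrite before trying it on the cluster, here is a minimal sketch of the three steps suggested in the thread, compared against the original single-statement query on a toy edge list. It assumes SQLite (via Python's standard sqlite3 module) as a stand-in for Hive, so it only checks that the two forms return the same count; it says nothing about Hive performance. The names edges, r23 and r23_src are hypothetical, and Hive's index DDL differs from the SQLite statement used here.

```python
import sqlite3

# Toy directed edge list (hypothetical data): two 3-cycles plus one extra edge.
DEMO_EDGES = [(1, 2), (2, 3), (3, 1), (2, 4), (4, 5), (5, 2), (1, 4)]


def make_demo_db(edges=DEMO_EDGES):
    """Build an in-memory database with one edge table, indexed on src and dst."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edges (src INTEGER, dst INTEGER)")
    conn.executemany("INSERT INTO edges VALUES (?, ?)", edges)
    # Step 1 from the thread: indexes on the join columns.
    conn.execute("CREATE INDEX edges_src ON edges (src)")
    conn.execute("CREATE INDEX edges_dst ON edges (dst)")
    return conn


def count_triangles_direct(conn):
    """The query from the original post, with `table` renamed to `edges`."""
    return conn.execute("""
        SELECT count(*)
        FROM edges r1
        JOIN edges r2 ON (r1.dst = r2.src)
        JOIN edges r3 ON (r2.dst = r3.src AND r3.dst = r1.src)
        WHERE r1.src < r2.src AND r2.src < r3.src
    """).fetchone()[0]


def count_triangles_staged(conn):
    """Steps 2 and 3: materialize the r2-r3 join, index it, then join to r1."""
    conn.execute("DROP TABLE IF EXISTS r23")
    # Step 2: two-edge paths r2 -> r3 that already satisfy r2.src < r3.src.
    conn.execute("""
        CREATE TABLE r23 AS
        SELECT r2.src AS src, r3.dst AS dst
        FROM edges r2
        JOIN edges r3 ON (r2.dst = r3.src)
        WHERE r2.src < r3.src
    """)
    conn.execute("CREATE INDEX r23_src ON r23 (src)")
    # Step 3: close the triangle with r1 and apply the remaining filter.
    return conn.execute("""
        SELECT count(*)
        FROM edges r1
        JOIN r23 ON (r1.dst = r23.src AND r23.dst = r1.src)
        WHERE r1.src < r23.src
    """).fetchone()[0]


if __name__ == "__main__":
    conn = make_demo_db()
    print(count_triangles_direct(conn), count_triangles_staged(conn))
```

Both functions should agree on any edge list, since the staged version only moves the r2.src < r3.src filter into the intermediate table and leaves the other conditions on the final join.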
