Re: Tuning Triangle Joins on Hive

chandra Reddy Bogala Fri, 01 Aug 2014 23:08:27 -0700

How do indexes work in Hive? I thought file formats like ORC have indexes
in each block, but not a separate index that can help query performance.
Thanks,
Chandra



On Fri, Aug 1, 2014 at 9:10 AM, Devopam Mittra <devo...@gmail.com> wrote:

> Please try the following approach and let me know if you are not getting
> better performance:
>
> 1. Ensure indexes are present on the dst and src columns in the respective
> tables.
> 2. First create a subset joining r2 and r3 (i.e., r3.src > r2.src) in a
> physical table, and then create an index on its new src column as well.
> 3. Join this to r1.
>
> If this approach works well, then try out the WITH SELECT ... using the
> same approach, except that no physical intermediate table will be created.
>
> Hope it helps..
>
> regards
> Dev
>
>
>
>
> On Fri, Aug 1, 2014 at 12:58 AM, Firas Abuzaid <fabuz...@stanford.edu>
> wrote:
>
>> Hi,
>>
>> We're running various "triangle" join queries on Hive 0.9.0, and we're
>> wondering if we can get any better performance. Here's the query we're
>> running:
>>
>> SELECT count(*)
>> FROM table r1
>> JOIN table r2 ON (r1.dst = r2.src)
>> JOIN table r3 ON (r2.dst = r3.src AND r3.dst = r1.src)
>> WHERE r1.src < r2.src AND r2.src < r3.src;
>>
>> We're currently passing the following tuning parameters as well:
>>
>> set mapred.map.tasks=120;
>> set mapred.reduce.tasks=120;
>> set mapred.tasktracker.map.tasks.maximum=8;
>> set mapred.tasktracker.reduce.tasks.maximum=8;
>> set mapred.child.java.opts=-Xmx5120m;
>>
>> The dataset we're using has 5 million nodes and 70 million edges, and
>> most of our time is spent on garbage collection. We have about 30 machines
>> in our cluster, and each machine has 45GB of RAM. Any thoughts on how we
>> can improve performance? Thanks in advance!
>>
>
>
>
> --
> Devopam Mittra
> Life and Relations are not binary
>
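
The three steps suggested in the quoted message can be sketched in HiveQL
roughly as follows. This is only a sketch under assumptions: the edge table
is called edges here (the original query uses the placeholder name "table"),
and the index names and the intermediate table r2_r3 are made up for
illustration. Hive 0.9.0 accepts only equality conditions in ON clauses, so
the inequality filters stay in WHERE. As far as I know, the WITH syntax only
arrived in Hive 0.13, so on 0.9.0 the physical intermediate table is the
available option.

```sql
-- Step 1: indexes on the join columns of the edge table.
-- "edges" and the index names are hypothetical.
CREATE INDEX edges_src_idx ON TABLE edges (src)
  AS 'COMPACT' WITH DEFERRED REBUILD;
ALTER INDEX edges_src_idx ON edges REBUILD;

CREATE INDEX edges_dst_idx ON TABLE edges (dst)
  AS 'COMPACT' WITH DEFERRED REBUILD;
ALTER INDEX edges_dst_idx ON edges REBUILD;

-- Step 2: materialize the r2/r3 half of the triangle (r2.src < r3.src).
-- r2_r3.src is r2.src and r2_r3.dst is r3.dst from the original query.
CREATE TABLE r2_r3 AS
SELECT r2.src AS src, r3.dst AS dst
FROM edges r2
JOIN edges r3 ON (r2.dst = r3.src)
WHERE r2.src < r3.src;

CREATE INDEX r2_r3_src_idx ON TABLE r2_r3 (src)
  AS 'COMPACT' WITH DEFERRED REBUILD;
ALTER INDEX r2_r3_src_idx ON r2_r3 REBUILD;

-- Step 3: close the triangle by joining the intermediate table to r1.
SELECT count(*)
FROM edges r1
JOIN r2_r3 p ON (r1.dst = p.src AND r1.src = p.dst)
WHERE r1.src < p.src;
```

The final count should match the original three-way query, since the three
filters (r1.dst = r2.src, r3.dst = r1.src, r1.src < r2.src) are all carried
through the intermediate table. Whether the indexes speed up these joins
would need to be measured on the actual cluster.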
