RE: Hive Queries Performance Tuning - Map side joins, Map side aggregations, Partitioning/Clustering

Ladda, Anand Tue, 03 Apr 2012 06:33:46 -0700

Thanks Bejoy and Nitin. I've read through the join presentation by Namit Jain 
and Liyin Tang from Facebook and got some ideas on how to improve the join 
performance.


* I understand how map joins work, but I am not clear on the workflow of
bucketed map joins. Is having map join enabled a prerequisite for bucketed map
joins, i.e., do I need both set hive.auto.convert.join=true; and set
hive.optimize.bucketmapjoin=true; for bucketed map joins to work?
From what I understand, bucketed map joins are meant for the scenario where
neither table in the join is "small enough" for a regular map join. In that
case, if the tables are bucketed on the same columns (and the bucket counts are
multiples of each other), you can use the bucketmapjoin technique to
improve performance. Is this accurate?

* Also, you mention improving the performance of "group by" queries. Are you
referring to the use of map-side aggregation? Are there any resources you can
point me to where I can study this further?
Thanks
Anand
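
For reference, the bucketed map join setup asked about above can be sketched
roughly as follows. This is a minimal illustration, not a confirmed answer to
the question: the table and column names (orders_bucketed, customers_bucketed,
customer_id, qty_sold, region) and the bucket counts are hypothetical, and the
explicit MAPJOIN hint follows the form shown in the Hive join documentation.
Whether hive.auto.convert.join must also be enabled may depend on the Hive
version in use.

```sql
-- Hypothetical tables, both bucketed on the join column. The bucket counts
-- (32 and 8) are chosen so that one is a multiple of the other.
CREATE TABLE orders_bucketed (order_id INT, customer_id INT, qty_sold INT)
CLUSTERED BY (customer_id) INTO 32 BUCKETS;

CREATE TABLE customers_bucketed (customer_id INT, region STRING)
CLUSTERED BY (customer_id) INTO 8 BUCKETS;

-- Ask Hive to honor the declared bucketing when the tables are populated
-- with INSERT ... SELECT (the load statements are omitted here).
set hive.enforce.bucketing = true;

-- Enable the bucketed map join optimization.
set hive.optimize.bucketmapjoin = true;

-- With the hint, a mapper working on one bucket of orders_bucketed only
-- needs to load the matching bucket of customers_bucketed, not the whole table.
SELECT /*+ MAPJOIN(c) */ o.customer_id, c.region, o.qty_sold
FROM orders_bucketed o
JOIN customers_bucketed c ON o.customer_id = c.customer_id;
```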


From: Bejoy Ks [mailto:bejoy...@yahoo.com]
Sent: Sunday, April 01, 2012 5:35 PM
To: user@hive.apache.org
Subject: Re: Hive Queries Performance Tuning - Map side joins, Map side 
aggregations, Partitioning/Clustering

Anand
     You can optimize pretty much all Hive queries. Which optimizations you
need depends on your queries. For example, Group By has some specific ways to
be optimized. Sometimes Distribute By comes in handy for optimizing some
queries. Skew joins are good for balancing the reducer loads, and so on.
     Map joins are used if one of the tables involved in the join is small.
For medium-sized bucketed tables you can go in for a bucketed map join (subject
to some conditions on the number of buckets and on how the bucketed columns
relate to the join columns).

Regards
Bejoy KS
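
The optimizations mentioned above correspond to a handful of Hive settings and
clauses. The following is a sketch only, reusing the orderdetailrcfile table
and columns from the original question; the threshold value is illustrative
rather than a recommendation, and the right choices depend on the data and the
Hive version.

```sql
-- Group By on skewed keys: the plan gets an extra MapReduce stage that first
-- spreads rows randomly across reducers for partial aggregation.
set hive.groupby.skewindata = true;

-- Skew join: join keys with more rows than the threshold are processed in a
-- follow-up map join job instead of overloading a single reducer.
set hive.optimize.skewjoin = true;
set hive.skewjoin.key = 100000;  -- illustrative threshold

-- Distribute By: send all rows with the same emp_id to the same reducer,
-- then sort the rows within each reducer.
SELECT emp_id, customer_id, qty_sold
FROM orderdetailrcfile
DISTRIBUTE BY emp_id
SORT BY emp_id, customer_id;
```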

________________________________
From: "Ladda, Anand" <lan...@microstrategy.com>
To: "user@hive.apache.org" <user@hive.apache.org>
Sent: Sunday, April 1, 2012 11:59 PM
Subject: Hive Queries Performance Tuning - Map side joins, Map side 
aggregations, Partitioning/Clustering


I am trying to understand the options/settings available to tune the
performance of Hive queries. I have seen the benefits of map-side joins and
partitioning/clustering. However, I have yet to see the impact map-side
aggregation has on query performance. I ran the query below with and without
map-side aggregation turned on and did not see much difference in the
execution times. The raw data in this partition is about 5.5 million. I am
looking for some pointers on what types of queries benefit from map-side
aggregation.


set hive.auto.convert.join=false;


set hive.map.aggr=false;

Non-partitioned, non-clustered single table with where clause on date and no 
map side aggregation

select a11.emp_id, count(1), count (distinct a11.customer_id), 
sum(a11.qty_sold) from orderdetailrcfile a11 where order_date ='01-01-2008' 
group by a11.emp_id;

400 secs


set hive.map.aggr=true;

Non-partitioned, non-clustered single table with where clause on date and
map side aggregation

select a11.emp_id, count(1), count (distinct a11.customer_id), 
sum(a11.qty_sold) from orderdetailrcfile a11 where order_date ='01-01-2008' 
group by a11.emp_id;

390 secs
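
One way to investigate the small difference between the two runs is to compare
the query plans and review the settings that control map-side aggregation. The
sketch below reuses the query from this message; the values shown are believed
to be the documented defaults and should be checked against the Hive version
in use. One possible factor (an assumption, not a measured result) is the
count(distinct ...): with it, the map-side hash is effectively keyed on
(emp_id, customer_id) rather than on emp_id alone, so far fewer rows may
collapse on the map side.

```sql
set hive.map.aggr = true;

-- Fraction of mapper memory the aggregation hash table may use.
set hive.map.aggr.hash.percentmemory = 0.5;

-- After this many input rows, Hive checks how much the hash aggregation
-- is actually reducing the data ...
set hive.groupby.mapaggr.checkinterval = 100000;

-- ... and turns map-side hash aggregation off if the ratio of hash table
-- entries to input rows is above this value.
set hive.map.aggr.hash.min.reduction = 0.5;

-- Compare this plan with the one produced under hive.map.aggr=false.
EXPLAIN
select a11.emp_id, count(1), count(distinct a11.customer_id),
sum(a11.qty_sold) from orderdetailrcfile a11 where order_date = '01-01-2008'
group by a11.emp_id;
```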


Also, is there any reason not to turn on map-side joins all the time? In my
tests I have always seen the performance either stay the same or improve with
map-side joins turned on. Are there any other parameters or Hive features that
can help improve the performance of Hive queries?
Thanks
Anand
