Hi Abhishek

You can have a look at join optimizations as well as group by optimizations.

Join optimization - Based on your data sets you can go with a map side join or 
a bucketed map join.
To enable map join -> set hive.auto.convert.join = true;

To enable bucketed map join -> set hive.optimize.bucketmapjoin = true;
(The prerequisite here is that both tables are bucketed on the join column.)
If the data in the buckets is also sorted, you can go with a sort merge join as 
well. For that you need to set the following properties:
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;

For details you can refer to the following URL:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
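
As a rough sketch of the bucketed and sorted setup described above: the table 
and column names (customers, orders, customer_id and so on) and the bucket 
count of 32 are only assumptions for illustration, and hive.enforce.bucketing / 
hive.enforce.sorting are assumed to be needed on your Hive version so that the 
inserts actually honor the bucket definition.

```sql
-- Hypothetical tables, both bucketed and sorted on the join column.
CREATE TABLE customers_bucketed (
  customer_id BIGINT,
  name        STRING
)
CLUSTERED BY (customer_id) SORTED BY (customer_id ASC) INTO 32 BUCKETS;

CREATE TABLE orders_bucketed (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DOUBLE
)
CLUSTERED BY (customer_id) SORTED BY (customer_id ASC) INTO 32 BUCKETS;

-- Make the inserts write data according to the bucket and sort definition.
set hive.enforce.bucketing = true;
set hive.enforce.sorting = true;

-- "customers" and "orders" are the assumed unbucketed source tables.
INSERT OVERWRITE TABLE customers_bucketed
SELECT customer_id, name FROM customers;

INSERT OVERWRITE TABLE orders_bucketed
SELECT order_id, customer_id, amount FROM orders;

-- Properties from the mail above, then the join itself.
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;

SELECT /*+ MAPJOIN(c) */ o.order_id, c.name, o.amount
FROM orders_bucketed o
JOIN customers_bucketed c ON (o.customer_id = c.customer_id);
```

Running EXPLAIN on the final query should show whether the sort merge join is 
actually picked up for your tables.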


Group By optimization - You can go ahead with a few group by optimizations as 
well. A few pointers are here:
http://mail-archives.apache.org/mod_mbox/hive-user/201209.mbox/%3cb55ff166-239e-4e39-bf92-3ae59eb78...@gmail.com%3E



Hive Indexes - Join and group by get optimized better with buckets. Based on 
your queries you need to determine in advance how your tables should be 
bucketed. Indexing also gives you a good performance advantage for queries 
that involve group by and where. Join optimization using indexes is in progress:
https://issues.apache.org/jira/browse/HIVE-2845
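
A minimal sketch of what indexing for a where filter could look like. The table 
"orders", its columns and the index name are hypothetical, and this assumes a 
Hive release that still ships indexes (they were removed in Hive 3.0).

```sql
-- Hypothetical table "orders" with a column that is often used in WHERE.
CREATE INDEX idx_orders_customer
ON TABLE orders (customer_id)
AS 'COMPACT'
WITH DEFERRED REBUILD;

-- The index is empty until it is rebuilt.
ALTER INDEX idx_orders_customer ON orders REBUILD;

-- Let the optimizer use the index for filter predicates.
set hive.optimize.index.filter = true;

SELECT order_id, amount
FROM orders
WHERE customer_id = 12345;
```

Whether the index is actually used depends on your Hive version and on the 
query, so it is worth comparing run times with and without it.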



RC file or Sequence File is a choice to be made based on the query patterns. If 
you are querying only a few columns then RC files give you a performance edge, 
but if the queries span pretty much all columns then use the more 
generalized Sequence Files.
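
To make the choice concrete, here is a small sketch with hypothetical table and 
column names. The storage format is fixed when the table is created, so the 
data has to be copied over from the existing table ("orders" is assumed here).

```sql
-- Columnar storage: suits queries that read only a few columns.
CREATE TABLE orders_rc (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DOUBLE
)
STORED AS RCFILE;

-- Row-oriented storage: suits queries that read most columns.
CREATE TABLE orders_seq (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DOUBLE
)
STORED AS SEQUENCEFILE;

INSERT OVERWRITE TABLE orders_rc
SELECT order_id, customer_id, amount FROM orders;

INSERT OVERWRITE TABLE orders_seq
SELECT order_id, customer_id, amount FROM orders;

-- Touches two of the three columns, so the RC file table only needs to
-- read those columns.
SELECT customer_id, SUM(amount)
FROM orders_rc
GROUP BY customer_id;
```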


 

Regards,
Bejoy KS


________________________________
 From: Abhishek <abhishek.dod...@gmail.com>
To: Hive <user@hive.apache.org> 
Sent: Thursday, September 27, 2012 7:03 PM
Subject: Performance tuning in hive
 
Hi all,

I am trying to increase the performance of some queries in Hive. The queries 
mostly contain left outer join, group by, conditional checks and union all. I 
have overridden some properties in the Hive shell:

set io.sort.mb=512;
set io.sort.factor=100;
set mapred.child.java.opts=-Xmx2048m;
set hive.map.aggr=true;
set hive.exec.parallel=true;
set mapred.job.reuse.jvm.num.tasks=-1;
set mapred.map.tasks.speculative.execution=false;
set hive.mapred.reduce.tasks.speculative.execution=false;

I got some performance gain.

I still want to improve the performance of these queries.

Which of the following gives me better performance?

Rcfile
Indexing
Bucketing
Sequence file 
Combination of above

Or 

Some configuration parameter tuning

Which one of the above yields the best performance?

Thanks in advance.

Regards
Abhi
