Hi Abhishek,

I don't think PARTITIONED BY and CLUSTERED BY are supported in CTAS.
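Roughly, the non-CTAS route looks like the sketch below. The table names, columns, and bucket count are placeholders I made up for illustration, not anything from your schema, so adjust them to your data:

```sql
-- Hypothetical names: parent_table, bucketed_table, id, name.
-- The bucket count (32) is arbitrary and only for illustration.

-- 1. Create the bucketed table with a plain CREATE TABLE (not CTAS).
CREATE TABLE bucketed_table (
  id   INT,
  name STRING
)
CLUSTERED BY (id) INTO 32 BUCKETS;

-- 2. Ask Hive to honor the bucket definition when writing data.
SET hive.enforce.bucketing = true;

-- 3. Load from the parent table with a regular INSERT ... SELECT.
INSERT OVERWRITE TABLE bucketed_table
SELECT id, name
FROM parent_table;
```

If the goal is a bucketed map join, the CLUSTERED BY column should be the join column, since both tables have to be bucketed on it.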
You need to create the bucketed table separately, then enable hive.enforce.bucketing, and after that use a SELECT statement from the parent table to load data into the bucketed one.

Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: Abhishek <abhishek.dod...@gmail.com>
Date: Fri, 28 Sep 2012 11:14:56
To: Bejoy Ks<bejoy...@yahoo.com>
Reply-To: user@hive.apache.org
Cc: user@hive.apache.org<user@hive.apache.org>
Subject: Re: Performance tuning in hive

Hi Bejoy,

How do I use CTAS with CLUSTERED BY? I am getting the following error when doing a CREATE TABLE AS SELECT:

CTAS does not support partitioning in the target table.

Regards
Abhi

Sent from my iPhone

On Sep 28, 2012, at 5:32 AM, Bejoy KS <bejoy...@yahoo.com> wrote:

> Hi Abhishek
>
> Which optimization you choose depends entirely on your queries, or the
> kind of queries fired on those tables. Based on that you need to bucket and
> index them to get better performance. From a bird's-eye view,
> bucketing + indexing + map joins would be a good combination if those suit
> your data set.
>
> Regards,
> Bejoy KS
>
> From: Abhishek <abhishek.dod...@gmail.com>
> To: "user@hive.apache.org" <user@hive.apache.org>
> Cc: "user@hive.apache.org" <user@hive.apache.org>
> Sent: Friday, September 28, 2012 5:16 AM
> Subject: Re: Performance tuning in hive
>
> Hi Bejoy,
>
> Thanks for the reply. Can I know whether a combination of
> 1) Indexing and bucketing
> Or
> 2) Bucketing with RC file
> Or
> 3) Sequence file with bucketing and indexing
> Or
> 4) Map join with indexes
> Or
>
> any other combination of the above, or of options not mentioned, would fetch
> better performance?
>
> Regards
> Abhi
>
> Sent from my iPhone
>
> On Sep 27, 2012, at 2:44 PM, Bejoy KS <bejoy...@yahoo.com> wrote:
>
>> Hi Abhishek
>>
>> You can have a look at join optimizations as well as group by optimizations.
>>
>> Join optimization - Based on your data sets you can go with a map-side join
>> or a bucketed map join.
>> To enable map join -> set hive.auto.convert.join = true;
>>
>> To enable bucketed map join -> set hive.optimize.bucketmapjoin = true
>> (The prerequisite here is that both tables are bucketed on the join
>> column.)
>> If the data in the buckets is sorted, you can go with a sort merge join
>> as well. You need to enable the following properties:
>> set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
>> set hive.optimize.bucketmapjoin = true;
>> set hive.optimize.bucketmapjoin.sortedmerge = true;
>>
>> For details you can refer to the following URL:
>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
>>
>> Group by optimization - You can go ahead with a few group by optimizations
>> as well. A few pointers are here:
>> http://mail-archives.apache.org/mod_mbox/hive-user/201209.mbox/%3cb55ff166-239e-4e39-bf92-3ae59eb78...@gmail.com%3E
>>
>>
>> Hive indexes - Joins and group-bys are optimized better with buckets. Based
>> on your queries you need to predetermine how your tables should be bucketed.
>> Indexing also gives you a great performance advantage for queries that
>> involve GROUP BY and WHERE. Join optimization using indexes is in progress:
>> https://issues.apache.org/jira/browse/HIVE-2845
>>
>>
>> RC file versus Sequence File is a choice to be made based on the query patterns.
>> If you are querying only a few columns, then RC files give you a performance
>> edge, but if the queries span pretty much all columns, then use
>> the more generalized Sequence Files.
>>
>>
>> Regards,
>> Bejoy KS
>>
>> From: Abhishek <abhishek.dod...@gmail.com>
>> To: Hive <user@hive.apache.org>
>> Sent: Thursday, September 27, 2012 7:03 PM
>> Subject: Performance tuning in hive
>>
>> Hi all,
>>
>> I am trying to improve the performance of some queries in Hive. The queries
>> mostly contain left outer joins, group by, conditional checks, and union all.
>> I have overridden some properties in the Hive shell:
>>
>> Set io.sort.mb=512
>> Set io.sort.factor=100
>> Set mapred.child.jvm.opts=-Xmx2048mb
>> Set hive.map.aggr=true
>> Set hive.exec.parallel=true
>> Set mapred.tasks.reuse.num.tasks=-1
>> Set hive.mapred.map.speculative.execution=false
>> Set hive.mapred.reduce.speculative.execution=false
>>
>> I got some performance gain.
>>
>> I still want to improve the performance of these queries.
>>
>> Which of the following gives me better performance?
>>
>> RC file
>> Indexing
>> Bucketing
>> Sequence file
>> Combination of the above
>>
>> Or
>>
>> Some configuration parameter tuning
>>
>> Which one of the above yields good performance?
>>
>> Thanks in advance.
>>
>> Regards
>> Abhi
>