Hi Abhishek

I don't think PARTITIONED BY and CLUSTERED BY are supported in CTAS.

You need to create the bucketed table separately, then set
hive.enforce.bucketing = true, and after that use an INSERT ... SELECT
from the parent table to load data into the bucketed one.
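
A rough sketch of those steps, assuming made-up names throughout: the tables
sales_parent and sales_bucketed, their columns, the partition value and the
bucket count are all placeholders, so adjust them for your schema.

```sql
-- Hypothetical table and column names, for illustration only.

-- 1. Create the bucketed (and here also partitioned) table on its own,
--    instead of through CTAS.
CREATE TABLE sales_bucketed (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DOUBLE
)
PARTITIONED BY (order_date STRING)
CLUSTERED BY (customer_id) INTO 32 BUCKETS;

-- 2. Make Hive honor the bucket definition when writing into the table.
SET hive.enforce.bucketing = true;

-- 3. Load from the parent table with a plain INSERT ... SELECT.
--    A static partition is used so no dynamic partition settings are needed.
INSERT OVERWRITE TABLE sales_bucketed PARTITION (order_date = '2012-09-28')
SELECT order_id, customer_id, amount
FROM sales_parent
WHERE order_date = '2012-09-28';
```

With hive.enforce.bucketing on, Hive should pick the number of reducers to
match the bucket count, so you should not need to set it yourself.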


Regards
Bejoy KS


-----Original Message-----
From: Abhishek <abhishek.dod...@gmail.com>
Date: Fri, 28 Sep 2012 11:14:56 
To: Bejoy Ks<bejoy...@yahoo.com>
Reply-To: user@hive.apache.org
Cc: user@hive.apache.org<user@hive.apache.org>
Subject: Re: Performance tuning in hive

Hi Bejoy,

How do I use CTAS with CLUSTERED BY?

I am getting the following error when running CREATE TABLE AS SELECT:

CTAS does not support partitioning in the target table.

Regards
Abhi


On Sep 28, 2012, at 5:32 AM, Bejoy KS <bejoy...@yahoo.com> wrote:

> Hi Abhishek
> 
> Which optimization to choose depends entirely on your queries, or the kind
> of queries fired on those tables. Based on that you need to bucket and index
> the tables to get better performance. From a bird's-eye view, bucketing +
> indexing + map joins would be a good combination if they suit your data set.
>  
> Regards,
> Bejoy KS
> 
> From: Abhishek <abhishek.dod...@gmail.com>
> To: "user@hive.apache.org" <user@hive.apache.org> 
> Cc: "user@hive.apache.org" <user@hive.apache.org> 
> Sent: Friday, September 28, 2012 5:16 AM
> Subject: Re: Performance tuning in hive
> 
> Hi Bejoy,
> 
> Thanks for the reply. Can I know which of these combinations would fetch
> better performance:
> 1) Indexing and bucketing
> 2) Bucketing with RCFile
> 3) Sequence file with bucketing and indexing
> 4) Map join with indexes
> 5) Any other combination of the above, or of options not mentioned here
> 
> Regards
> Abhi
> 
> 
> On Sep 27, 2012, at 2:44 PM, Bejoy KS <bejoy...@yahoo.com> wrote:
> 
>> Hi Abhishek
>> 
>> You can have a look at join optimizations as well as group by optimizations
>> 
>> Join optimization - Based on your data sets you can go with a map-side join
>> or a bucketed map join.
>> To enable map join -> set hive.auto.convert.join = true;
>> 
>> To enable bucketed map join -> set hive.optimize.bucketmapjoin = true;
>> (The prerequisite here is that both tables are bucketed on the join column.)
>> If the data in the buckets is sorted, you can go with a sort merge join as
>> well. For that you need to set the following properties:
>>   set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
>>   set hive.optimize.bucketmapjoin = true;
>>   set hive.optimize.bucketmapjoin.sortedmerge = true;
>> 
>> For details you can refer to the following URL:
>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
>> 
>> Group By optimization - You can go ahead with a few group by optimizations
>> as well. A few pointers are here:
>> http://mail-archives.apache.org/mod_mbox/hive-user/201209.mbox/%3cb55ff166-239e-4e39-bf92-3ae59eb78...@gmail.com%3E
>> 
>> 
>> Hive indexes - Joins and group by are optimized better with buckets. Based
>> on your queries you need to determine in advance how your tables should be
>> bucketed. Indexing also gives you a great performance advantage for queries
>> that involve group by and where clauses. Join optimization using indexes is
>> in progress:
>> https://issues.apache.org/jira/browse/HIVE-2845
>> 
>> 
>> RCFile or SequenceFile is a choice to be made based on your query patterns.
>> If you are querying only a few columns, RCFile gives you a performance edge,
>> but if the queries span pretty much all columns, use the more general
>> SequenceFile.
>> 
>>  
>> Regards,
>> Bejoy KS
>> 
>> From: Abhishek <abhishek.dod...@gmail.com>
>> To: Hive <user@hive.apache.org> 
>> Sent: Thursday, September 27, 2012 7:03 PM
>> Subject: Performance tuning in hive
>> 
>> Hi all,
>> 
>> I am trying to improve the performance of some queries in Hive. The queries
>> mostly contain left outer join, group by, conditional checks, and union all.
>> I have overridden some properties in the Hive shell:
>> 
>> Set io.sort.mb=512
>> Set io.sort.factor=100
>> Set mapred.child.java.opts=-Xmx2048m
>> Set hive.map.aggr=true
>> Set hive.exec.parallel=true
>> Set mapred.job.reuse.jvm.num.tasks=-1
>> Set mapred.map.tasks.speculative.execution=false
>> Set hive.mapred.reduce.tasks.speculative.execution=false
>> 
>> I got some performance gain, but I still want to improve the performance of
>> these queries.
>> 
>> Which of the following would give me better performance?
>> 
>> RCFile
>> Indexing
>> Bucketing
>> Sequence file
>> A combination of the above
>> 
>> Or
>> 
>> Some configuration parameter tuning
>> 
>> Which of the above yields good performance?
>> 
>> Thanks in advance.
>> 
>> Regards
>> Abhi
> 
> 
