Re: Partition performance

Ian Fri, 05 Apr 2013 11:36:33 -0700

Thanks. This is just a test on my local box, so each file is only 1 KB. I 
have shared the query plans for the two tests at:
http://codetidy.com/paste/raw/5198
http://codetidy.com/paste/raw/5199
 
Also, in the Hadoop log there is a line like this for each partition:
org.apache.hadoop.hive.ql.exec.MapOperator: Adding alias test1 to work list for file hdfs://localhost:8020/test1/2011/02/01/01
Does that mean each partition will become a map task?
 
I'm still new to Hive and am wondering what the common strategies are for 
partitioning hourly logs. I know we shouldn't have too many partitions, but 
what is the reason behind that? If I run this on a real cluster, maybe the 
two layouts won't perform so differently?
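
For reference, here is a rough Python sketch of how many partitions the two-week query has to touch under each layout. The directory patterns come from the two tables described in the original message below; the function names are made up for illustration, and this only counts partitions, it does not model how Hive actually plans the job.

```python
from datetime import date, timedelta

def days_between(start, end):
    """Return every date from start to end, inclusive."""
    return [start + timedelta(days=n) for n in range((end - start).days + 1)]

def hourly_partitions(start, end):
    """Layout 1 (test1): one partition directory per hour."""
    return [f"/test1/{d:%Y/%m/%d}/{h:02d}"
            for d in days_between(start, end)
            for h in range(24)]

def daily_partitions(start, end):
    """Layout 2 (test2): one partition directory per day, 24 files inside."""
    return [f"/test2/{d:%Y/%m/%d}" for d in days_between(start, end)]

# The query range from the thread: dt >= '2013-02-01' and dt <= '2013-02-14'.
start, end = date(2013, 2, 1), date(2013, 2, 14)
hourly = hourly_partitions(start, end)
daily = daily_partitions(start, end)

print(len(hourly), hourly[0])  # 336 /test1/2013/02/01/00
print(len(daily), daily[0])    # 14 /test2/2013/02/01
```

Both layouts read the same 336 files for this range, but the first one has to resolve 336 partitions and the second only 14, which seems consistent with the replies pointing at partition count rather than file count.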
 
Thanks.


________________________________
 From: Dean Wampler <dean.wamp...@thinkbiganalytics.com>
To: user@hive.apache.org 
Sent: Thursday, April 4, 2013 4:28 PM
Subject: Re: Partition performance
  

Also, how big are the files in each directory? Are they roughly the size of one 
HDFS block, or a multiple of it? Lots of small files will mean lots of mapper 
tasks with little to do.

You can also compare the job tracker console output for each job. I bet the 
slow one has a lot of very short map and reduce tasks, while the faster one has 
fewer tasks that run longer. A rule of thumb is that any one task should take 
20 seconds or more to amortize the few seconds of start-up time per task. 
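
As a back-of-the-envelope illustration of that rule of thumb, the sketch below computes what share of a task's run time is start-up cost. The 3-second start-up figure is only an assumption standing in for "a few seconds", and the task durations are made-up examples, not measurements from either job.

```python
def overhead_fraction(startup_s, task_s):
    """Share of a task's total run time that is spent on start-up."""
    if task_s < startup_s:
        raise ValueError("a task cannot finish before it has started up")
    return startup_s / task_s

STARTUP_S = 3.0  # assumption: stands in for "a few seconds" of start-up

# Hypothetical total task durations in seconds, start-up included.
for task_s in (4.0, 20.0, 120.0):
    share = overhead_fraction(STARTUP_S, task_s)
    print(f"{task_s:6.1f} s task: {share:.0%} of the time is start-up")
```

Under these assumed numbers a 4-second task spends most of its time starting up, while a 20-second task spends a small fraction, which is the point of the rule.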

In other words, if you think about what's happening at the HDFS and MR level, 
you can learn to predict how fast or slow things will run. Learning to read the 
output of EXPLAIN or EXPLAIN EXTENDED helps with this. 

dean


On Thu, Apr 4, 2013 at 6:25 PM, Owen O'Malley <omal...@apache.org> wrote:

See slide #9 from my Optimizing Hive Queries talk 
http://www.slideshare.net/oom65/optimize-hivequeriespptx . Certainly, we will 
improve it, but for now you are much better off with 1,000 partitions than 
10,000.
>
>-- Owen
>
>
>
>On Thu, Apr 4, 2013 at 4:21 PM, Ramki Palle <ramki.pa...@gmail.com> wrote:
>
>Is it possible for you to send the explain plan of these two queries?
>>
>>Regards,
>>Ramki.
>>
>>
>>
>>
>>On Thu, Apr 4, 2013 at 4:06 PM, Sanjay Subramanian 
>><sanjay.subraman...@wizecommerce.com> wrote:
>>
>>The slowdown is most likely due to the large number of partitions. 
>>>I believe the Hive book authors tell us to be cautious with a large number of 
>>>partitions :-)  and I abide by that. 
>>>
>>> 
>>>Users 
>>>Please add your points of view and experiences 
>>>
>>> 
>>>Thanks 
>>>sanjay 
>>>
>>> From: Ian <liu...@yahoo.com>
>>>Reply-To: "user@hive.apache.org" <user@hive.apache.org>, Ian 
>>><liu...@yahoo.com>
>>>Date: Thursday, April 4, 2013 4:01 PM
>>>To: "user@hive.apache.org" <user@hive.apache.org>
>>>Subject: Partition performance
>>>
>>>
>>> 
>>>Hi, 
>>>
>>>I created 3 years of hourly log files (26280 files in total) and used an 
>>>external table with partitions to query them. I tried two partitioning methods. 
>>>
>>>1). Log files are stored as /test1/2013/04/02/16/000000_0 (A directory per 
>>>hour). Use date and hour as partition keys. Add 3 years of directories to 
>>>the table partitions. So there are 26280 partitions. 
>>>        CREATE EXTERNAL TABLE test1 (logline string) PARTITIONED BY (dt string, hr int); 
>>>        ALTER TABLE test1 ADD PARTITION (dt='2013-04-02', hr=16) LOCATION '/test1/2013/04/02/16'; 
>>>  
>>>2). Log files are stored as /test2/2013/04/02/16_000000_0 (A directory per 
>>>day, 24 files in each directory). Use date as partition key. Add 3 years of 
>>>directories to the table partitions. So there are 1095 partitions. 
>>>        CREATE EXTERNAL TABLE test2 (logline string) PARTITIONED BY (dt string); 
>>>        ALTER TABLE test2 ADD PARTITION (dt='2013-04-02') LOCATION '/test2/2013/04/02'; 
>>>  
>>>When doing a simple query like  
>>>    SELECT * FROM  test1/test2  WHERE  dt >= '2013-02-01' and dt <= '2013-02-14'  
>>>Using approach #1 takes 320 seconds, but #2 only takes 70 seconds.  
>>>
>>>I'm wondering why there is such a big performance difference between the two. 
>>>Both approaches have the same number of files; only the directory 
>>>structure is different. So Hive is going to load the same number of files. 
>>>Why does the number of partitions have such a big impact? Does that mean #2 is 
>>>a better partition strategy? 
>>>  
>>>Thanks.  
>>>
>>>   
>>>
>> 
> 


-- 
Dean Wampler, Ph.D.
thinkbiganalytics.com
+1-312-339-1330
