If you are using Amazon EMR, you can set hive.optimize.s3.query=true to speed up part b. See https://forums.aws.amazon.com/ann.jspa?annID=1105 for more info.
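The property above is set per-session in Hive. A minimal sketch of how this would look (the property is EMR-specific, per the linked announcement; it is not part of stock Apache Hive):

```sql
-- EMR-specific optimization for S3-backed tables (per the AWS announcement);
-- speeds up input split calculation (factor b below) by reducing S3 listing calls.
SET hive.optimize.s3.query=true;

-- Subsequent queries in this session pick up the setting.
SELECT COUNT(*) FROM my_s3_table;  -- table name is hypothetical
```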
From: Ashutosh Chauhan [mailto:hashut...@apache.org]
Sent: Thursday, October 20, 2011 1:21 PM
To: user@hive.apache.org
Subject: Re: Running hive on large number of files in S3

Hey Thulasi,

There are two factors that may affect job startup time when there is a large number of partitions:

a) Getting partition info from the metastore: Hive stores metadata about each partition in the metastore. Depending on the number of partitions it needs to fetch, that can take some time.

b) Input split calculation by the job client for all the map tasks. The majority of this time is spent getting a FileStatus for each file from the underlying filesystem.

If your use case is to process all the partitions, then having fewer, larger partitions will help with both of these factors. On the other hand, if you usually process only a few partitions, then finer-grained partitioning is better.

Hope it helps,
Ashutosh

On Thu, Oct 20, 2011 at 13:10, Thulasi Ram Naidu Peddineni <thulasiram...@gmail.com> wrote:

Hi All,

I have a use case where I join table1 with table2. These are external tables with data in S3. table2 has many partitions (say 10K), totaling around 2 GB, and table1 has around 5-10 partitions totaling 1-2 MB. When I join these two tables, the query takes a long time to execute (more than 20 minutes). From my observation, the actual job execution does not take much time; the bottleneck is starting the job itself. This makes me think that Hive is prefetching all the data from S3 before doing the processing. Can someone explain why the Hive job does not start on time on an external table with too many partitions? One more observation: if I reduce the number of partitions while keeping the same amount of data, the whole query executes faster. What is the recommended approach in such a scenario?

-----
Thanks,
Thulasi Ram P
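The trade-off Ashutosh describes can be sketched in HiveQL. This is illustrative only; the table names, columns, partition keys, and S3 paths are all hypothetical:

```sql
-- Finer-grained partitioning (e.g. by date AND hour): good when queries
-- usually touch only a few partitions, but with ~10K partitions the
-- metastore fetch (a) and per-file FileStatus calls (b) dominate startup.
CREATE EXTERNAL TABLE table2_fine (id BIGINT, payload STRING)
PARTITIONED BY (dt STRING, hr STRING)
LOCATION 's3://my-bucket/table2/';

-- Coarser partitioning (date only): fewer partitions and fewer listing
-- calls, so faster startup when most queries scan everything anyway,
-- as in a full join against table1.
CREATE EXTERNAL TABLE table2_coarse (id BIGINT, payload STRING)
PARTITIONED BY (dt STRING)
LOCATION 's3://my-bucket/table2-coarse/';

-- If you must keep fine partitions, filter on the partition columns so
-- Hive prunes partitions before split calculation instead of listing all 10K:
SELECT t1.id, t2.payload
FROM table1 t1
JOIN table2_fine t2 ON t1.id = t2.id
WHERE t2.dt = '2011-10-20' AND t2.hr = '13';
```

The partition-pruning filter in the final query only helps when the workload genuinely needs a subset of partitions; for a join that reads everything, consolidating into fewer, larger partitions is the change that shortens startup.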