If you are using Amazon EMR, you can set hive.optimize.s3.query=true to speed up part b. See https://forums.aws.amazon.com/ann.jspa?annID=1105 for more info.
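The property above is set per-session in Hive. A minimal sketch of how this would look (the property is EMR-specific, per the linked announcement; it is not part of stock Apache Hive):

```sql
-- EMR-specific optimization for S3-backed tables (per the AWS announcement);
-- speeds up input split calculation (factor b below) by reducing S3 listing calls.
SET hive.optimize.s3.query=true;

-- Subsequent queries in this session pick up the setting.
SELECT COUNT(*) FROM my_s3_table;  -- table name is hypothetical
```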
From: Ashutosh Chauhan [mailto:hashut...@apache.org]
Sent: Thursday, October 20, 2011 1:21 PM
To: user@hive.apache.org
Subject: Re: Running hive on large number of files in S3

Hey Thulasi,

There are two factors that may affect job startup time when there is a large number of partitions:

a) Getting partition info from the metastore: Hive stores metadata about each partition in the metastore. Depending on the number of partitions it needs to fetch, that can take some time.

b) Input split calculation by the job client for all the map tasks. The majority of this time is spent getting a FileStatus for each file from the underlying filesystem.

If your use case is to process all the partitions, then having fewer, larger partitions will help with both of these factors. On the other hand, if you usually process only a few partitions, then finer-grained partitioning is better.

Hope it helps,
Ashutosh

On Thu, Oct 20, 2011 at 13:10, Thulasi Ram Naidu Peddineni <thulasiram...@gmail.com> wrote:

Hi All,

I have a use case where I join table1 with table2. These are external tables with data in S3. table2 has many partitions (say 10K), totaling around 2 GB, and table1 has around 5-10 partitions totaling 1-2 MB. When I join these two tables, the query takes a long time to execute (more than 20 minutes). From my observation, the actual job execution does not take much time; the bottleneck is starting the job itself. This makes me think that Hive is prefetching all the data from S3 before doing the processing. Can someone explain why the Hive job does not start on time on an external table with too many partitions? One more observation: if I reduce the number of partitions while keeping the same amount of data, the whole query executes faster. What is the recommended approach in such a scenario?

-----
Thanks,
Thulasi Ram P
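The trade-off Ashutosh describes can be sketched in HiveQL. This is illustrative only; the table names, columns, partition keys, and S3 paths are all hypothetical:

```sql
-- Finer-grained partitioning (e.g. by date AND hour): good when queries
-- usually touch only a few partitions, but with ~10K partitions the
-- metastore fetch (a) and per-file FileStatus calls (b) dominate startup.
CREATE EXTERNAL TABLE table2_fine (id BIGINT, payload STRING)
PARTITIONED BY (dt STRING, hr STRING)
LOCATION 's3://my-bucket/table2/';

-- Coarser partitioning (date only): fewer partitions and fewer listing
-- calls, so faster startup when most queries scan everything anyway,
-- as in a full join against table1.
CREATE EXTERNAL TABLE table2_coarse (id BIGINT, payload STRING)
PARTITIONED BY (dt STRING)
LOCATION 's3://my-bucket/table2-coarse/';

-- If you must keep fine partitions, filter on the partition columns so
-- Hive prunes partitions before split calculation instead of listing all 10K:
SELECT t1.id, t2.payload
FROM table1 t1
JOIN table2_fine t2 ON t1.id = t2.id
WHERE t2.dt = '2011-10-20' AND t2.hr = '13';
```

The partition-pruning filter in the final query only helps when the workload genuinely needs a subset of partitions; for a join that reads everything, consolidating into fewer, larger partitions is the change that shortens startup.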