Re: Running hive on large number of files in S3

Jerome Boulon Thu, 20 Oct 2011 13:16:39 -0700

Hi,
I don't think your job is actually prefetching the data while you're
waiting.
If you have a large number of partitions, then getting the list of files to
compute the splits
(i.e. fetching the file names from S3, not the data) is what is taking forever.
If you have premium support from Amazon, you may want to ask for help in
this area.
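
A rough sketch of why the listing step can add up. It assumes, purely for
illustration, that split computation issues one S3 list request per
partition, one after another, at a hypothetical 100 ms per request. The
function name and all figures except the 10K partition count are made up
for the example and are not measurements:

```python
def estimate_listing_seconds(partitions, latency_ms, requests_per_partition=1):
    """Estimate serial time spent listing files before any split is computed.

    Assumption: every partition costs `requests_per_partition` round trips
    to S3, issued serially, each taking `latency_ms` milliseconds.
    """
    total_ms = partitions * requests_per_partition * latency_ms
    return total_ms / 1000


# Hypothetical 100 ms per list request; 10K partitions as in the question.
many = estimate_listing_seconds(10_000, 100)

# Same data repacked into 100 partitions (illustrative figure).
few = estimate_listing_seconds(100, 100)

print(f"10,000 partitions: ~{many / 60:.0f} min of listing")
print(f"   100 partitions: ~{few:.0f} s of listing")
```

Under these assumptions the cost scales with the number of partitions, not
with the amount of data, which would be consistent with the query running
faster when the same data sits in fewer partitions.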


/Jerome

On 10/20/11 1:10 PM, "Thulasi Ram Naidu Peddineni"
<thulasiram...@gmail.com> wrote:

>Hi All,
>    I have a use case where I join table1 with table2.
>These are external tables with data in S3. table2 has many partitions
>(say 10K), the size being around 2GB, and table1 has around 5-10 partitions
>of around 1-2MB. When I join these two tables, I observe that the
>query takes a long time to execute (more than 20 minutes).
>From my observation, the actual job execution does not take much
>time; the bottleneck is starting the job itself. This makes me
>think that Hive is prefetching all the data from S3 and then doing the
>processing. Can someone explain why the Hive job does not start
>promptly on an external table with so many partitions?
>  One more observation: if I reduce the number of partitions
>while keeping the same amount of data, the whole query executes faster.
>
>What is the recommended approach in such a scenario?
>
>-----
>Thanks,
>Thulasi Ram P
>
