Hi, I don't think your job is actually prefetching the data while you're waiting. If you have a large number of partitions, then getting the list of files to compute the splits (in effect, prefetching the file names from S3) is what is taking forever. If you have premium support from Amazon, you may want to ask for help in this area.
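To put a rough number on it, here is a back-of-envelope sketch in Python. It assumes the partition directories are listed one after another and that each S3 listing round trip costs about 0.12 s. Both are my assumptions for illustration, not measured values, and the function name is made up.

```python
# Back-of-envelope model of job startup time spent listing files.
# ASSUMPTION: partition directories are listed sequentially, one S3
# listing request per partition, before any map task can start.


def estimated_listing_seconds(num_partitions, seconds_per_listing):
    """Total time spent listing partition directories one by one."""
    return num_partitions * seconds_per_listing


# ASSUMPTION: hypothetical latency per S3 listing round trip (not measured).
SECONDS_PER_LISTING = 0.12

for partitions in (10, 1_000, 10_000):
    secs = estimated_listing_seconds(partitions, SECONDS_PER_LISTING)
    print(f"{partitions:>6} partitions -> ~{secs / 60:.1f} min of listing")
```

Under those assumptions the listing cost grows linearly with the partition count and is independent of the data size, which would fit your observation that fewer partitions with the same amount of data runs faster.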
/Jerome

On 10/20/11 1:10 PM, "Thulasi Ram Naidu Peddineni" <thulasiram...@gmail.com> wrote:

>Hi All,
> I have a use-case where I will be joining table1 with table2.
>These are external tables with data in S3. table2 has many partitions
>(say 10K) size being around 2GB and table1 has around 5-10 partitions
>around 1-2MB. When I am joining these two tables, I observed that it
>is taking lot of time to execute the query (more than 20 minutes).
>From my observation, the actual job execution is not taking lot of
>time but the bottle neck is starting the job itself. This makes me
>think that hive prefetching all the data from S3 and then do the
>processing. Can some one explain me why is hive job is not starting
>ontime on an external table with too many-partitions ?
> One more observation here is, if I reduce the number of partitions
>with same amount of data, the whole query is executing faster.
>
>And what is the recommended way in such a scenario.
>
>-----
>Thanks,
>Thulasi Ram P
>