There is a post about S3 optimization that you might find useful:
https://www.quora.com/How-does-Qubole-improve-S3-performance

On Fri, Sep 18, 2015 at 2:24 AM, Sreenath <sreenaths1...@gmail.com> wrote:

> Hi All,
>
> Something interesting came to my notice yesterday when I was using Hive
> for some queries. The time taken by Hive to launch a MapReduce job was
> many times higher than the time taken by Hadoop to actually execute it.
> These are the details of the table the query is run against.
>
> CREATE EXTERNAL TABLE A
> (
>   user_id string,
>   stage string,
>   url string
> )
> PARTITIONED BY (dt string, id string)
>
> All the data for the table is stored in S3, and each day there will be
> around 2000 unique ids, i.e. 2000 partitions being added daily. We can
> assume that each partition has on average 100 MB of gzip-compressed data.
>
> Now when I run a query like "SELECT DISTINCT user_id FROM A WHERE
> dt >= '20150101' AND dt <= '20150401'", i.e. over a period of 3 months and
> approx 60000 partitions, it takes Hive approximately 2 hrs to launch the
> MapReduce job, and the launched job finishes in just 20 min. So I was
> wondering if someone can help me understand what Hive is doing in these
> 2 hrs?
>
> Would really appreciate some help here. Thanks in advance!
>
> Best,
> Sreenath
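
For a rough sense of scale, the figures quoted above can be turned into a per-partition number. This is only a back-of-the-envelope sketch in Python using the values from the original message. The variable names are my own, and the assumption that the launch overhead is spread evenly across partitions is purely illustrative.

```python
# Figures taken from the quoted message; variable names are illustrative.
launch_seconds = 2 * 60 * 60   # ~2 hours before the MapReduce job starts
job_seconds = 20 * 60          # ~20 minutes for the launched job itself
partitions = 60_000            # approximate partitions covered by the query

# Assumption: the launch overhead is spread evenly across all partitions.
per_partition_seconds = launch_seconds / partitions
launch_to_job_ratio = launch_seconds / job_seconds

print(f"{per_partition_seconds:.2f} s of launch overhead per partition")
print(f"launch phase is {launch_to_job_ratio:.0f}x as long as the job")
```

Roughly a tenth of a second per partition would be consistent with some fixed per-partition cost being paid before the job is submitted, although the original message does not say where the time actually goes.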