There is a post about S3 optimization that you might find useful:

https://www.quora.com/How-does-Qubole-improve-S3-performance

On Fri, Sep 18, 2015 at 2:24 AM, Sreenath <sreenaths1...@gmail.com> wrote:

> Hi All,
>
> Something interesting came to my notice yesterday while I was using Hive for
> some queries. The time Hive took to launch a MapReduce job was many times
> higher than the time Hadoop took to actually execute it.
> These are the details of the table the query is run against:
>
> CREATE EXTERNAL TABLE A
> (
>     user_id string,
>     stage string,
>     url string
> )
> PARTITIONED BY (dt string , id string)
>
> All the data for the table is stored in S3, and each day there are around
> 2000 unique ids, i.e. 2000 partitions added daily. Assume that each
> partition holds on average 100 MB of gzip-compressed data.
> Now when I run a query like "SELECT DISTINCT user_id FROM A WHERE
> dt >= '20150101' AND dt <= '20150401'", i.e. over a period of 3 months and
> approximately 60000 partitions, Hive takes approximately 2 hours to launch
> the MapReduce job, and the launched job itself finishes in just 20 minutes.
> Can someone help me understand what Hive is doing during those 2 hours?
> I would really appreciate some help here. Thanks in advance!
>
>
> Best,
> Sreenath
>
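
To put the quoted figures in perspective, here is a rough back-of-the-envelope
sketch in Python. It uses only the numbers from the message above (60000
partitions, about 2 hours before launch, about 100 MB per partition); the
function and constant names are made up for illustration. The result would be
consistent with some fixed amount of work being done per partition before the
job starts, but that reading is an assumption, not something the arithmetic
proves.

```python
# Back-of-the-envelope figures taken from the quoted message.
PARTITIONS = 60_000            # partitions matched by the dt range
LAUNCH_SECONDS = 2 * 60 * 60   # roughly 2 hours before the job starts
AVG_PARTITION_MB = 100         # average gzip-compressed size per partition


def per_partition_overhead(launch_seconds: float, partitions: int) -> float:
    """Average pre-launch time spent per partition, in seconds."""
    return launch_seconds / partitions


def total_input_tb(partitions: int, avg_mb: float) -> float:
    """Total compressed input in decimal terabytes (1 TB = 1,000,000 MB)."""
    return partitions * avg_mb / 1_000_000


overhead = per_partition_overhead(LAUNCH_SECONDS, PARTITIONS)
volume = total_input_tb(PARTITIONS, AVG_PARTITION_MB)
print(f"{overhead:.2f} s per partition, {volume:.1f} TB compressed input")
```

This prints "0.12 s per partition, 6.0 TB compressed input", i.e. the launch
phase averages a little over a tenth of a second per partition.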
