Hi All,

I noticed something interesting yesterday while running some queries in Hive:
the time Hive took to launch a MapReduce job was many times longer than the
time Hadoop took to actually execute it. Here is the definition of the table
being queried:

CREATE EXTERNAL TABLE A
(
    user_id string,
    stage string,
    url string
)
PARTITIONED BY (dt string , id string)

All the data for the table is stored in S3, and each day around 2000 unique
ids, i.e. 2000 partitions, are added. We can assume that each partition
holds on average 100 MB of gzip-compressed data.

When I run a query like "SELECT DISTINCT user_id FROM A WHERE
dt >= '20150101' AND dt <= '20150401'", i.e. over a period of 3 months and
approximately 60000 partitions, Hive takes approximately 2 hours to launch
the MapReduce job, and the launched job then finishes in just 20 minutes.
Could someone help me understand what Hive is doing during those 2 hours?
I would really appreciate some help here. Thanks in advance!
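
A minimal sketch of how the pre-launch time could be narrowed down, assuming
the table and date range above. EXPLAIN compiles the query and prints the plan
without submitting a job, so timing it separately from the full query would
indicate whether the delay falls in compilation or in the steps after it. The
single-day query is only a comparison point. No timings are shown because none
were measured.

```sql
-- Compile only: Hive builds and prints the plan but does not submit a job.
EXPLAIN
SELECT DISTINCT user_id
FROM A
WHERE dt >= '20150101' AND dt <= '20150401';

-- Same query restricted to a single day, to compare launch time
-- against the three-month range.
SELECT DISTINCT user_id
FROM A
WHERE dt = '20150101';
```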


Best,
Sreenath
