Hi All,

Something interesting came to my notice yesterday while I was running some queries in Hive: the time Hive took to launch a MapReduce job was many times longer than the time Hadoop took to actually execute it. Here are the details of the table the query runs against:
CREATE EXTERNAL TABLE A (
    user_id string,
    stage string,
    url string
)
PARTITIONED BY (dt string, id string)

All the data for the table is stored in S3, and each day there are around 2000 unique ids, i.e. 2000 partitions are added daily. We can assume that each partition holds, on average, 100 MB of gzip-compressed data.

Now when I run a query like

SELECT DISTINCT user_id FROM A WHERE dt >= '20150101' AND dt <= '20150401'

i.e. over a period of 3 months (approximately 60000 partitions), Hive takes approximately 2 hours to launch the MapReduce job, and the launched job itself finishes in just 20 minutes.

So I was wondering if someone could help me understand what Hive is doing during those 2 hours?

I would really appreciate some help here. Thanks in advance!

Best,
Sreenath