Ariel, setting hive.optimize.s3.query=true does seem to help. I'm surprised, though, because the information I can find online about that config suggests it's aimed at tables with a large number of partitions. I have a lot of files but only one partition, yet it helps anyway.
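In case it's useful to anyone else hitting this, I'm just passing it per-invocation on the command line (a sketch only; the table name is a placeholder):

  # enable the EMR-specific S3 listing optimization for this one invocation
  # (my_s3_table is a placeholder for the real S3-backed table)
  hive -hiveconf hive.optimize.s3.query=true \
       -e "SELECT count(*) FROM my_s3_table;"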
Abdelrahman,

Thanks for the logging tip. I do want to know what it is doing, so this should be helpful.

Marc

On Wed, Jan 30, 2013 at 3:23 PM, Abdelrahman Shettia <ashet...@hortonworks.com> wrote:

> Hi Marc,
>
> You can try running the Hive client with debug mode on and see what it is
> trying to do at the JT level:
>
>   hive -hiveconf hive.root.logger=ALL,console -e "DDL;"
>   hive -hiveconf hive.root.logger=ALL,console -f ddl.sql
>
> Hope this helps.
>
> Thanks,
> -Abdelrahman
>
>
> On Wed, Jan 30, 2013 at 3:16 PM, Marc Limotte <mslimo...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm running on an Amazon EMR cluster with Hive 0.8.1. We have a lot of
>> other Hadoop jobs, but only started experimenting with Hive recently.
>>
>> I've been seeing a long pause between submitting a Hive query and the
>> actual start of the Hadoop job... 10 minutes or more in some cases. I'm
>> wondering what's happening during this time. Either a high-level answer,
>> or maybe some logging I can turn on, would help.
>>
>> Here's some more detail. I submit the query on the master using the Hive
>> CLI, and start to see some output right away:
>>
>> Total MapReduce jobs = 2
>> Launching Job 1 out of 2
>> Number of reduce tasks not specified. Estimated from input data size: 1
>> In order to change the average load for a reducer (in bytes):
>>   set hive.exec.reducers.bytes.per.reducer=<number>
>> In order to limit the maximum number of reducers:
>>   set hive.exec.reducers.max=<number>
>> In order to set a constant number of reducers:
>>   set mapred.reduce.tasks=<number>
>>
>> *[then a long delay here: 10 minutes or more... no activity in the
>> Hadoop JobTracker UI]*
>>
>> ... and then it continues normally:
>>
>> Starting Job = job_201301160029_0082, Tracking URL =
>> http://ip-xxxxxxxx.ec2.internal:9100/jobdetails.jsp?jobid=job_201301160029_0082
>> Kill Command = /home/hadoop/bin/hadoop job
>> -Dmapred.job.tracker=xxxxxx:9001 -kill job_201301160029_0082
>> Hadoop job information for Stage-1: number of mappers: 2; number of
>> reducers: 1
>> 2013-01-30 20:45:30,526 Stage-1 map = 0%, reduce = 0%
>> ...
>>
>> This query is processing in the neighborhood of 500 GB of data from S3.
>> A couple of possibilities I thought of... perhaps someone can confirm or
>> deny:
>> a) Is the data copied from S3 to HDFS during this time?
>> b) I have a fairly large set of libs in HIVE_AUX_JARS_PATH (around 175
>> MB) -- does Hive have to copy these out to the tasks at this time?
>>
>> Any insights appreciated.
>>
>> Marc
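P.S. Following up on my own question (b) below: if the aux jars really are being copied out for every job, trimming HIVE_AUX_JARS_PATH down to only the jars a given query needs seems worth a try (a sketch only; the jar paths are placeholders, and I'm assuming the variable accepts a comma-separated list of jars):

  # ship only the jars this query actually uses, rather than the full
  # ~175 MB set (paths below are just examples)
  export HIVE_AUX_JARS_PATH=/home/hadoop/aux/my-serde.jar,/home/hadoop/aux/my-udfs.jar
  hive -f query.sql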