Ariel,
Setting hive.optimize.s3.query=true seems to help. I'm surprised, though,
because the information I can find online about that config suggests it is
aimed at tables with a large number of partitions, and I have a lot of
files but only one partition. Still, it seems to help.
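
For reference, here's roughly how I'm setting it, just for the session
rather than in hive-site.xml (the table name below is a made-up
placeholder, not my real query):

hive -e "set hive.optimize.s3.query=true; select count(*) from my_s3_table;"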

Abdelrahman,
Thanks for the logging tip. I do want to know what it is doing, so this
should be helpful.
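
To make the output manageable I'll probably redirect it to a file and grep
it afterwards. A rough sketch of what I have in mind (untested; query.sql
and the log file name are placeholders, and I'm assuming the console
logger writes to stderr):

hive -hiveconf hive.root.logger=ALL,console -f query.sql 2> hive-debug.log
grep -i s3 hive-debug.log | less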

Marc

On Wed, Jan 30, 2013 at 3:23 PM, Abdelrahman Shettia <
ashet...@hortonworks.com> wrote:

> Hi Marc,
>
> You can try running the Hive client with debug mode on to see what it is
> trying to do at the JobTracker (JT) level:
> hive -hiveconf hive.root.logger=ALL,console -e "DDL;"
> hive -hiveconf hive.root.logger=ALL,console -f ddl.sql
>
> Hope this helps.
>
> Thanks
> -Abdelrahman
>
>
> On Wed, Jan 30, 2013 at 3:16 PM, Marc Limotte <mslimo...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm running in Amazon on an EMR cluster with Hive 0.8.1. We have a lot
>> of other Hadoop jobs, but only started experimenting with Hive recently.
>>
>> I've been seeing a long pause between submitting a Hive query and the
>> actual start of the Hadoop job... 10 minutes or more in some cases. I'm
>> wondering what's happening during this time. A high-level answer would
>> help, or maybe there is some logging I can turn on?
>>
>> Here's some more detail. I submit the query on the master using the Hive
>> CLI, and start to see some output right away...
>>
>> Total MapReduce jobs = 2
>> Launching Job 1 out of 2
>> Number of reduce tasks not specified. Estimated from input data size: 1
>> In order to change the average load for a reducer (in bytes):
>>   set hive.exec.reducers.bytes.per.reducer=<number>
>> In order to limit the maximum number of reducers:
>>   set hive.exec.reducers.max=<number>
>> In order to set a constant number of reducers:
>>   set mapred.reduce.tasks=<number>
>>
>>
>> *[then a long delay here: 10 minutes or more... no activity in the
>> hadoop job tracker ui] *
>>
>>
>> … and then it continues normally ...
>> Starting Job = job_201301160029_0082, Tracking URL =
>> http://ip-xxxxxxxx.ec2.internal:9100/jobdetails.jsp?jobid=job_201301160029_0082
>> Kill Command = /home/hadoop/bin/hadoop job
>>  -Dmapred.job.tracker=xxxxxx:9001 -kill job_201301160029_0082
>> Hadoop job information for Stage-1: number of mappers: 2; number of
>> reducers: 1
>> 2013-01-30 20:45:30,526 Stage-1 map = 0%,  reduce = 0%
>> …
>>
>> This query is processing in the neighborhood of 500 GB of data from S3.
>> A couple of possibilities occurred to me… perhaps someone can confirm or
>> deny:
>> a) Is the data copied from S3 to HDFS during this time?
>> b) I have a fairly large set of libs in HIVE_AUX_JARS_PATH (~175 MB) --
>> does Hive have to copy these around to the tasks at this time?
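>>
>> One rough way I might test (a) is to time a bare listing of the table's
>> S3 location from the master node (the bucket and path here are made-up
>> placeholders):
>>
>> time hadoop fs -ls s3://my-bucket/my-table/ | wc -l
>>
>> If the listing alone takes minutes with this many files, that would
>> point at S3 enumeration rather than the aux jars.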
>>
>> Any insights appreciated.
>>
>> Marc
>>
>
