RE: Hive on Amazon EC2 with S3

Suman.Addanki Thu, 30 Aug 2012 10:25:53 -0700

Thank you Joe. It works now. I will try to read up on the differences between 
CombineHiveInputFormat and HiveInputFormat.


From: Joe Crobak [mailto:joec...@gmail.com]
Sent: Tuesday, August 28, 2012 10:22 PM
To: user@hive.apache.org
Subject: Re: Hive on Amazon EC2 with S3

Hi Suman,

We've seen this happen due to a bug in Hive's CombineHiveInputFormat. Try 
disabling that before querying by issuing:

SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
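
As a rough sketch of that workaround in a Hive CLI session (the table name mytable1 comes from the original question; this only illustrates the session-level setting and does not fix the underlying bug):

```sql
-- Print the input format currently in effect for this session.
SET hive.input.format;

-- Fall back to the non-combining input format for this session only.
SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

-- Re-run the query that needs a map-reduce job.
SELECT COUNT(*) FROM mytable1;
```

A SET issued this way lasts only for the current session, so it has to be repeated in each new session unless the property is also set in the Hive configuration.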

HTH,
Joe

On Fri, Aug 24, 2012 at 4:43 PM, <suman.adda...@sanofipasteur.com> wrote:
Hi,
I have set up a Hadoop cluster on Amazon EC2 with my data stored on S3. I would 
like to use Hive to process the data on S3.

I created an external table in hive using the following:
CREATE EXTERNAL TABLE mytable1
(
  HIT_TIME_GMT            string,
  SERVICE                 string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3n://com.xxxxx.webanalytics/hive/';

I loaded a few records into the table using:
LOAD DATA LOCAL INPATH '/home/ubuntu/data/play/test' INTO TABLE mytable1;

Select * from mytable1; shows me the data in the table.

When I try to run a query that requires a map-reduce job, for example 
select count(*) from mytable1; an exception is thrown:
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
java.io.FileNotFoundException: File does not exist: /hive/test
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:527)
        at org.apache.hadoop.mapred.lib.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:462)
        at org.apache.hadoop.mapred.lib.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:256)
        at org.apache.hadoop.mapred.lib.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:212)
        at org.apache.hadoop.hive.shims.Hadoop20SShims$CombineFileInputFormatShim.getSplits(Hadoop20SShims.java:347)
        at org.apache.hadoop.hive.shims.Hadoop20SShims$CombineFileInputFormatShim.getSplits(Hadoop20SShims.java:313)
        at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:377)
        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1026)
        at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1018)
        at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:929)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:882)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:882)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:856)
        at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:671)
        at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:123)
        at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:131)
        at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
        at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1063)
        at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:900)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:748)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:209)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:286)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:516)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:601)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:197)
Job Submission failed with exception 'java.io.FileNotFoundException(File does not exist: /hive/test)'
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask

The file does exist and I can see it on S3. Select * from table is returning 
the data in the table. I am not sure what is going wrong when a map-reduce job 
is being initiated by the Hive query. Any pointers as to where I went wrong? 
I appreciate your help.

Thank you
Suman
