Thank you, Joe. It works now. I will read up on the differences between CombineHiveInputFormat and HiveInputFormat.
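
For reference, the workaround from the quoted reply below can be sketched as a short Hive session. This is only an illustration under stated assumptions: it reuses the table name mytable1 from the original message, and it assumes the setting is issued in the same Hive CLI session (or script) as the query, since a SET made this way applies only to that session.

```sql
-- Print the current value of the property. On this setup it is expected to
-- name CombineHiveInputFormat, the class that appears in the stack trace.
SET hive.input.format;

-- Workaround from Joe's reply: use the non-combining input format
-- for the rest of this session.
SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

-- The query that previously failed at job submission
-- (mytable1 is the external table from the original message).
SELECT COUNT(*) FROM mytable1;
```

The comments are kept on their own lines rather than trailing a statement, which is the safer form when the statements are run from a script file.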

From: Joe Crobak [mailto:joec...@gmail.com]
Sent: Tuesday, August 28, 2012 10:22 PM
To: user@hive.apache.org
Subject: Re: Hive on Amazon EC2 with S3

Hi Suman,

We've seen this happen due to a bug in Hive's CombineHiveInputFormat. Try disabling that before querying by issuing:

SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

HTH,
Joe

On Fri, Aug 24, 2012 at 4:43 PM, <suman.adda...@sanofipasteur.com> wrote:

Hi,

I have set up a Hadoop cluster on Amazon EC2 with my data stored on S3. I would like to use Hive to process the data on S3.

I created an external table in Hive using the following:

CREATE EXTERNAL TABLE mytable1
(
  HIT_TIME_GMT string,
  SERVICE string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3n://com.xxxxx.webanalytics/hive/';

I loaded a few records into the table:

LOAD DATA LOCAL INPATH '/home/ubuntu/data/play/test' INTO TABLE mytable1;

Select * from mytable1; shows me the data in the table.

When I try to run a query that requires a map-reduce job, for example select count(*) from mytable1;, I see an exception thrown.

Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
java.io.FileNotFoundException: File does not exist: /hive/test
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:527)
    at org.apache.hadoop.mapred.lib.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:462)
    at org.apache.hadoop.mapred.lib.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:256)
    at org.apache.hadoop.mapred.lib.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:212)
    at org.apache.hadoop.hive.shims.Hadoop20SShims$CombineFileInputFormatShim.getSplits(Hadoop20SShims.java:347)
    at org.apache.hadoop.hive.shims.Hadoop20SShims$CombineFileInputFormatShim.getSplits(Hadoop20SShims.java:313)
    at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:377)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1026)
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1018)
    at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:929)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:882)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:882)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:856)
    at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:671)
    at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:123)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:131)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
    at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1063)
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:900)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:748)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:209)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:286)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:516)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:197)
Job Submission failed with exception 'java.io.FileNotFoundException(File does not exist: /hive/test)'
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask

The file does exist and I can see it on S3. Select * from table is returning the data in the table. I am not sure what is going wrong when a map-reduce job is being initiated by the hive query.

Any pointer as to where I went wrong? Appreciate your help.

Thank you
Suman