Also, on EMR the default file system for reading regular files is s3 rather than s3n (the latter being the block file system, which, as I understand it, needs a dedicated bucket of its own). Basically, s3 and s3n are switched relative to the Apache implementation.
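To illustrate (just a sketch, reusing the bucket and path from Ranjan's message below as placeholders), the same data would typically be addressed like this:

  LOCATION 's3://my.bucket/hive/ranjan_test'    -- on EMR
  LOCATION 's3n://my.bucket/hive/ranjan_test'   -- on stock Apache Hadoop (S3 native)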
Another potential issue is that Hive (at least the EMR version) can't read individual files. It assumes that tables are "located" at a directory level and that the files in the directory are the contents of the table. So in your case I would use 's3://my.bucket/hive' as the location (or s3n if you are not using EMR); there's a sketch of the adjusted DDL at the bottom of this message.

igor
decide.com

On Thu, Dec 15, 2011 at 8:09 PM, Mark Grover <mgro...@oanda.com> wrote:

> Hi Ranjan,
> A couple of ideas come to mind:
>
> 1) Do an explain (or explain extended) on the query to find out where exactly
> Hive is trying to read/write the file it's complaining about.
>
> 2) Look at your job conf file. There is a hyperlink to it from your Job Tracker
> web page. See if there is a config option there that is pointing to the
> /hive/ranjan_test directory. If you want, you can share it here for folks to
> see if anything is out of the ordinary.
>
> BTW, are you using Amazon EMR? If so, it might be worthwhile to post on the
> AWS forums.
>
> Mark
>
> ----- Original Message -----
> From: "Ranjan Bagchi" <ran...@powerreviews.com>
> To: user@hive.apache.org
> Sent: Thursday, December 15, 2011 8:30:42 PM
> Subject: Help with a table located on s3n
>
> Hi,
>
> I'm experiencing the following:
>
> I've a file on s3 -- s3n://my.bucket/hive/ranjan_test. It's got fields
> (separated by \001) and records (separated by \n).
>
> I want it to be accessible on hive, the DDL is:
>
> CREATE EXTERNAL TABLE IF NOT EXISTS ranjan_test (
>   ip_address string,
>   num_counted int
> )
> STORED AS TEXTFILE
> LOCATION 's3n://my.bucket/hive/ranjan_test'
>
> I'm able to do a simple query:
>
> hive> select * from ranjan_test limit 5;
> OK
> 98.226.198.23   1676
> 74.76.148.21    1560
> 76.64.28.25     1529
> 170.37.227.10   1363
> 71.202.128.196  1232
> Time taken: 4.172 seconds
>
> What I can't do is any select which fires off a mapreduce:
>
> hive> select count(*) from ranjan_test;
> Total MapReduce jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks determined at compile time: 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=<number>
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=<number>
> In order to set a constant number of reducers:
>   set mapred.reduce.tasks=<number>
> java.io.FileNotFoundException: File does not exist: /hive/ranjan_test/part-00000
>     at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:546)
>     at org.apache.hadoop.mapred.lib.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:462)
>     at org.apache.hadoop.mapred.lib.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:256)
>     at org.apache.hadoop.mapred.lib.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:212)
>     at org.apache.hadoop.hive.shims.Hadoop20SShims$CombineFileInputFormatShim.getSplits(Hadoop20SShims.java:347)
>     at org.apache.hadoop.hive.shims.Hadoop20SShims$CombineFileInputFormatShim.getSplits(Hadoop20SShims.java:313)
>     at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:377)
>     at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:971)
>     at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:963)
>     at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
>     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:880)
>     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:807)
>     at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:671)
>     at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:123)
>     at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:130)
>     at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
>     at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1063)
>     at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:900)
>     at org.apache.hadoop.hive.ql.Driver.run(Driver.java:748)
>     at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:209)
>     at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:286)
>     at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:513)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> Job Submission failed with exception 'java.io.FileNotFoundException(File does not exist: /hive/ranjan_test/part-00000)'
> FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask
>
> Any help? The AWS credentials seem good, 'cause otherwise I wouldn't get the
> initial stuff. Should I be doing something with the other machines in the
> cluster?
>
> Thanks in advance,
>
> Ranjan
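
For reference, here's roughly what the DDL would look like with both suggestions applied. This is only a sketch: it assumes you're running on EMR and that the files sitting directly under s3://my.bucket/hive are exactly the ones that should make up the table (any other files in that directory would be read as table data too):

CREATE EXTERNAL TABLE IF NOT EXISTS ranjan_test (
  ip_address string,
  num_counted int
)
STORED AS TEXTFILE
LOCATION 's3://my.bucket/hive';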