It sounds like you are using Apache Hive. I don't think it supports querying data on S3, does it?
https://groups.google.com/a/cloudera.org/group/cdh-user/browse_thread/thread/34693af0fa6a9101
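For what it's worth, the later messages in this thread show that non-MapReduce reads from an s3n:// location do succeed, so Hive can at least see the data; the usual prerequisite is that the Hadoop configuration carries the s3n credentials. A minimal sketch of that wiring from the Hive CLI, assuming the standard fs.s3n.* properties (the key values below are placeholders, not anything taken from this thread):

    hive> set fs.s3n.awsAccessKeyId=YOUR_ACCESS_KEY_ID;
    hive> set fs.s3n.awsSecretAccessKey=YOUR_SECRET_ACCESS_KEY;
    hive> select * from ranjan_test limit 5;

The same two properties can instead be set once in core-site.xml rather than per session.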
On Dec 16, 2011, at 10:43 AM, Ranjan Bagchi wrote:

> Following up with more information:
>
> * The hadoop cluster is on EC2, not EMR, but I'll try bringing it up on EMR.
> * I don't see a job conf on the tracker page -- I'm semi-suspicious it never makes it that far.
> * Here's the extended explain plan: it doesn't look glaringly wrong.
>
> Totally appreciate any help,
>
> Ranjan
>
> hive> explain extended select count(*) from ranjan_test;
> OK
> ABSTRACT SYNTAX TREE:
>   (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME ranjan_test))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_FUNCTIONSTAR count)))))
>
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
>
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         ranjan_test
>           TableScan
>             alias: ranjan_test
>             GatherStats: false
>             Select Operator
>               Group By Operator
>                 aggregations:
>                       expr: count()
>                 bucketGroup: false
>                 mode: hash
>                 outputColumnNames: _col0
>                 Reduce Output Operator
>                   sort order:
>                   tag: -1
>                   value expressions:
>                         expr: _col0
>                         type: bigint
>       Needs Tagging: false
>       Path -> Alias:
>         s3n://my.bucket/hive/ranjan_test [ranjan_test]
>       Path -> Partition:
>         s3n://my.bucket/hive/ranjan_test
>           Partition
>             base file name: ranjan_test
>             input format: org.apache.hadoop.mapred.TextInputFormat
>             output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>             properties:
>               EXTERNAL TRUE
>               bucket_count -1
>               columns ip_address,num_counted
>               columns.types string:int
>               file.inputformat org.apache.hadoop.mapred.TextInputFormat
>               file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>               location s3n://my.bucket/hive/ranjan_test
>               name default.ranjan_test
>               serialization.ddl struct ranjan_test { string ip_address, i32 num_counted}
>               serialization.format 1
>               serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>               transient_lastDdlTime 1323982126
>             serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>
>             input format: org.apache.hadoop.mapred.TextInputFormat
>             output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>             properties:
>               EXTERNAL TRUE
>               bucket_count -1
>               columns ip_address,num_counted
>               columns.types string:int
>               file.inputformat org.apache.hadoop.mapred.TextInputFormat
>               file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>               location s3n://my.bucket/hive/ranjan_test
>               name default.ranjan_test
>               serialization.ddl struct ranjan_test { string ip_address, i32 num_counted}
>               serialization.format 1
>               serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>               transient_lastDdlTime 1323982126
>             serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>             name: default.ranjan_test
>           name: default.ranjan_test
>       Reduce Operator Tree:
>         Group By Operator
>           aggregations:
>                 expr: count(VALUE._col0)
>           bucketGroup: false
>           mode: mergepartial
>           outputColumnNames: _col0
>           Select Operator
>             expressions:
>                   expr: _col0
>                   type: bigint
>             outputColumnNames: _col0
>             File Output Operator
>               compressed: false
>               GlobalTableId: 0
>               directory: hdfs://ip-10-122-91-181.ec2.internal:8020/tmp/hive-ranjan/hive_2011-12-16_13-36-36_068_574825253968459560/-ext-10001
>               NumFilesPerFileSink: 1
>               Stats Publishing Key Prefix: hdfs://ip-10-122-91-181.ec2.internal:8020/tmp/hive-ranjan/hive_2011-12-16_13-36-36_068_574825253968459560/-ext-10001/
>               table:
>                   input format: org.apache.hadoop.mapred.TextInputFormat
>                   output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                   properties:
>                     columns _col0
>                     columns.types bigint
>                     serialization.format 1
>               TotalFiles: 1
>               GatherStats: false
>               MultiFileSpray: false
>
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
>
>
> Time taken: 0.156 seconds
>
> On Dec 15, 2011, at 5:30 PM, Ranjan Bagchi wrote:
>
>> Hi,
>>
>> I'm experiencing the following:
>>
>> I have a file on S3 -- s3n://my.bucket/hive/ranjan_test. It has fields (separated by \001) and records (separated by \n).
>>
>> I want it to be accessible in Hive; the DDL is:
>>
>> CREATE EXTERNAL TABLE IF NOT EXISTS ranjan_test (
>>   ip_address string,
>>   num_counted int
>> )
>> STORED AS TEXTFILE
>> LOCATION 's3n://my.bucket/hive/ranjan_test';
>>
>> I'm able to do a simple query:
>>
>> hive> select * from ranjan_test limit 5;
>> OK
>> 98.226.198.23 1676
>> 74.76.148.21 1560
>> 76.64.28.25 1529
>> 170.37.227.10 1363
>> 71.202.128.196 1232
>> Time taken: 4.172 seconds
>>
>> What I can't do is any select which fires off a MapReduce job:
>>
>> hive> select count(*) from ranjan_test;
>> Total MapReduce jobs = 1
>> Launching Job 1 out of 1
>> Number of reduce tasks determined at compile time: 1
>> In order to change the average load for a reducer (in bytes):
>>   set hive.exec.reducers.bytes.per.reducer=<number>
>> In order to limit the maximum number of reducers:
>>   set hive.exec.reducers.max=<number>
>> In order to set a constant number of reducers:
>>   set mapred.reduce.tasks=<number>
>> java.io.FileNotFoundException: File does not exist: /hive/ranjan_test/part-00000
>>   at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:546)
>>   at org.apache.hadoop.mapred.lib.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:462)
>>   at org.apache.hadoop.mapred.lib.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:256)
>>   at org.apache.hadoop.mapred.lib.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:212)
>>   at org.apache.hadoop.hive.shims.Hadoop20SShims$CombineFileInputFormatShim.getSplits(Hadoop20SShims.java:347)
>>   at org.apache.hadoop.hive.shims.Hadoop20SShims$CombineFileInputFormatShim.getSplits(Hadoop20SShims.java:313)
>>   at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:377)
>>   at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:971)
>>   at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:963)
>>   at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
>>   at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:880)
>>   at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
>>   at java.security.AccessController.doPrivileged(Native Method)
>>   at javax.security.auth.Subject.doAs(Subject.java:396)
>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
>>   at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
>>   at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:807)
>>   at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:671)
>>   at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:123)
>>   at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:130)
>>   at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
>>   at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1063)
>>   at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:900)
>>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:748)
>>   at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:209)
>>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:286)
>>   at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:513)
>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>   at java.lang.reflect.Method.invoke(Method.java:597)
>>   at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
>> Job Submission failed with exception 'java.io.FileNotFoundException(File does not exist: /hive/ranjan_test/part-00000)'
>> FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask
>>
>> Any help? The AWS credentials seem good, since otherwise I wouldn't get the initial results at all. Should I be doing something with the other machines in the cluster?
>>
>> Thanks in advance,
>>
>> Ranjan
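The stack trace above is worth a second look: the missing path is reported as /hive/ranjan_test/part-00000 with no s3n:// scheme, and the lookup goes through DistributedFileSystem via CombineFileInputFormat, so the combined-split planner appears to be resolving the table location against HDFS rather than S3. A workaround sometimes suggested for this class of failure (not confirmed anywhere in this thread) is to fall back from CombineHiveInputFormat to the plain HiveInputFormat for the session:

    hive> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
    hive> select count(*) from ranjan_test;

If the count then runs, the issue is specific to how the combine-split shim handles non-HDFS filesystems rather than to the table definition or the credentials.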