Following up with more information:

* The hadoop cluster is on EC2, not EMR, but I'll try bringing it up on EMR.
* I don't see a job conf on the tracker page -- I'm semi-suspicious it never makes it that far (more on this just after the list).
* Here's the extended explain plan: it doesn't look glaringly wrong.
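On the second bullet: in the stack trace from my first mail (quoted below), the FileNotFoundException comes out of CombineFileInputFormat.getSplits via JobClient.submitJobInternal -- split computation on the client -- so the job would die before anything reaches the JobTracker, which fits with no job conf ever appearing. For what it's worth, the input format in play can be read back from the CLI; I haven't overridden it, so I'd expect it to report org.apache.hadoop.hive.ql.io.CombineHiveInputFormat, consistent with the Hadoop20SShims frames in the trace:

hive> set hive.input.format;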
Totally appreciate any help,
Ranjan

hive> explain extended select count(*) from ranjan_test;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME ranjan_test))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_FUNCTIONSTAR count)))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        ranjan_test
          TableScan
            alias: ranjan_test
            GatherStats: false
            Select Operator
              Group By Operator
                aggregations:
                      expr: count()
                bucketGroup: false
                mode: hash
                outputColumnNames: _col0
                Reduce Output Operator
                  sort order:
                  tag: -1
                  value expressions:
                        expr: _col0
                        type: bigint
      Needs Tagging: false
      Path -> Alias:
        s3n://my.bucket/hive/ranjan_test [ranjan_test]
      Path -> Partition:
        s3n://my.bucket/hive/ranjan_test
          Partition
            base file name: ranjan_test
            input format: org.apache.hadoop.mapred.TextInputFormat
            output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
            properties:
              EXTERNAL TRUE
              bucket_count -1
              columns ip_address,num_counted
              columns.types string:int
              file.inputformat org.apache.hadoop.mapred.TextInputFormat
              file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              location s3n://my.bucket/hive/ranjan_test
              name default.ranjan_test
              serialization.ddl struct ranjan_test { string ip_address, i32 num_counted}
              serialization.format 1
              serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              transient_lastDdlTime 1323982126
            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              properties:
                EXTERNAL TRUE
                bucket_count -1
                columns ip_address,num_counted
                columns.types string:int
                file.inputformat org.apache.hadoop.mapred.TextInputFormat
                file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                location s3n://my.bucket/hive/ranjan_test
                name default.ranjan_test
                serialization.ddl struct ranjan_test { string ip_address, i32 num_counted}
                serialization.format 1
                serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                transient_lastDdlTime 1323982126
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: default.ranjan_test
            name: default.ranjan_test
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(VALUE._col0)
          bucketGroup: false
          mode: mergepartial
          outputColumnNames: _col0
          Select Operator
            expressions:
                  expr: _col0
                  type: bigint
            outputColumnNames: _col0
            File Output Operator
              compressed: false
              GlobalTableId: 0
              directory: hdfs://ip-10-122-91-181.ec2.internal:8020/tmp/hive-ranjan/hive_2011-12-16_13-36-36_068_574825253968459560/-ext-10001
              NumFilesPerFileSink: 1
              Stats Publishing Key Prefix: hdfs://ip-10-122-91-181.ec2.internal:8020/tmp/hive-ranjan/hive_2011-12-16_13-36-36_068_574825253968459560/-ext-10001/
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  properties:
                    columns _col0
                    columns.types bigint
                    serialization.format 1
              TotalFiles: 1
              GatherStats: false
              MultiFileSpray: false

  Stage: Stage-0
    Fetch Operator
      limit: -1

Time taken: 0.156 seconds
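One more observation on the plan above: Path -> Alias and Path -> Partition both carry the full s3n://my.bucket/hive/ranjan_test location, yet the exception in my first mail (quoted below) is for /hive/ranjan_test/part-00000 -- scheme and bucket stripped, and resolved against HDFS (DistributedFileSystem.getFileStatus). Since all the failing frames run through the combine shim, one experiment I'm going to try is switching to the non-combining input format so that code path is bypassed entirely. This is purely a guess on my part, not a verified fix:

-- speculative: bypass CombineHiveInputFormat, whose split computation is
-- where the exception is thrown, then rerun the failing query
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
select count(*) from ranjan_test;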
On Dec 15, 2011, at 5:30 PM, Ranjan Bagchi wrote:

> Hi,
>
> I'm experiencing the following:
>
> I've a file on s3 -- s3n://my.bucket/hive/ranjan_test. It's got fields
> (separated by \001) and records (separated by \n).
>
> I want it to be accessible on hive, the ddl is:
>
> CREATE EXTERNAL TABLE IF NOT EXISTS ranjan_test (
>   ip_address string,
>   num_counted int
> )
> STORED AS TEXTFILE
> LOCATION 's3n://my.bucket/hive/ranjan_test'
>
> I'm able to do a simple query:
>
> hive> select * from ranjan_test limit 5;
> OK
> 98.226.198.23 1676
> 74.76.148.21 1560
> 76.64.28.25 1529
> 170.37.227.10 1363
> 71.202.128.196 1232
> Time taken: 4.172 seconds
>
> What I can't do is any select which fires off a mapreduce:
>
> hive> select count(*) from ranjan_test;
> Total MapReduce jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks determined at compile time: 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=<number>
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=<number>
> In order to set a constant number of reducers:
>   set mapred.reduce.tasks=<number>
> java.io.FileNotFoundException: File does not exist: /hive/ranjan_test/part-00000
>         at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:546)
>         at org.apache.hadoop.mapred.lib.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:462)
>         at org.apache.hadoop.mapred.lib.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:256)
>         at org.apache.hadoop.mapred.lib.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:212)
>         at org.apache.hadoop.hive.shims.Hadoop20SShims$CombineFileInputFormatShim.getSplits(Hadoop20SShims.java:347)
>         at org.apache.hadoop.hive.shims.Hadoop20SShims$CombineFileInputFormatShim.getSplits(Hadoop20SShims.java:313)
>         at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:377)
>         at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:971)
>         at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:963)
>         at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
>         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:880)
>         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:807)
>         at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:671)
>         at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:123)
>         at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:130)
>         at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
>         at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1063)
>         at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:900)
>         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:748)
>         at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:209)
>         at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:286)
>         at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:513)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> Job Submission failed with exception 'java.io.FileNotFoundException(File does not exist: /hive/ranjan_test/part-00000)'
> FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask
>
> Any help? The AWS credentials seem good, 'cause otherwise I wouldn't get the
> initial stuff. Should I be doing something with the other machines in the
> cluster?
>
> Thanks in advance,
>
> Ranjan