Following up with more information:

* The Hadoop cluster is on EC2, not EMR, but I'll try bringing it up on EMR.
* I don't see a job conf on the tracker page -- I'm semi-suspicious the job 
never makes it that far (the trace below fails inside split computation, 
before submission).
* The path in the exception, /hive/ranjan_test/part-00000, has lost its 
s3n://my.bucket prefix and is being looked up on HDFS (note the 
DistributedFileSystem frame).
* Here's the extended explain plan; it doesn't look glaringly wrong to me.

Totally appreciate any help,

Ranjan

hive> explain extended select count(*) from ranjan_test;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME ranjan_test))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_FUNCTIONSTAR count)))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        ranjan_test 
          TableScan
            alias: ranjan_test
            GatherStats: false
            Select Operator
              Group By Operator
                aggregations:
                      expr: count()
                bucketGroup: false
                mode: hash
                outputColumnNames: _col0
                Reduce Output Operator
                  sort order: 
                  tag: -1
                  value expressions:
                        expr: _col0
                        type: bigint
      Needs Tagging: false
      Path -> Alias:
        s3n://my.bucket/hive/ranjan_test [ranjan_test]
      Path -> Partition:
        s3n://my.bucket/hive/ranjan_test 
          Partition
            base file name: ranjan_test
            input format: org.apache.hadoop.mapred.TextInputFormat
            output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
            properties:
              EXTERNAL TRUE
              bucket_count -1
              columns ip_address,num_counted
              columns.types string:int
              file.inputformat org.apache.hadoop.mapred.TextInputFormat
              file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              location s3n://my.bucket/hive/ranjan_test
              name default.ranjan_test
              serialization.ddl struct ranjan_test { string ip_address, i32 num_counted}
              serialization.format 1
              serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              transient_lastDdlTime 1323982126
            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
          
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              properties:
                EXTERNAL TRUE
                bucket_count -1
                columns ip_address,num_counted
                columns.types string:int
                file.inputformat org.apache.hadoop.mapred.TextInputFormat
                file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                location s3n://my.bucket/hive/ranjan_test
                name default.ranjan_test
                serialization.ddl struct ranjan_test { string ip_address, i32 num_counted}
                serialization.format 1
                serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                transient_lastDdlTime 1323982126
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: default.ranjan_test
            name: default.ranjan_test
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(VALUE._col0)
          bucketGroup: false
          mode: mergepartial
          outputColumnNames: _col0
          Select Operator
            expressions:
                  expr: _col0
                  type: bigint
            outputColumnNames: _col0
            File Output Operator
              compressed: false
              GlobalTableId: 0
              directory: hdfs://ip-10-122-91-181.ec2.internal:8020/tmp/hive-ranjan/hive_2011-12-16_13-36-36_068_574825253968459560/-ext-10001
              NumFilesPerFileSink: 1
              Stats Publishing Key Prefix: hdfs://ip-10-122-91-181.ec2.internal:8020/tmp/hive-ranjan/hive_2011-12-16_13-36-36_068_574825253968459560/-ext-10001/
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  properties:
                    columns _col0
                    columns.types bigint
                    serialization.format 1
              TotalFiles: 1
              GatherStats: false
              MultiFileSpray: false

  Stage: Stage-0
    Fetch Operator
      limit: -1


Time taken: 0.156 seconds
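
One thing I plan to try (an assumption on my part: the stack trace below 
points at CombineHiveInputFormat, so perhaps the combined-split path is what 
drops the s3n:// scheme) is falling back to the plain per-file input format:

hive> set hive.input.format;
hive> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
hive> select count(*) from ranjan_test;

The first set just prints the value currently in effect; the second swaps in 
HiveInputFormat so splits are computed per path rather than combined.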

On Dec 15, 2011, at 5:30 PM, Ranjan Bagchi wrote:

> Hi,
> 
> I'm experiencing the following:  
> 
> I have a file on S3 -- s3n://my.bucket/hive/ranjan_test.  Its fields are 
> separated by \001 and its records by \n.
> 
> I want it to be accessible in Hive; the DDL is:
> CREATE EXTERNAL TABLE IF NOT EXISTS ranjan_test (
>   ip_address string,
>   num_counted int
> )
> STORED AS TEXTFILE
> LOCATION 's3n://my.bucket/hive/ranjan_test';
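> 
> \001 and \n are Hive's defaults, so I left the row format implicit; an 
> equivalent DDL spelling the delimiters out explicitly, just to rule out 
> serde guesswork, would be:
> 
> CREATE EXTERNAL TABLE IF NOT EXISTS ranjan_test (
>   ip_address string,
>   num_counted int
> )
> ROW FORMAT DELIMITED
>   FIELDS TERMINATED BY '\001'
>   LINES TERMINATED BY '\n'
> STORED AS TEXTFILE
> LOCATION 's3n://my.bucket/hive/ranjan_test';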
> 
> I'm able to do a simple query:
> 
> hive> select * from ranjan_test limit 5;
> OK
> 98.226.198.23 1676
> 74.76.148.21  1560
> 76.64.28.25   1529
> 170.37.227.10 1363
> 71.202.128.196        1232
> Time taken: 4.172 seconds
> 
> What I can't do is run any select that fires off a MapReduce job:
> 
> hive> select count(*) from ranjan_test;
> Total MapReduce jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks determined at compile time: 1
> In order to change the average load for a reducer (in bytes):
> set hive.exec.reducers.bytes.per.reducer=<number>
> In order to limit the maximum number of reducers:
> set hive.exec.reducers.max=<number>
> In order to set a constant number of reducers:
> set mapred.reduce.tasks=<number>
> java.io.FileNotFoundException: File does not exist: /hive/ranjan_test/part-00000
>       at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:546)
>       at org.apache.hadoop.mapred.lib.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:462)
>       at org.apache.hadoop.mapred.lib.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:256)
>       at org.apache.hadoop.mapred.lib.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:212)
>       at org.apache.hadoop.hive.shims.Hadoop20SShims$CombineFileInputFormatShim.getSplits(Hadoop20SShims.java:347)
>       at org.apache.hadoop.hive.shims.Hadoop20SShims$CombineFileInputFormatShim.getSplits(Hadoop20SShims.java:313)
>       at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:377)
>       at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:971)
>       at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:963)
>       at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
>       at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:880)
>       at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:396)
>       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
>       at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
>       at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:807)
>       at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:671)
>       at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:123)
>       at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:130)
>       at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
>       at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1063)
>       at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:900)
>       at org.apache.hadoop.hive.ql.Driver.run(Driver.java:748)
>       at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:209)
>       at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:286)
>       at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:513)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>       at java.lang.reflect.Method.invoke(Method.java:597)
>       at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> Job Submission failed with exception 'java.io.FileNotFoundException(File does not exist: /hive/ranjan_test/part-00000)'
> FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask
> 
> 
> Any help?  The AWS credentials seem good, since otherwise the select * 
> above wouldn't have worked.  Should I be doing something with the other 
> machines in the cluster?
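> 
> One guess I can't rule out: the worker nodes might need the S3 credentials 
> too.  As far as I know they'd normally live in each node's core-site.xml 
> as fs.s3n.awsAccessKeyId / fs.s3n.awsSecretAccessKey; setting them 
> per-session should also push them into the job conf (placeholder values 
> here, obviously):
> 
> hive> set fs.s3n.awsAccessKeyId=<my access key>;
> hive> set fs.s3n.awsSecretAccessKey=<my secret key>;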
> 
> Thanks in advance,
> 
> Ranjan
