It sounds like you are using Apache Hive. I don’t think it supports querying 
data on S3, does it?

https://groups.google.com/a/cloudera.org/group/cdh-user/browse_thread/thread/34693af0fa6a9101
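
That said, the stack trace below points at something more specific than S3 support in general: split computation is happening in CombineHiveInputFormat, and the CombineFileInputFormat it delegates to is resolving /hive/ranjan_test/part-00000 against HDFS (DistributedFileSystem) instead of s3n. If that is the culprit, one thing worth trying -- a sketch, assuming your Hive honors the standard hive.input.format property -- is to fall back to the non-combining input format for the session:

hive> -- compute splits with the plain input format, which opens each path
hive> -- with its own FileSystem (here s3n) rather than the cluster's default FS
hive> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
hive> select count(*) from ranjan_test;

If the count runs after that, the problem is the combine shim mishandling non-default filesystems, not S3 access as such.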

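On your closing question about the other machines: yes, potentially. The map tasks open S3 directly, so every node that runs a task needs the s3n credentials, not just the box where you run the CLI. A quick way to test that theory from one session -- again a sketch, using the standard Hadoop property names with placeholder values you would fill in -- is to push the keys through the job conf:

hive> -- these properties are copied into the job conf, so the task JVMs see them
hive> set fs.s3n.awsAccessKeyId=<your-access-key>;
hive> set fs.s3n.awsSecretAccessKey=<your-secret-key>;

Though given that your select * works and the failure is a FileNotFoundException against HDFS rather than an S3 permission error, credentials look like the less likely cause here.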

On Dec 16, 2011, at 10:43 AM, Ranjan Bagchi wrote:

> Following up with more information:
> 
> * The hadoop cluster is on EC2, not EMR, but I'll try bringing it up on EMR.
> * I don't see a job conf on the tracker page -- I'm semi-suspicious it never 
> makes it that far.
> * Here's the extended explain plan: it doesn't look glaringly wrong.
> 
> Totally appreciate any help,
> 
> Ranjan
> 
> hive> explain extended select count(*) from ranjan_test;
> OK
> ABSTRACT SYNTAX TREE:
>   (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME ranjan_test))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_FUNCTIONSTAR count)))))
> 
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
> 
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         ranjan_test 
>           TableScan
>             alias: ranjan_test
>             GatherStats: false
>             Select Operator
>               Group By Operator
>                 aggregations:
>                       expr: count()
>                 bucketGroup: false
>                 mode: hash
>                 outputColumnNames: _col0
>                 Reduce Output Operator
>                   sort order: 
>                   tag: -1
>                   value expressions:
>                         expr: _col0
>                         type: bigint
>       Needs Tagging: false
>       Path -> Alias:
>         s3n://my.bucket/hive/ranjan_test [ranjan_test]
>       Path -> Partition:
>         s3n://my.bucket/hive/ranjan_test 
>           Partition
>             base file name: ranjan_test
>             input format: org.apache.hadoop.mapred.TextInputFormat
>             output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>             properties:
>               EXTERNAL TRUE
>               bucket_count -1
>               columns ip_address,num_counted
>               columns.types string:int
>               file.inputformat org.apache.hadoop.mapred.TextInputFormat
>               file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>               location s3n://my.bucket/hive/ranjan_test
>               name default.ranjan_test
>               serialization.ddl struct ranjan_test { string ip_address, i32 num_counted}
>               serialization.format 1
>               serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>               transient_lastDdlTime 1323982126
>             serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>           
>               input format: org.apache.hadoop.mapred.TextInputFormat
>               output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>               properties:
>                 EXTERNAL TRUE
>                 bucket_count -1
>                 columns ip_address,num_counted
>                 columns.types string:int
>                 file.inputformat org.apache.hadoop.mapred.TextInputFormat
>                 file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                 location s3n://my.bucket/hive/ranjan_test
>                 name default.ranjan_test
>                 serialization.ddl struct ranjan_test { string ip_address, i32 num_counted}
>                 serialization.format 1
>                 serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                 transient_lastDdlTime 1323982126
>               serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>               name: default.ranjan_test
>             name: default.ranjan_test
>       Reduce Operator Tree:
>         Group By Operator
>           aggregations:
>                 expr: count(VALUE._col0)
>           bucketGroup: false
>           mode: mergepartial
>           outputColumnNames: _col0
>           Select Operator
>             expressions:
>                   expr: _col0
>                   type: bigint
>             outputColumnNames: _col0
>             File Output Operator
>               compressed: false
>               GlobalTableId: 0
>               directory: hdfs://ip-10-122-91-181.ec2.internal:8020/tmp/hive-ranjan/hive_2011-12-16_13-36-36_068_574825253968459560/-ext-10001
>               NumFilesPerFileSink: 1
>               Stats Publishing Key Prefix: hdfs://ip-10-122-91-181.ec2.internal:8020/tmp/hive-ranjan/hive_2011-12-16_13-36-36_068_574825253968459560/-ext-10001/
>               table:
>                   input format: org.apache.hadoop.mapred.TextInputFormat
>                   output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                   properties:
>                     columns _col0
>                     columns.types bigint
>                     serialization.format 1
>               TotalFiles: 1
>               GatherStats: false
>               MultiFileSpray: false
> 
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
> 
> 
> Time taken: 0.156 seconds
> 
> On Dec 15, 2011, at 5:30 PM, Ranjan Bagchi wrote:
> 
>> Hi,
>> 
>> I'm experiencing the following:  
>> 
>> I have a file on S3 -- s3n://my.bucket/hive/ranjan_test. Its fields are 
>> separated by \001 and its records by \n.
>> 
>> I want it to be accessible in Hive; the DDL is:
>> CREATE EXTERNAL TABLE IF NOT EXISTS ranjan_test (
>> ip_address string,
>> num_counted int
>> )
>> STORED AS TEXTFILE
>> LOCATION 's3n://my.bucket/hive/ranjan_test';
>> 
>> I'm able to do a simple query:
>> 
>> hive> select * from ranjan_test limit 5;
>> OK
>> 98.226.198.23        1676
>> 74.76.148.21 1560
>> 76.64.28.25  1529
>> 170.37.227.10        1363
>> 71.202.128.196       1232
>> Time taken: 4.172 seconds
>> 
>> What I can't do is run any select that fires off a MapReduce job:
>> 
>> hive> select count(*) from ranjan_test; 
>> Total MapReduce jobs = 1
>> Launching Job 1 out of 1
>> Number of reduce tasks determined at compile time: 1
>> In order to change the average load for a reducer (in bytes):
>> set hive.exec.reducers.bytes.per.reducer=<number>
>> In order to limit the maximum number of reducers:
>> set hive.exec.reducers.max=<number>
>> In order to set a constant number of reducers:
>> set mapred.reduce.tasks=<number>
>> java.io.FileNotFoundException: File does not exist: /hive/ranjan_test/part-00000
>>      at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:546)
>>      at org.apache.hadoop.mapred.lib.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:462)
>>      at org.apache.hadoop.mapred.lib.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:256)
>>      at org.apache.hadoop.mapred.lib.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:212)
>>      at org.apache.hadoop.hive.shims.Hadoop20SShims$CombineFileInputFormatShim.getSplits(Hadoop20SShims.java:347)
>>      at org.apache.hadoop.hive.shims.Hadoop20SShims$CombineFileInputFormatShim.getSplits(Hadoop20SShims.java:313)
>>      at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:377)
>>      at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:971)
>>      at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:963)
>>      at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
>>      at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:880)
>>      at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
>>      at java.security.AccessController.doPrivileged(Native Method)
>>      at javax.security.auth.Subject.doAs(Subject.java:396)
>>      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
>>      at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
>>      at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:807)
>>      at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:671)
>>      at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:123)
>>      at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:130)
>>      at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
>>      at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1063)
>>      at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:900)
>>      at org.apache.hadoop.hive.ql.Driver.run(Driver.java:748)
>>      at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:209)
>>      at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:286)
>>      at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:513)
>>      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>      at java.lang.reflect.Method.invoke(Method.java:597)
>>      at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
>> Job Submission failed with exception 'java.io.FileNotFoundException(File does not exist: /hive/ranjan_test/part-00000)'
>> FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask
>> 
>> 
>> Any help? The AWS credentials seem good, because otherwise the simple 
>> select above wouldn't have returned anything. Should I be doing something 
>> on the other machines in the cluster?
>> 
>> Thanks in advance,
>> 
>> Ranjan
> 
