Hi, I seem to have a problem getting Hive to use a custom InputFormat.

I am using Hive version 0.10.0 with Hadoop 1.0.4 on CentOS 6.3, currently in standalone mode. At this stage I am just experimenting.

I have a file with 10 records which I am using for testing, and I've created a table called zownvehead to access this file. So if I do

    select * from zownvehead;

I get the 10 records, and if I do

    select count(1) from zownvehead;

then I get the result 10. No surprises.

Now I've created my own class

    package com.trilliumsoftware.loader.duality;

    public class WrappedInputFormat implements InputFormat<LongWritable, Text>, JobConfigurable {

and I've written this class to restrict the number of records. Specifically, in the getSplits method, instead of returning the whole file I return two splits which effectively limit the data scanned to two records instead of 10. (Inside my class I create an instance of TextInputFormat and delegate all the calls to this instance, apart from getSplits, where I call the method on TextInputFormat and then use the result to build two new FileSplits, which I return instead.)

I delete the table and re-create it with the following:

    CREATE EXTERNAL TABLE zownvehead (PID STRING, ... lots of other columns elided... AHM_STAT_CODE STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    STORED AS INPUTFORMAT 'com.trilliumsoftware.loader.duality.WrappedInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

Now when I perform

    select * from zownvehead;

then, much to my delight, I see only the two records. However, when I perform

    select count(1) from zownvehead;

I get the result 10 and not 2, as I would expect. So the results of the two queries are inconsistent.

When I investigate I can see that, in the second query, the class CombineHiveInputFormat is being used. I can see that an instance of my class WrappedInputFormat is being constructed and configured.
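
To make the setup concrete, here is a minimal, Hadoop-free sketch of the delegation pattern described above. It is not my actual class: Split, SplitSource, WholeFileSplits and WrappedSplits are hypothetical stand-ins for FileSplit, the getSplits part of InputFormat, TextInputFormat and WrappedInputFormat respectively, and the bytesPerSplit parameter is an assumption made only so the example has concrete numbers.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for Hadoop's FileSplit: a byte range within one file. */
final class Split {
    final String path;
    final long start;
    final long length;

    Split(String path, long start, long length) {
        this.path = path;
        this.start = start;
        this.length = length;
    }
}

/** Hypothetical stand-in for the getSplits half of Hadoop's InputFormat. */
interface SplitSource {
    List<Split> getSplits(String path, long fileLength);
}

/** Plays the role of the delegate here: one split covering the whole file. */
final class WholeFileSplits implements SplitSource {
    @Override
    public List<Split> getSplits(String path, long fileLength) {
        List<Split> splits = new ArrayList<>();
        splits.add(new Split(path, 0L, fileLength));
        return splits;
    }
}

/** Asks the delegate for its splits, then returns at most two narrower ones. */
final class WrappedSplits implements SplitSource {
    private final SplitSource delegate;
    private final long bytesPerSplit; // assumed size of each narrowed split

    WrappedSplits(SplitSource delegate, long bytesPerSplit) {
        this.delegate = delegate;
        this.bytesPerSplit = bytesPerSplit;
    }

    @Override
    public List<Split> getSplits(String path, long fileLength) {
        List<Split> original = delegate.getSplits(path, fileLength);
        List<Split> narrowed = new ArrayList<>();
        if (original.isEmpty()) {
            return narrowed;
        }
        Split first = original.get(0);
        // Carve two ranges from the front of the first split, clamped to its length.
        long firstLen = Math.min(bytesPerSplit, first.length);
        narrowed.add(new Split(first.path, first.start, firstLen));
        long secondLen = Math.min(bytesPerSplit, first.length - firstLen);
        if (secondLen > 0) {
            narrowed.add(new Split(first.path, first.start + firstLen, secondLen));
        }
        return narrowed;
    }
}
```

The point of the sketch is that the narrowing only takes effect if the caller actually asks the wrapper for its splits. A caller that builds its own split for the whole file and hands it straight to the record reader never sees the restriction.
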
I can also see that when the query runs this instance of my class is being used to obtain a record reader; that is, the method

    public RecordReader<LongWritable,Text> getRecordReader(InputSplit split, JobConf jc, Reporter rprtr) throws IOException {

is being invoked. However, the getSplits method is _not_ being invoked, and the split being passed to the getRecordReader method is a FileSplit (or derived class) for the whole file.

I've had a look at the source of CombineHiveInputFormat and it seems to look up an InputFormat class, based on the path, on which to invoke getSplits. But I can't see why it might get it wrong, or what I can do to help it get it right. I suppose that I could build my own version of Hive with instrumentation to see exactly what's going on, but I'd like to avoid that if I can.

So can anyone tell me:

- why CombineHiveInputFormat is not calling getSplits on my wrapped class?
- why this only seems to happen if a Map/Reduce job is required?
- most importantly, what I have to do to get it to work the way that I expect?

Any help or comments would be welcome.

Peter Marron
Trillium Software UK Limited

Tel : +44 (0) 118 940 7609
Fax : +44 (0) 118 940 7699
E: peter.mar...@trilliumsoftware.com