Hi Owen,

Firstly I want to say a huge thank you. You have helped me enormously. I realize that you have been busy with other things (like release 0.11.0), so I appreciate you taking time out to help me.
> The critical piece is in OpProcFactory where the setFilterExpression is called.
>
> OpProcFactory.pushFilterToStorageHandler
>   calls tableScanDesc.setFilterExpr
>   passes to TableScanDesc.getFilterExpr
>   which is called by HiveInputFormat.pushFilters
>
> HiveInputFormat.pushFilters uses Utilities.serializeExpression to put it into
> the configuration.
>
> Unless something is screwing it up, it looks like it hangs together.

OK, I think that I get it now. In my custom InputFormat I can read the config settings

    jobConf.get("hive.io.filter.text");
    jobConf.get("hive.io.filter.expr.serialized");

and so find the predicate that I need to do the filtering. In particular, I can set the input splits so that only the right records are read.

> Really? With ORC, allowing the reader to skip over rows that don't matter is
> very important. Keeping Hive from rechecking the predicate is a nice to have.

Of course, you're right. It doesn't matter if the predicate is applied again to records that are already filtered. I meant that I couldn't afford to leave the filter in place, as that would mean a Map/Reduce job would run. But...

> There has been some work to add additional queries
> (https://issues.apache.org/jira/browse/HIVE-2925),
> but if what you want is to run locally without MR, yeah, getting the
> predicate into the RecordReader isn't enough.
>
> I haven't looked through HIVE-2925 to see what is supported, but that is where
> I'd start.
>
> -- Owen

You're right! HIVE-2925 is exactly what I want, and now that I have found out how to make it work

    set hive.fetch.task.conversion=more;

I am really in good shape. Thanks.

There are a couple of quick questions that I would like to know the answers to, though.

1) I didn't know about HIVE-2925, and I would never have guessed that suppressing the Map/Reduce would be controlled by something called "hive.fetch.task.conversion", so maybe I'm missing a trick. How should I have found out about HIVE-2925?
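To make the getSplits plan above concrete, here is a minimal sketch of the pruning logic. Everything Hive-specific is a stand-in: java.util.Properties plays the role of org.apache.hadoop.mapred.JobConf (whose get(String) behaves the same for these lookups), plain strings play the role of InputSplits, and splitMightMatch is a hypothetical placeholder for evaluating the filter against per-split metadata.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Sketch only: not a real InputFormat, just the pruning decision.
public class SplitPruningSketch {

    // The keys that HiveInputFormat.pushFilters writes into the job configuration.
    static final String FILTER_TEXT = "hive.io.filter.text";
    static final String FILTER_EXPR = "hive.io.filter.expr.serialized";

    // Keep only the splits that could match the pushed-down filter.
    // With no pushed filter, every split must be returned unchanged.
    static List<String> getSplits(Properties conf, List<String> allSplits) {
        String filterText = conf.getProperty(FILTER_TEXT);
        if (filterText == null) {
            return allSplits; // nothing pushed down: read everything
        }
        List<String> kept = new ArrayList<String>();
        for (String split : allSplits) {
            if (splitMightMatch(split, filterText)) {
                kept.add(split);
            }
        }
        return kept;
    }

    // Hypothetical pruning test: a real implementation would deserialize the
    // expression from FILTER_EXPR and evaluate it against per-split metadata
    // (min/max keys and the like), not do a string comparison.
    static boolean splitMightMatch(String split, String filterText) {
        return filterText.contains(split);
    }
}
```

The important behaviour is the null check: when Hive has pushed nothing down, the InputFormat must still return all splits, so pruning has to be strictly an optimization.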
2) I would like to parse the filter.expr.serialized XML, and I assume that there is some SAX, DOM or even XSLT machinery already in Hive. Could you give me a pointer to which classes are used (JAXP, Xerces, Xalan?) or where they are used? Not important, I'm just being lazy.

3) I really want to do my filtering in the getSplits of my custom InputFormat. However, I have found that my getSplits is not being called. (I asked about this on the list before.) I have found that if I do

    set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

then my method is invoked. It seems to be something to do with avoiding the use of the org.apache.hadoop.hive.ql.io.CombineHiveInputFormat class. However, I don't know whether there are any other bad things that will happen if I make this change, as I don't really know what I'm doing. Is this a safe thing to do?

There are some other (less important) problems which I will ask about under separate cover. However, I would like to say thanks again. If we ever meet in the real world I'll stand you a beer (or equivalent).

Congratulations on version 0.11.0.

Z
aka Peter Marron
Trillium Software UK Limited
Tel: +44 (0) 118 940 7609
Fax: +44 (0) 118 940 7699
E: peter.mar...@trilliumsoftware.com
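P.S. On question (2): my understanding (worth verifying against the 0.11 source, and I believe later releases switched this serialization to Kryo) is that Utilities.serializeExpression writes the expression with java.beans.XMLEncoder, so the matching plain-JDK java.beans.XMLDecoder reads it back with no direct need for Xerces or Xalan. A self-contained round-trip sketch, using an ArrayList as a stand-in for the real ExprNodeDesc tree:

```java
import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;

public class ExprXmlRoundTrip {

    // Serialize an object graph the way Utilities.serializeExpression appears
    // to in 0.11: java.beans.XMLEncoder, no extra XML libraries involved.
    static String toXml(Object o) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (XMLEncoder enc = new XMLEncoder(out)) {
            enc.writeObject(o);
        }
        return out.toString();
    }

    // Read it back with the matching stdlib XMLDecoder.
    static Object fromXml(String xml) {
        try (XMLDecoder dec = new XMLDecoder(
                new ByteArrayInputStream(xml.getBytes()))) {
            return dec.readObject();
        }
    }

    public static void main(String[] args) {
        // Stand-in for a real filter expression tree.
        ArrayList<String> expr = new ArrayList<String>(
                Arrays.asList("key", ">", "100"));
        Object back = fromXml(toXml(expr));
        System.out.println(back.equals(expr)); // the round trip preserves the object
    }
}
```

So parsing the value of hive.io.filter.expr.serialized should just be a matter of feeding the string to XMLDecoder, assuming the Hive classes it mentions are on the classpath.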