On Sun, May 19, 2013 at 3:11 PM, Peter Marron <peter.mar...@trilliumsoftware.com> wrote:

> Hi Owen,
>
> Firstly I want to say a huge thank you. You have really helped me
> enormously.

You're welcome.

> OK. I think that I get it now. In my custom InputFormat I can read the
> config settings
>
> JobConf.get(“"hive.io.filter.text"”);
> JobConf.get(“"hive.io.filter.expr.serialized"”);

Well, you don't need double quotes, but yes.

> And so I can then find the predicate that I need to do the filtering.
> In particular I can set the input splits so that it just reads the right
> records.

Right. You want the serialized one, because there is an API to convert it back to a data structure.

> 1) I didn’t know about HIVE-2925 and I would never have thought that
> suppressing the Map/Reduce would be controlled by something called
> “hive.fetch.task.conversion”
> So maybe I’m missing a trick. How should I have found out about HIVE-2925?

There isn't a "trick" other than being willing to ask on the user lists and use your favorite search engine. As Hive developers, we absolutely need to make more things happen automatically and reduce the need to know specific magic incantations. Or at least document the magic incantations. *smile*

> 2) I would like to parse the filter.expr.serialized XML and I assume that
> there’s some SAX, DOM or even XSLT already in HIVE. Could you give me a
> pointer to which classes are used (JAXP, Xerces, Xalan?) or where they are
> being used? Not important, I’m just being lazy.

If you look at pushFilters, it is using Utilities.serializeExpression, so Utilities.deserializeExpression will reverse it.

> 3) I really want to do my filtering in the getSplits of my custom
> InputFormat. However I have found that my getSplits is not being called.
> (And I asked about this on the list before.) I have found that if I do this
>
> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat
>
> then my method is invoked. It seems to be something to do with avoiding
> the use of the org.apache.hadoop.hive.ql.io.CombineHiveInputFormat class.
> However I don’t know whether there are any other bad things that will
> happen if I make this change as I don’t really know what I’m doing.
> Is this a safe thing to do?

Yes, that is a fine thing to do. It does mean that you'll need to ensure you don't have too many maps, but other than that you should be ok. The primary purpose of CombineHiveInputFormat is to allow Mappers to read from multiple files.

> However I would like to say thanks again. If we ever meet in the real world
> I’ll stand you a beer (or equivalent).

Sounds good, although I'll take the equivalent, since I don't enjoy alcohol.

> Congratulations on version 0.11.0.

Thanks!

-- Owen
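
For anyone following the thread, here is a rough sketch of the property lookup discussed above. It is an illustration under assumptions, not Hive code: a plain Map stands in for JobConf so the example runs without Hive or Hadoop on the classpath, and the class and method names (PushedFilterLookup, serializedFilter, filterText) are hypothetical. Only the two property names come from the thread. In a real InputFormat you would read the same keys with JobConf.get and hand the serialized value to Utilities.deserializeExpression.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/**
 * Sketch only: a Map stands in for Hadoop's JobConf. Like JobConf.get,
 * Map.get returns null when a key is absent, which is how the sketch
 * models the case where no predicate was pushed down.
 */
final class PushedFilterLookup {
    // Property names quoted in the thread.
    static final String FILTER_TEXT = "hive.io.filter.text";
    static final String FILTER_EXPR = "hive.io.filter.expr.serialized";

    /** Returns the serialized predicate, or empty if the key is missing or blank. */
    static Optional<String> serializedFilter(Map<String, String> conf) {
        return nonEmpty(conf.get(FILTER_EXPR));
    }

    /** Returns the human-readable predicate text, which is handy for logging. */
    static Optional<String> filterText(Map<String, String> conf) {
        return nonEmpty(conf.get(FILTER_TEXT));
    }

    private static Optional<String> nonEmpty(String value) {
        return (value == null || value.isEmpty()) ? Optional.empty() : Optional.of(value);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // Both values below are made-up placeholders, not real Hive output.
        conf.put(FILTER_TEXT, "(id = 42)");
        conf.put(FILTER_EXPR, "<serialized-expression/>");

        System.out.println(filterText(conf).orElse("(no filter)"));
        System.out.println(serializedFilter(conf).isPresent());
        // An empty configuration models a query with no pushed-down predicate.
        System.out.println(serializedFilter(new HashMap<>()).isPresent());
    }
}
```

Returning an Optional keeps the "no filter was pushed down" case explicit, so getSplits can fall back to reading everything when the key is missing.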