Hi Owen,

Firstly I want to say a huge thank you. You have helped me enormously. I realize that you have been busy with other things (like release 0.11.0), so I appreciate you taking time out to help me.
> The critical piece is in OpProcFactory where the setFilterExpression is called.
>
> OpProcFactory.pushFilterToStorageHandler
>   calls tableScanDesc.setFilterExpr
>   passes to TableScanDesc.getFilterExpr
>   which is called by HiveInputFormat.pushFilters
>
> HiveInputFormat.pushFilters uses Utilities.serializeExpression to put it into
> the configuration.
>
> Unless something is screwing it up, it looks like it hangs together.

OK, I think that I get it now. In my custom InputFormat I can read the config settings

    jobConf.get("hive.io.filter.text");
    jobConf.get("hive.io.filter.expr.serialized");

and so find the predicate that I need to do the filtering. In particular, I can set the input splits so that only the right records are read.

> Really? With ORC, allowing the reader to skip over rows that don't matter is
> very important. Keeping Hive from rechecking the predicate is a nice to have.

Of course, you're right. It doesn't matter if the predicate is applied again to records that are already filtered. I meant that I couldn't afford to leave the filter in place, as that would mean a Map/Reduce job would run. But...

> There has been some work to add additional queries
> (https://issues.apache.org/jira/browse/HIVE-2925),
> but if what you want is to run locally without MR, yeah, getting the
> predicate into the RecordReader isn't enough.
>
> I haven't looked through HIVE-2925 to see what is supported, but that is where
> I'd start.
>
> -- Owen

You're right! HIVE-2925 is exactly what I want, and now that I have found out how to make it work

    set hive.fetch.task.conversion=more;

I am really in good shape. Thanks.

There are a couple of quick questions that I would like to know the answers to, though.

1) I didn't know about HIVE-2925, and I would never have guessed that suppressing the Map/Reduce would be controlled by something called "hive.fetch.task.conversion", so maybe I'm missing a trick. How should I have found out about HIVE-2925?
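To make the getSplits plan above concrete, here is a minimal sketch of the pruning logic. Everything Hive-specific is a stand-in: java.util.Properties plays the role of org.apache.hadoop.mapred.JobConf (whose get(String) behaves the same for these lookups), plain strings play the role of InputSplits, and splitMightMatch is a hypothetical placeholder for evaluating the filter against per-split metadata.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Sketch only: not a real InputFormat, just the pruning decision.
public class SplitPruningSketch {

    // The keys that HiveInputFormat.pushFilters writes into the job configuration.
    static final String FILTER_TEXT = "hive.io.filter.text";
    static final String FILTER_EXPR = "hive.io.filter.expr.serialized";

    // Keep only the splits that could match the pushed-down filter.
    // With no pushed filter, every split must be returned unchanged.
    static List<String> getSplits(Properties conf, List<String> allSplits) {
        String filterText = conf.getProperty(FILTER_TEXT);
        if (filterText == null) {
            return allSplits; // nothing pushed down: read everything
        }
        List<String> kept = new ArrayList<String>();
        for (String split : allSplits) {
            if (splitMightMatch(split, filterText)) {
                kept.add(split);
            }
        }
        return kept;
    }

    // Hypothetical pruning test: a real implementation would deserialize the
    // expression from FILTER_EXPR and evaluate it against per-split metadata
    // (min/max keys and the like), not do a string comparison.
    static boolean splitMightMatch(String split, String filterText) {
        return filterText.contains(split);
    }
}
```

The important behaviour is the null check: when Hive has pushed nothing down, the InputFormat must still return all splits, so pruning has to be strictly an optimization.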
2) I would like to parse the filter.expr.serialized XML, and I assume that there is some SAX, DOM or even XSLT machinery already in Hive. Could you give me a pointer to which classes are used (JAXP, Xerces, Xalan?) or where they are used? Not important, I'm just being lazy.

3) I really want to do my filtering in the getSplits of my custom InputFormat. However, I have found that my getSplits is not being called. (I asked about this on the list before.) I have found that if I do

    set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

then my method is invoked. It seems to be something to do with avoiding the use of the org.apache.hadoop.hive.ql.io.CombineHiveInputFormat class. However, I don't know whether there are any other bad things that will happen if I make this change, as I don't really know what I'm doing. Is this a safe thing to do?

There are some other (less important) problems which I will ask about under separate cover. However, I would like to say thanks again. If we ever meet in the real world I'll stand you a beer (or equivalent).

Congratulations on version 0.11.0.

Z
aka Peter Marron
Trillium Software UK Limited
Tel: +44 (0) 118 940 7609
Fax: +44 (0) 118 940 7699
E: peter.mar...@trilliumsoftware.com
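P.S. On question (2): my understanding (worth verifying against the 0.11 source, and I believe later releases switched this serialization to Kryo) is that Utilities.serializeExpression writes the expression with java.beans.XMLEncoder, so the matching plain-JDK java.beans.XMLDecoder reads it back with no direct need for Xerces or Xalan. A self-contained round-trip sketch, using an ArrayList as a stand-in for the real ExprNodeDesc tree:

```java
import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;

public class ExprXmlRoundTrip {

    // Serialize an object graph the way Utilities.serializeExpression appears
    // to in 0.11: java.beans.XMLEncoder, no extra XML libraries involved.
    static String toXml(Object o) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (XMLEncoder enc = new XMLEncoder(out)) {
            enc.writeObject(o);
        }
        return out.toString();
    }

    // Read it back with the matching stdlib XMLDecoder.
    static Object fromXml(String xml) {
        try (XMLDecoder dec = new XMLDecoder(
                new ByteArrayInputStream(xml.getBytes()))) {
            return dec.readObject();
        }
    }

    public static void main(String[] args) {
        // Stand-in for a real filter expression tree.
        ArrayList<String> expr = new ArrayList<String>(
                Arrays.asList("key", ">", "100"));
        Object back = fromXml(toXml(expr));
        System.out.println(back.equals(expr)); // the round trip preserves the object
    }
}
```

So parsing the value of hive.io.filter.expr.serialized should just be a matter of feeding the string to XMLDecoder, assuming the Hive classes it mentions are on the classpath.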