Hi,

I'm using Hive 0.10.0 and Hadoop 1.0.4.

I would like to create a normal table, but have some of my own code run so that I
can remove the filtering parts of the query and limit the output in the splits of
the InputFormat. I believe that this is "filter pushdown" as described in
https://cwiki.apache.org/Hive/filterpushdowndev.html
I have tried various approaches, run into problems with each, and I was wondering
if anyone had any suggestions as to how I might proceed.

Firstly, although that page mentions InputFormat, there doesn't seem to be any way
(that I can find) to pass a filter directly to an InputFormat, so I gave up on
that approach.
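
The closest thing I can see is that, once a filter has already been pushed by some
other mechanism, HiveInputFormat.pushFilters publishes it in the job conf, where an
InputFormat could pick it up in getSplits. A rough, untested sketch, modelled on
what HiveHBaseTableInputFormat does in 0.10 (the class name is mine and the actual
pruning logic is left as a comment):

    import java.io.IOException;

    import org.apache.hadoop.hive.ql.exec.Utilities;
    import org.apache.hadoop.hive.ql.plan.ExprNodeDesc;
    import org.apache.hadoop.hive.ql.plan.TableScanDesc;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class FilterAwareInputFormat extends TextInputFormat {

      @Override
      public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
        // HiveInputFormat.pushFilters sets this property (and a human-readable
        // companion, TableScanDesc.FILTER_TEXT_CONF_STR) when a predicate has
        // been pushed down to this table scan.
        String serialized = job.get(TableScanDesc.FILTER_EXPR_CONF_STR);
        if (serialized != null) {
          ExprNodeDesc filterExpr = Utilities.deserializeExpression(serialized, job);
          // ... inspect filterExpr here and prune or shrink the splits ...
        }
        return super.getSplits(job, numSplits);
      }
    }

But those properties only appear once pushdown has already succeeded by some other
route, so this doesn't help on its own.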

I see that in the method pushFilterToStorageHandler, around line 776 of
OpProcFactory.java, there is a call to the decomposePredicate method of the storage
handler if it's an instance of HiveStoragePredicateHandler.
This looks exactly like what I want. So my first approach is to provide my own
storage handler.
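
To make it concrete, the shape I have in mind is roughly this - just a sketch:
the column name and the equality-only restriction are placeholder choices of mine,
and the IndexPredicateAnalyzer usage follows the pattern in HBaseStorageHandler
as of 0.10:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.hive.ql.index.IndexPredicateAnalyzer;
    import org.apache.hadoop.hive.ql.index.IndexSearchCondition;
    import org.apache.hadoop.hive.ql.metadata.DefaultStorageHandler;
    import org.apache.hadoop.hive.ql.metadata.HiveStoragePredicateHandler;
    import org.apache.hadoop.hive.ql.plan.ExprNodeDesc;
    import org.apache.hadoop.hive.serde2.Deserializer;
    import org.apache.hadoop.mapred.JobConf;

    public class StorageWindow extends DefaultStorageHandler
        implements HiveStoragePredicateHandler {

      public DecomposedPredicate decomposePredicate(
          JobConf jobConf, Deserializer deserializer, ExprNodeDesc predicate) {
        // Accept only equality comparisons on one column (placeholder choices);
        // everything else comes back as a residual for Hive to evaluate itself.
        IndexPredicateAnalyzer analyzer = new IndexPredicateAnalyzer();
        analyzer.addComparisonOp(
            "org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPEqual");
        analyzer.allowColumnName("italian");

        List<IndexSearchCondition> conditions =
            new ArrayList<IndexSearchCondition>();
        ExprNodeDesc residual = analyzer.analyzePredicate(predicate, conditions);
        if (conditions.isEmpty()) {
          return null; // nothing we can push down
        }
        DecomposedPredicate decomposed = new DecomposedPredicate();
        decomposed.pushedPredicate =
            analyzer.translateSearchConditions(conditions);
        decomposed.residualPredicate = residual;
        return decomposed;
      }
    }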

However, when I create a table with my own custom storage handler I get this
behaviour:

                hive> DROP TABLE ordinals2;
                OK
                Time taken: 0.136 seconds
                hive> CREATE TABLE ordinals2 (english STRING, number INT, italian STRING)
                >  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
                >  STORED BY 'StorageWindow';
                getInputFormatClass
                getOutputFormatClass
                OK
                Time taken: 0.075 seconds
                hive> LOAD DATA LOCAL INPATH 'ordinals.txt' OVERWRITE INTO TABLE ordinals2;
                FAILED: SemanticException [Error 10101]: A non-native table cannot be used as target for LOAD
                hive>

However, I don't care whether the table is "non-native" or not; I just want to use
it in the same way that I use a normal table. Is there some way that I can get the
normal behaviour and still use a custom storage handler?
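
The only workaround I've found for the LOAD restriction is to stage the data in a
native table and copy it across with an INSERT, assuming the handler's OutputFormat
can accept writes (the staging table name here is just for illustration):

                hive> CREATE TABLE ordinals_staging (english STRING, number INT, italian STRING)
                >  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
                hive> LOAD DATA LOCAL INPATH 'ordinals.txt' OVERWRITE INTO TABLE ordinals_staging;
                hive> INSERT OVERWRITE TABLE ordinals2 SELECT * FROM ordinals_staging;

That works, but it costs an extra copy of the data.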

OK, given that the storage handler approach doesn't seem to be working out for me,
I tried the next obvious approach - indexes. So I created my own index handler
class called 'DummyIndex'. Because I wanted to keep this index as simple as
possible I had it return false from the method usesIndexTable. However, when I
then try to create an index of this type I get this error:

                hive> CREATE INDEX dummy_ordinals ON TABLE ordinals (italian)
                >  AS 'DummyIndex' WITH DEFERRED REBUILD;
                FAILED: Error in metadata: java.lang.NullPointerException
                FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
                hive>

A little investigation suggests that this is failing in the method add_index,
lines 2796-2822 of HiveMetaStore.java. There we can see that the argument
indexTable is dereferenced on line 2798 in what looks like a tracing statement,
which gives the NPE if the argument is null.
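
Paraphrasing the shape of the offending code (this is from memory of the 0.10
source, not a verbatim copy):

    // indexTable arrives as null whenever the index handler returns false
    // from usesIndexTable(), so the trace below dereferences null.
    startFunction("add_index", ": db=" + newIndex.getDbName()
        + " base tbl=" + newIndex.getOrigTableName()
        + " idx=" + newIndex.getIndexName()
        + " idx tbl=" + indexTable.getTableName()); // NPE when indexTable == null

    // A simple null guard in the trace would be enough to get past this:
    //   (indexTable == null) ? "null" : indexTable.getTableName()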

If we look at the method createIndex, lines 614-759 of Hive.java, we can see that
the variable tt is initialised to null on line 722 and passed to the metastore's
createIndex method on line 754. But it can only change to a non-null value if the
index handler returns true from usesIndexTable on line 725.
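
Condensed, the flow looks like this (paraphrased; buildIndexTable is just a
stand-in name for the real construction code):

    Table tt = null;                     // ~line 722
    if (indexHandler.usesIndexTable()) { // ~line 725: the only assignment to tt
      tt = buildIndexTable();            // stand-in for the real code
    }
    // ...
    getMSC().createIndex(indexDesc, tt); // ~line 754: tt is still null for my handler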

The bottom line seems to be that the code can't work if you have an index handler
that returns false from usesIndexTable. Isn't this a bug? Do I need to raise a
JIRA?

If I change my index so that it returns true from usesIndexTable, but still try to
keep all the other methods minimal, I get this error when I try to create the
index:

                hive> CREATE INDEX dummy_ordinals ON TABLE ordinals (italian)
                >  AS 'DummyIndex' WITH DEFERRED REBUILD;
                FAILED: Error in metadata: at least one column must be specified for the table
                FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
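
Poking around in CompactIndexHandler suggests why: its analyzeIndexDefinition gives
the index table its columns, which my minimal handler never does. Roughly (adapted
from the compact handler in 0.10, so treat the details as approximate):

    import java.util.List;

    import org.apache.hadoop.hive.metastore.api.FieldSchema;
    import org.apache.hadoop.hive.metastore.api.Index;
    import org.apache.hadoop.hive.metastore.api.StorageDescriptor;
    import org.apache.hadoop.hive.metastore.api.Table;
    import org.apache.hadoop.hive.ql.metadata.HiveException;

    public void analyzeIndexDefinition(Table baseTable, Index index,
        Table indexTable) throws HiveException {
      StorageDescriptor storageDesc = index.getSd();
      if (usesIndexTable() && indexTable != null) {
        // Without something like this the index table ends up with no columns
        // at all, which is presumably what trips the "at least one column"
        // check in the DDL task.
        StorageDescriptor indexTableSd = storageDesc.deepCopy();
        List<FieldSchema> indexTblCols = indexTableSd.getCols();
        indexTblCols.add(new FieldSchema("_bucketname", "string", ""));
        indexTblCols.add(new FieldSchema("_offsets", "array<bigint>", ""));
        indexTable.setSd(indexTableSd);
      }
    }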

So, at the moment, it looks like I'm going to have to derive my index handler from
one of the reference implementations
(org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler) in order to get it
to work at all.
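
That is, something as minimal as this, inheriting all the metadata plumbing that my
stand-alone handler was missing:

    import org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler;

    // Inherits analyzeIndexDefinition and the index-table bookkeeping, which
    // avoids both the NPE and the "at least one column" failure above. Which
    // methods I actually need to override is still an open question.
    public class DummyIndex extends CompactIndexHandler {
    }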

Is it worth trying to build Hive from source so that I can hack out the tracing 
that causes the NPE?
Or is it likely to start failing somewhere else?

Any comments welcome.

Peter Marron
Trillium Software UK Limited

Tel : +44 (0) 118 940 7609
Fax : +44 (0) 118 940 7699
E: peter.mar...@trilliumsoftware.com
