Hi, I'm using Hive 0.10.0 and Hadoop 1.0.4.

I would like to create a normal table, but have some of my own code run so that I can remove the filtering parts of the query and limit the output in the splits of the InputFormat. I believe that this is "Filter Pushdown" as described in https://cwiki.apache.org/Hive/filterpushdowndev.html. I have tried various approaches, run into problems with each, and I was wondering if anyone had suggestions as to how I might proceed.

Firstly, although that page mentions InputFormat, there doesn't seem to be any way (that I can find) to pass a filter through to an InputFormat, so I gave up on that approach.

I see that in the method pushFilterToStorageHandler, around line 776 of OpProcFactory.java, there is a call to the decomposePredicate method of the storage handler if it is an instance of HiveStoragePredicateHandler. This looks like exactly what I want, so my first approach is to provide my own StorageHandler.
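For reference, this is roughly the shape of what I'm writing, modelled on what the HBase storage handler does with IndexPredicateAnalyzer. It's only a sketch: the equality-only comparison and the hard-coded 'italian' column are placeholders, and I've left out the InputFormat/split plumbing that would do the actual limiting.

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.hive.ql.index.IndexPredicateAnalyzer;
    import org.apache.hadoop.hive.ql.index.IndexSearchCondition;
    import org.apache.hadoop.hive.ql.metadata.DefaultStorageHandler;
    import org.apache.hadoop.hive.ql.metadata.HiveStoragePredicateHandler;
    import org.apache.hadoop.hive.ql.plan.ExprNodeDesc;
    import org.apache.hadoop.hive.serde2.Deserializer;
    import org.apache.hadoop.mapred.JobConf;

    public class StorageWindow extends DefaultStorageHandler
        implements HiveStoragePredicateHandler {

      @Override
      public DecomposedPredicate decomposePredicate(JobConf jobConf,
          Deserializer deserializer, ExprNodeDesc predicate) {
        // Accept only equality comparisons on the 'italian' column;
        // everything else is left behind as a residual for Hive to apply.
        IndexPredicateAnalyzer analyzer = new IndexPredicateAnalyzer();
        analyzer.addComparisonOp(
            "org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPEqual");
        analyzer.allowColumnName("italian");

        List<IndexSearchCondition> searchConditions =
            new ArrayList<IndexSearchCondition>();
        ExprNodeDesc residual =
            analyzer.analyzePredicate(predicate, searchConditions);
        if (searchConditions.isEmpty()) {
          return null; // nothing we can push down
        }

        DecomposedPredicate decomposed = new DecomposedPredicate();
        decomposed.pushedPredicate =
            analyzer.translateSearchConditions(searchConditions);
        decomposed.residualPredicate = residual;
        return decomposed;
      }
    }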
However, when I create a table with my own custom storage handler I get this behaviour:

    hive> DROP TABLE ordinals2;
    OK
    Time taken: 0.136 seconds
    hive> CREATE TABLE ordinals2 (english STRING, number INT, italian STRING)
        > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
        > STORED BY 'StorageWindow';
    getInputFormatClass
    getOutputFormatClass
    OK
    Time taken: 0.075 seconds
    hive> LOAD DATA LOCAL INPATH 'ordinals.txt' OVERWRITE INTO TABLE ordinals2;
    FAILED: SemanticException [Error 10101]: A non-native table cannot be used as target for LOAD
    hive>

However, I don't care whether the table is "non-native" or not; I just want to use it in the same way that I use a normal table. Is there some way that I can get the normal behaviour and still use a custom storage handler?

OK, given that the StorageHandler approach doesn't seem to be working out for me, I tried the next obvious approach: indexes. So I created my own index class called 'DummyIndex'. Because I wanted to keep this index as simple as possible, I had it return false from the usesIndexTable method. However, when I then try to create an index of this type I get this error:

    hive> CREATE INDEX dummy_ordinals ON TABLE ordinals (italian)
        > AS 'DummyIndex' WITH DEFERRED REBUILD;
    FAILED: Error in metadata: java.lang.NullPointerException
    FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
    hive>

A little investigation suggests that this is failing in the method add_index, lines 2796-2822 of HiveMetaStore.java. There the argument indexTable is dereferenced on line 2798 in what looks like a tracing statement, and this gives the NPE when the argument is null. If we look at the method createIndex, lines 614-759 of Hive.java, we can see that the variable tt is initialised to null on line 722 and passed to the createIndex call on line 754, but it can only change to a non-null value if the index handler returns true from usesIndexTable on line 725. The bottom line seems to be that the code cannot work with an index handler that returns false for usesIndexTable. Isn't this a bug? Do I need to raise a JIRA?
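In case anyone wants to reproduce this, I think the smallest handler that should trigger the same NPE is a subclass of the reference implementation with just usesIndexTable flipped. (A sketch only; my actual DummyIndex implements the handler interface directly, but by the reading above the effect should be the same.)

    import org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler;

    // Derive from the reference implementation and override only
    // usesIndexTable(). Returning false means Hive.createIndex() never
    // creates an index table, so a null is handed to HiveMetaStore's
    // add_index, where the tracing statement dereferences it.
    public class DummyIndex extends CompactIndexHandler {
      @Override
      public boolean usesIndexTable() {
        return false;
      }
    }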
If I change my index so that it returns true for usesIndexTable, but still try to keep all the other methods minimal, I get this error when I try to create an index:

    hive> CREATE INDEX dummy_ordinals ON TABLE ordinals (italian)
        > AS 'DummyIndex' WITH DEFERRED REBUILD;
    FAILED: Error in metadata: at least one column must be specified for the table
    FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

So, at the moment, it looks like I'm going to have to derive my index class from one of the reference implementations (org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler) in order to get it to work at all. Is it worth building Hive from source so that I can hack out the tracing statement that causes the NPE? Or is it likely to just start failing somewhere else?

Any comments welcome.

Peter Marron
Trillium Software UK Limited

Tel : +44 (0) 118 940 7609
Fax : +44 (0) 118 940 7699
E: peter.mar...@trilliumsoftware.com