Hi Navis (and others),

Seems like my solution with views does not work after all. That is, it works fine as long as I do not use filter pushdown. My setup is something like below:

CREATE EXTERNAL TABLE MasterTable (
  column1 STRING,
  column2 STRING,
  column3 STRING,
  column4 STRING)
PARTITIONED BY (partition INT)
ROW FORMAT SERDE 'MySerde'
STORED AS INPUTFORMAT 'MyInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'my_location';

CREATE VIEW IF NOT EXISTS View1 (column1, column2, column3, column4, partition)
PARTITIONED ON (partition) AS
SELECT column1, column2, column3, column4, partition
FROM MasterTable
WHERE column1='value1' AND column2='value2';

CREATE VIEW IF NOT EXISTS View2 (column1, column2, column3, column4, partition)
PARTITIONED ON (partition) AS
SELECT column1, column2, column3, column4, partition
FROM MasterTable
WHERE column1='value3' AND column2='value4';

The following query works fine without filter pushdown:

SELECT View1.* FROM View1 JOIN View2 ON (View1.column3 = View2.column3);

Now if I enable filter pushdown (setting hive.optimize.index.filter=true) and apply the filter in my record reader, I do not get the correct result. In fact I do not get any records back at all. It seems like only the second filter (column1='value3' AND column2='value4') is pushed to my record reader, and the underlying file is only traversed once. I would have expected either that an OR expression was pushed down ((column1='value3' AND column2='value4') OR (column1='value1' AND column2='value2')) or that the underlying file was scanned twice, once with each separate expression.

Do you have any thoughts on this?

Thanks,
Petter
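A quick way to confirm which predicate Hive actually hands to the record reader, independent of the custom InputFormat, is to compare the extended plans with pushdown disabled and enabled: the TableScan operators in the plan list the pushed predicate as filterExpr. The statements below are only a diagnostic sketch against the tables above; the exact plan text varies between Hive versions.

SET hive.optimize.index.filter=false;
EXPLAIN EXTENDED
SELECT View1.* FROM View1 JOIN View2 ON (View1.column3 = View2.column3);

SET hive.optimize.index.filter=true;
EXPLAIN EXTENDED
SELECT View1.* FROM View1 JOIN View2 ON (View1.column3 = View2.column3);

-- In the second plan, look at the TableScan operator(s) over MasterTable and
-- compare their filterExpr entries with what the record reader receives. If
-- only (column1 = 'value3' and column2 = 'value4') survives, that matches the
-- behaviour described above.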
2013/12/22 Petter von Dolwitz (Hem) <petter.von.dolw...@gmail.com>

> Hi Navis,
>
> thank you for sorting this out! I have tried to get around this by using
> views against a single master table in combination with UDFs instead.
> Seems to work so far.
>
> /Petter
>
>
> 2013/12/18 Navis류승우 <navis....@nexr.com>
>
>> Hive uses a path-to-table (or partition) mapping internally (you can see
>> that in MapredWork, etc.), which might cause the first table to be
>> overwritten by the other.
>>
>> I didn't try symlinks on HDFS, which could be a solution.
>>
>>
>> 2013/12/12 Petter von Dolwitz (Hem) <petter.von.dolw...@gmail.com>
>>
>>> Hi,
>>>
>>> I have declared several external tables pointing to the same location.
>>> The things that tell these tables apart (apart from their names) are
>>> their unique properties. These properties help me choose the correct
>>> rows from the underlying file. I use a single storage handler
>>> (accompanied by a single InputFormat and a single SerDe). The first
>>> columns in all tables are the same, but the last (a struct) is unique
>>> and is constructed by the SerDe (with help of the SerDe properties). A
>>> simplified version of the tables looks like so:
>>>
>>> CREATE EXTERNAL TABLE Table1 (
>>>   column1 STRING,
>>>   column2 STRING)
>>> STORED BY 'MyStorageHandler'
>>> WITH SERDEPROPERTIES ('ser.class'='MyStructSerializationClass1')
>>> LOCATION 'mylocation'
>>> TBLPROPERTIES ('recordreader.filter'='table1_filter');
>>>
>>> CREATE EXTERNAL TABLE Table2 (
>>>   column1 STRING,
>>>   column2 STRING)
>>> STORED BY 'MyStorageHandler'
>>> WITH SERDEPROPERTIES ('ser.class'='MyStructSerializationClass2')
>>> LOCATION 'mylocation'
>>> TBLPROPERTIES ('recordreader.filter'='table2_filter');
>>>
>>> All works well for simple SELECT queries against the two tables.
>>> The following query gives very strange results though:
>>>
>>> SELECT * FROM (
>>>   SELECT column1, 'Table1' FROM Table1 WHERE column2 = 'myValue'
>>>   UNION ALL
>>>   SELECT column1, 'Table2' FROM Table2 WHERE column2 = 'myValue'
>>> ) my_union
>>> ORDER BY my_union.column1
>>>
>>> It seems like one job task is created per file stored in the table
>>> location. This task gets the table properties from the second table, and
>>> in the SerDe step later on it seems like the records get mixed up.
>>>
>>> I would have expected that Hive would need to iterate over the source
>>> files twice, using two different tasks (with the correct table properties
>>> passed), in order to get this to work.
>>>
>>> Can anyone here shed some light on this scenario?
>>>
>>> Thanks,
>>> Petter
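For the same-location scenario above, the path-to-table mapping that Navis mentions can be inspected without reading MapredWork: EXPLAIN EXTENDED should print Path -> Alias and Path -> Partition sections for the map-reduce stage, showing which table descriptor each input path will be read with. The query below is only a diagnostic sketch; section names may differ slightly between Hive versions.

EXPLAIN EXTENDED
SELECT * FROM (
  SELECT column1, 'Table1' FROM Table1 WHERE column2 = 'myValue'
  UNION ALL
  SELECT column1, 'Table2' FROM Table2 WHERE column2 = 'myValue'
) my_union
ORDER BY my_union.column1;

-- With Table1 and Table2 both declared on LOCATION 'mylocation', only one
-- partition/table descriptor (and hence one set of table properties and
-- SerDe settings) is recorded for that path in the plan, which is consistent
-- with every split being deserialized as the second table.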