Hi,

I have declared several external tables pointing to the same location. What tells these tables apart (apart from their names) is that each has its own unique properties. These properties help me choose the correct rows from the underlying files. I use a single storage handler (accompanied by a single InputFormat and a single SerDe). The first columns in all tables are the same, but the last one (a struct) is unique per table and is constructed by the SerDe (with help of the serde properties).
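Roughly, the SerDe picks up that per-table property in initialize() along these lines (a heavily simplified sketch, just to illustrate where the serde properties are consumed; class and field names here are illustrative, the real class of course does more):

    import java.util.Properties;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hive.serde2.AbstractSerDe;
    import org.apache.hadoop.hive.serde2.SerDeException;

    // Illustrative sketch only; remaining SerDe methods
    // (getObjectInspector, deserialize, ...) are omitted.
    public abstract class MySerDeSketch extends AbstractSerDe {

        private String structSerializationClass;

        @Override
        public void initialize(Configuration conf, Properties tableProperties)
                throws SerDeException {
            // 'ser.class' comes from WITH SERDEPROPERTIES of the table being
            // read and decides how the last struct column is built.
            structSerializationClass = tableProperties.getProperty("ser.class");
        }
    }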
A simplified version of the tables looks like this:

    CREATE EXTERNAL TABLE Table1 (
      column1 STRING,
      column2 STRING)
    STORED BY 'MyStorageHandler'
    WITH SERDEPROPERTIES ('ser.class'='MyStructSerializationClass1')
    LOCATION 'mylocation'
    TBLPROPERTIES ('recordreader.filter'='table1_filter');

    CREATE EXTERNAL TABLE Table2 (
      column1 STRING,
      column2 STRING)
    STORED BY 'MyStorageHandler'
    WITH SERDEPROPERTIES ('ser.class'='MyStructSerializationClass2')
    LOCATION 'mylocation'
    TBLPROPERTIES ('recordreader.filter'='table2_filter');

Simple SELECT queries against each of the two tables work fine. The following query, however, gives very strange results:

    SELECT *
    FROM (
      SELECT column1, 'Table1' FROM Table1 WHERE column2 = 'myValue'
      UNION ALL
      SELECT column1, 'Table2' FROM Table2 WHERE column2 = 'myValue'
    ) my_union
    ORDER BY my_union.column1

It seems that one task is created per file stored in the table location. That task gets the table properties of the second table only, and in the SerDe step later on the records seem to get mixed up. I would have expected Hive to iterate over the source files twice, using two sets of tasks (each with the correct table properties passed in), in order for this to work.

Can anyone here shed some light on this scenario?

Thanks,
Petter