Hi,

I have declared several external tables pointing to the same location. The
things that tells these tables apart (apart from their names) is that they
have unique properties. These properties help me choose the correct rows
from the underlying file. I use a single storage handler (accompanied by a
single InputFormat and a single Serde) . The first columns in all tables
are the same but the last (a struct) is unique and
is constructed from the Serde (with help of the serde properties). A
simplified version of the tables look like so:

CREATE EXTERNAL TABLE Table1 (
  column1 STRING,
  column2 STRING)
  STORED BY 'MyStorageHandler'
  WITH SERDEPROPERTIES ('ser.class'='MyStructSerializationClass1')
  LOCATION 'mylocation'
  TBLPROPERTIES('recordreader.filter'='table1_filter');

CREATE EXTERNAL TABLE Table2 (
  column1 STRING,
  column2 STRING)
  STORED BY 'MyStorageHandler'
  WITH SERDEPROPERTIES ('ser.class'='MyStructSerializationClass2')
  LOCATION 'mylocation'
  TBLPROPERTIES('recordreader.filter'='table2_filter');


All works well for simple select queries towards the two tables. The
following query gives very strange results though:

SELECT * FROM (
  SELECT column1,'Table1' FROM Table1 WHERE column2 = 'myValue'
  union all
  SELECT column1,'Table2' FROM Table2 WHERE column2 = 'myValue'
  ) my_union
ORDER BY my_union.column1


It seems like one job task is created per file stored in the table
location. This task gets the table properties from the second table and in
the SerDe-step later on it seems like the records gets mixed up.

I would have expected that hive would need to iterated the source files two
times using two different tasks (with the correct table properties passed)
in order to get this to work.

Anyone here that can shed some light on this scenario?

Thanks,
Petter

Reply via email to