Hi Navis (and others),

it seems like my solution with views does not work after all. That is, it
works fine as long as I do not use filter pushdown. My setup is something
like the following:

CREATE EXTERNAL TABLE MasterTable (
  column1 STRING,
  column2 STRING,
  column3 STRING,
  column4 STRING)
  PARTITIONED BY (part INT)
  ROW FORMAT SERDE 'MySerde'
  STORED AS INPUTFORMAT 'MyInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  LOCATION 'my_location';

CREATE VIEW IF NOT EXISTS View1
  (column1, column2, column3, column4, part)
PARTITIONED ON (part)
AS SELECT column1, column2, column3, column4, part
FROM MasterTable
WHERE column1='value1' AND column2='value2';

CREATE VIEW IF NOT EXISTS View2
  (column1, column2, column3, column4, part)
PARTITIONED ON (part)
AS SELECT column1, column2, column3, column4, part
FROM MasterTable
WHERE column1='value3' AND column2='value4';


The following query works fine without filter pushdown:
SELECT View1.* FROM View1 JOIN View2 ON (View1.column3 = View2.column3);

Now if I enable filter pushdown (setting hive.optimize.index.filter=true)
and apply the filter in my record reader, I do not get the correct result;
in fact, I do not get any records back at all. It seems like only the second
filter (column1='value3' AND column2='value4') is pushed to my record
reader, and the underlying file is only traversed once. I would have
expected either that a combined OR expression ((column1='value3' AND
column2='value4') OR (column1='value1' AND column2='value2')) was pushed
down, or that the underlying file was scanned twice, once with each
separate expression.
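
For reference, the predicate that actually gets pushed down can be inspected
in the query plan. A minimal diagnostic sketch (the exact plan layout varies
between Hive versions, but the filterExpr entry on a TableScan operator is
what ends up in the record reader):

SET hive.optimize.index.filter=true;

-- Each TableScan operator in the plan should carry a filterExpr entry
-- showing the predicate that will be handed to its record reader.
EXPLAIN
SELECT View1.* FROM View1 JOIN View2 ON (View1.column3 = View2.column3);

In my case only the View2 predicate shows up, which matches the behaviour
described above.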

Do you have any thoughts on this?

Thanks,
Petter





2013/12/22 Petter von Dolwitz (Hem) <petter.von.dolw...@gmail.com>

> Hi Navis,
>
> thank you for sorting this out! I have tried to get around this by using
> views over a single master table in combination with UDFs instead. It
> seems to work so far.
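>
> For reference, a minimal sketch of that idea (the UDF class and function
> names here are hypothetical):
>
> -- The UDF must be registered in every session that queries the view;
> -- it returns true for the rows that belong to this particular view.
> CREATE TEMPORARY FUNCTION row_matches AS 'com.example.RowMatchesUDF';
>
> CREATE VIEW IF NOT EXISTS UdfView1 AS
> SELECT column1, column2, column3, column4
> FROM MasterTable
> WHERE row_matches(column1, column2);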
>
> /Petter
>
>
> 2013/12/18 Navis류승우 <navis....@nexr.com>
>
>> Hive uses a path-to-table (or partition) mapping internally (you can see
>> that in MapredWork, etc.), which might cause the first table to be
>> overwritten by the other.
>>
>> I haven't tried symlinks on HDFS, which could be a solution.
>>
>>
>>
>> 2013/12/12 Petter von Dolwitz (Hem) <petter.von.dolw...@gmail.com>
>>
>>> Hi,
>>>
>>> I have declared several external tables pointing to the same location.
>>> The thing that tells these tables apart (apart from their names) is that
>>> they have unique properties. These properties help me choose the correct
>>> rows from the underlying file. I use a single storage handler (accompanied
>>> by a single InputFormat and a single SerDe). The first columns in all
>>> tables are the same, but the last one (a struct) is unique and is
>>> constructed by the SerDe (with the help of the SerDe properties). A
>>> simplified version of the tables looks like this:
>>>
>>> CREATE EXTERNAL TABLE Table1 (
>>>   column1 STRING,
>>>   column2 STRING)
>>>   STORED BY 'MyStorageHandler'
>>>   WITH SERDEPROPERTIES ('ser.class'='MyStructSerializationClass1')
>>>   LOCATION 'mylocation'
>>>   TBLPROPERTIES('recordreader.filter'='table1_filter');
>>>
>>> CREATE EXTERNAL TABLE Table2 (
>>>   column1 STRING,
>>>   column2 STRING)
>>>   STORED BY 'MyStorageHandler'
>>>   WITH SERDEPROPERTIES ('ser.class'='MyStructSerializationClass2')
>>>   LOCATION 'mylocation'
>>>   TBLPROPERTIES('recordreader.filter'='table2_filter');
>>>
>>>
>>> All works well for simple SELECT queries against the two tables. The
>>> following query gives very strange results, though:
>>>
>>> SELECT * FROM (
>>>   SELECT column1, 'Table1' FROM Table1 WHERE column2 = 'myValue'
>>>   UNION ALL
>>>   SELECT column1, 'Table2' FROM Table2 WHERE column2 = 'myValue'
>>> ) my_union
>>> ORDER BY my_union.column1;
>>>
>>>
>>> It seems like one job task is created per file stored in the table
>>> location. Each task gets its table properties from the second table, and
>>> in the SerDe step later on the records seem to get mixed up.
>>>
>>> I would have expected Hive to iterate over the source files twice, using
>>> two different tasks (each with the correct table properties passed), in
>>> order to get this to work.
>>>
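>>> For reference, the mapping from input paths to table aliases can be
>>> inspected in the extended plan (a sketch; the exact output layout varies
>>> between Hive versions):
>>>
>>> -- EXPLAIN EXTENDED prints a "Path -> Alias" section. Since Table1 and
>>> -- Table2 share the same location, watch which alias that path ends up
>>> -- mapped to.
>>> EXPLAIN EXTENDED
>>> SELECT * FROM (
>>>   SELECT column1, 'Table1' FROM Table1 WHERE column2 = 'myValue'
>>>   UNION ALL
>>>   SELECT column1, 'Table2' FROM Table2 WHERE column2 = 'myValue'
>>> ) my_union
>>> ORDER BY my_union.column1;
>>>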
>>> Can anyone here shed some light on this scenario?
>>>
>>> Thanks,
>>> Petter
