Hi Navis,

one scan task with the correct filter would, I guess, be most efficient. I
think the filter pushdown mechanism is not working correctly in this case.
It can hardly be too exotic a use case to have two views on a table that,
when queried, yield two different SELECT statements against that table.
Still, the filter pushdown mechanism presents only one of their predicates.
In my mind, the pushed-down filter should either be correct (a common
denominator of the two SELECT statements, i.e. their OR) or empty. Or do
you think I have violated some design rule by designing the database this
way?
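To make the "common denominator" concrete: a scan shared by several views
may only drop a row if no view wants it, so the only safe non-empty
pushdown is the OR of all the per-view predicates. A minimal plain-Java
sketch of that rule, using hypothetical predicates rather than Hive
classes:

import java.util.function.Predicate;

public class SafeSharedScanFilter {
  public static void main(String[] args) {
    // Per-view predicates, as in the example later in this thread.
    Predicate<Integer> view1 = c -> c < 30;
    Predicate<Integer> view2 = c -> c < 100;

    // The only filter that is safe to push into the single shared scan
    // is the disjunction of all consumers' predicates (or no filter).
    Predicate<Integer> safeToPush = view1.or(view2);

    int row = 50; // wanted by View2 but not by View1
    System.out.println(view1.test(row));      // false: pushing only View1's filter loses the row
    System.out.println(safeToPush.test(row)); // true: the OR keeps it for View2
  }
}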
Links are not a viable option in my situation.

Thanks,
Petter

2014/1/16 Navis류승우 <navis....@nexr.com>

> I think it's back to the original problem.
>
> What you wanted is a separate scan (task) per view, but Hive does not
> work like that. If two tables or views (or a mix of them) have the same
> location, they are regarded as the same table with the same table
> description (which will be overridden by the lastly visited table or
> view).
>
> As I suggested in my first reply, a hadoop link might be helpful for
> working around this.
>
> 2014/1/14 Petter von Dolwitz (Hem) <petter.von.dolw...@gmail.com>
>
>> I'm using Hive 0.10 (the version bundled with CDH4.4).
>>
>> The explain at my end looks similar to yours. I guess my real concern
>> is about the way I have implemented the filters.
>>
>> This is how I have done it:
>> - In the constructor of my RecordReader I read the property
>>   hive.io.filter.expr.serialized and use the IndexPredicateAnalyzer to
>>   find out which parts of the filter I can apply in my RecordReader.
>> - I process only the rows that match the filter.
>>
>> Since the filter represented in hive.io.filter.expr.serialized only
>> contains one of the filters (column1 < 100 in the example above), the
>> rows matching the other filter (column1 < 30) are lost. This specific
>> example is overlapping, so the result may not expose the problem
>> (column1 < 30 is covered by column1 < 100). In the example at my end
>> the filters do not overlap.
>>
>> Is the RecordReader the correct place to implement this filter? Should
>> it work, or should the filter integration be done at another level?
>> For the example above, what did you expect hive.io.filter.text to
>> contain?
>>
>> I might add that the tables are partitioned, if that makes any
>> difference. I originally had filter negotiation in place in a
>> StorageHandler, but StorageHandler did not support partitions, so I
>> switched to implementing the filter directly in the RecordReader. In
>> the RecordReader I cannot negotiate the filter with Hive, but I can
>> apply the parts of the filter I can handle to prune data early.
>>
>> Thank you for your support,
>> Petter
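For reference, a minimal sketch of the pattern Petter describes: reading
the serialized filter during reader setup and splitting it into a part the
reader can evaluate plus a residual. This assumes Hive 0.10-era APIs
(Utilities.deserializeExpression took a Configuration in that era; later
versions changed the signature), and the allowed column names and
comparison operators below are placeholders taken from this thread's
examples:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.hive.ql.exec.Utilities;
import org.apache.hadoop.hive.ql.index.IndexPredicateAnalyzer;
import org.apache.hadoop.hive.ql.index.IndexSearchCondition;
import org.apache.hadoop.hive.ql.plan.ExprNodeDesc;
import org.apache.hadoop.hive.ql.plan.TableScanDesc;

public class PushedFilterSetup {

  /** Splits the pushed-down filter into reader-evaluable conditions. */
  static List<IndexSearchCondition> analyzePushedFilter(JobConf jobConf) {
    List<IndexSearchCondition> conditions =
        new ArrayList<IndexSearchCondition>();

    // TableScanDesc.FILTER_EXPR_CONF_STR is "hive.io.filter.expr.serialized".
    String serialized = jobConf.get(TableScanDesc.FILTER_EXPR_CONF_STR);
    if (serialized == null) {
      return conditions; // nothing pushed down: scan everything
    }
    ExprNodeDesc filterExpr =
        Utilities.deserializeExpression(serialized, jobConf);

    // Declare the columns and comparison operators this reader understands;
    // anything else ends up in the residual, which Hive must re-evaluate
    // in its own Filter Operator.
    IndexPredicateAnalyzer analyzer = new IndexPredicateAnalyzer();
    analyzer.allowColumnName("column1");
    analyzer.allowColumnName("column2");
    analyzer.addComparisonOp(
        "org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPEqual");
    analyzer.addComparisonOp(
        "org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPLessThan");

    ExprNodeDesc residual = analyzer.analyzePredicate(filterExpr, conditions);
    // 'conditions' may be used for early pruning only; rows it rejects are
    // gone for every consumer of the scan, which is exactly the problem
    // discussed in this thread when two views share one physical scan.
    return conditions;
  }
}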
>> 2014/1/14 Navis류승우 <navis....@nexr.com>
>>
>>> In my try, it worked (and should work):
>>>
>>> CREATE EXTERNAL TABLE MasterTable (
>>>   column1 STRING, column2 STRING)
>>> LOCATION 'hdfs://localhost:9000/home/navis/my_location';
>>>
>>> CREATE VIEW IF NOT EXISTS View1 (column1, column2) AS
>>> SELECT column1, column2 FROM MasterTable WHERE column1 < 30;
>>>
>>> CREATE VIEW IF NOT EXISTS View2 (column1, column2) AS
>>> SELECT column1, column2 FROM MasterTable WHERE column1 < 100;
>>>
>>> SELECT View1.* FROM View1 JOIN View2 ON (View1.column1 = View2.column1);
>>>
>>> Below is the result of explain: a single whole scan of the master
>>> table, handled by two TS (TableScan) operators, each followed by a
>>> FIL (Filter) operator with the expected predicate.
>>>
>>> view1:view1:mastertable
>>>   TableScan
>>>     alias: mastertable
>>>     Filter Operator
>>>       predicate:
>>>           expr: (column1 < 30)
>>>           type: boolean
>>>
>>> view2:view2:mastertable
>>>   TableScan
>>>     alias: mastertable
>>>     Filter Operator
>>>       predicate:
>>>           expr: (column1 < 100)
>>>           type: boolean
>>>
>>> Truncated Path -> Alias:
>>>   hdfs://localhost:9000/home/navis/my_location
>>>   [view1:view1:mastertable, view2:view2:mastertable]
>>>
>>> Can I ask which version of Hive you are using?
>>>
>>> 2014/1/9 Petter von Dolwitz (Hem) <petter.von.dolw...@gmail.com>
>>>
>>>> Hi Navis (and others),
>>>>
>>>> it seems like my solution with views does not work after all. That
>>>> is, it works fine as long as I do not use filter pushdown. My setup
>>>> is something like below:
>>>>
>>>> CREATE EXTERNAL TABLE MasterTable (
>>>>   column1 STRING,
>>>>   column2 STRING,
>>>>   column3 STRING,
>>>>   column4 STRING)
>>>> PARTITIONED BY (partition INT)
>>>> ROW FORMAT SERDE 'MySerde'
>>>> STORED AS INPUTFORMAT 'MyInputFormat'
>>>> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
>>>> LOCATION 'my_location';
>>>>
>>>> CREATE VIEW IF NOT EXISTS View1
>>>> (column1, column2, column3, column4, partition)
>>>> PARTITIONED ON (partition)
>>>> AS SELECT column1, column2, column3, column4, partition
>>>> FROM MasterTable
>>>> WHERE column1 = 'value1' AND column2 = 'value2';
>>>>
>>>> CREATE VIEW IF NOT EXISTS View2
>>>> (column1, column2, column3, column4, partition)
>>>> PARTITIONED ON (partition)
>>>> AS SELECT column1, column2, column3, column4, partition
>>>> FROM MasterTable
>>>> WHERE column1 = 'value3' AND column2 = 'value4';
>>>>
>>>> The following query works fine without filter pushdown:
>>>>
>>>> SELECT View1.* FROM View1 JOIN View2 ON (View1.column3 = View2.column3);
>>>>
>>>> Now, if I enable filter pushdown (setting
>>>> hive.optimize.index.filter=true) and apply the filter in my record
>>>> reader, I do not get the correct result. I do not get any records
>>>> back at all. It seems like only the second filter (column1 = 'value3'
>>>> AND column2 = 'value4') is pushed to my record reader, and the
>>>> underlying file is only traversed once. I would have expected either
>>>> that an OR expression was pushed down ((column1 = 'value3' AND
>>>> column2 = 'value4') OR (column1 = 'value1' AND column2 = 'value2')),
>>>> or that the underlying file was scanned twice with each separate
>>>> expression.
>>>>
>>>> Do you have any thoughts on this?
>>>>
>>>> Thanks,
>>>> Petter
>>>>
>>>> 2013/12/22 Petter von Dolwitz (Hem) <petter.von.dolw...@gmail.com>
>>>>
>>>>> Hi Navis,
>>>>>
>>>>> thank you for sorting this out! I have tried to get around this by
>>>>> using views on a single master table in combination with UDFs
>>>>> instead. It seems to work so far.
>>>>>
>>>>> /Petter
>>>>>
>>>>> 2013/12/18 Navis류승우 <navis....@nexr.com>
>>>>>
>>>>>> Hive uses a path-to-table (or partition) mapping internally (you
>>>>>> can see that in MapredWork, etc.), which might cause the first
>>>>>> table to be overwritten by the other.
>>>>>>
>>>>>> I haven't tried symlinks on HDFS, which could be a solution.
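Navis's path-to-table point can be illustrated in isolation. This is a
plain-Java analogy for the overwrite behaviour he describes (MapredWork
keeps, roughly, a map keyed by input path), not the actual Hive source:

import java.util.LinkedHashMap;
import java.util.Map;

public class PathMappingCollision {
  public static void main(String[] args) {
    // Two tables declared over the same LOCATION collide on the same map
    // key; the lastly visited table silently replaces the first one.
    Map<String, String> pathToTableDesc = new LinkedHashMap<String, String>();
    pathToTableDesc.put("hdfs://nn/mylocation", "Table1 description");
    pathToTableDesc.put("hdfs://nn/mylocation", "Table2 description");
    // Prints {hdfs://nn/mylocation=Table2 description}: Table1 is gone.
    System.out.println(pathToTableDesc);
  }
}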
>>>>>> 2013/12/12 Petter von Dolwitz (Hem) <petter.von.dolw...@gmail.com>
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I have declared several external tables pointing to the same
>>>>>>> location. What tells these tables apart (apart from their names)
>>>>>>> is that they have unique properties. These properties help me
>>>>>>> choose the correct rows from the underlying file. I use a single
>>>>>>> storage handler (accompanied by a single InputFormat and a single
>>>>>>> SerDe). The first columns in all tables are the same, but the last
>>>>>>> (a struct) is unique and is constructed by the SerDe (with the
>>>>>>> help of the SerDe properties). A simplified version of the tables
>>>>>>> looks like this:
>>>>>>>
>>>>>>> CREATE EXTERNAL TABLE Table1 (
>>>>>>>   column1 STRING,
>>>>>>>   column2 STRING)
>>>>>>> STORED BY 'MyStorageHandler'
>>>>>>> WITH SERDEPROPERTIES ('ser.class'='MyStructSerializationClass1')
>>>>>>> LOCATION 'mylocation'
>>>>>>> TBLPROPERTIES ('recordreader.filter'='table1_filter');
>>>>>>>
>>>>>>> CREATE EXTERNAL TABLE Table2 (
>>>>>>>   column1 STRING,
>>>>>>>   column2 STRING)
>>>>>>> STORED BY 'MyStorageHandler'
>>>>>>> WITH SERDEPROPERTIES ('ser.class'='MyStructSerializationClass2')
>>>>>>> LOCATION 'mylocation'
>>>>>>> TBLPROPERTIES ('recordreader.filter'='table2_filter');
>>>>>>>
>>>>>>> All works well for simple SELECT queries against the two tables.
>>>>>>> The following query gives very strange results, though:
>>>>>>>
>>>>>>> SELECT * FROM (
>>>>>>>   SELECT column1, 'Table1' FROM Table1 WHERE column2 = 'myValue'
>>>>>>>   UNION ALL
>>>>>>>   SELECT column1, 'Table2' FROM Table2 WHERE column2 = 'myValue'
>>>>>>> ) my_union
>>>>>>> ORDER BY my_union.column1;
>>>>>>>
>>>>>>> It seems like one job task is created per file stored in the table
>>>>>>> location. Each task gets the table properties from the second
>>>>>>> table, and in the SerDe step later on the records seem to get
>>>>>>> mixed up.
>>>>>>>
>>>>>>> I would have expected Hive to iterate over the source files twice,
>>>>>>> using two different tasks (with the correct table properties
>>>>>>> passed), in order to get this to work.
>>>>>>>
>>>>>>> Can anyone here shed some light on this scenario?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Petter
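Given the behaviour described in this thread, one defensive pattern for a
reader in Petter's situation is to treat any pushed-down or table-level
filter as an optional optimization that can be switched off: skipping
early pruning is always correct (only slower), because Hive's own Filter
Operators still apply each consumer's predicate downstream. A sketch with
hypothetical names; the flag below is not a Hive property:

import org.apache.hadoop.conf.Configuration;

public class DefensiveFilteringReader {

  private final Configuration conf;

  public DefensiveFilteringReader(Configuration conf) {
    this.conf = conf;
  }

  /** Decide whether this reader may prune rows early. */
  boolean mayPruneEarly() {
    // Hypothetical switch: when several tables/views share one physical
    // scan, any single pushed filter can drop rows another consumer
    // needs, so pruning defaults to off. Returning every row is always
    // correct because Hive re-applies each view's predicate in its own
    // Filter Operator.
    return conf.getBoolean("myreader.pushdown.enabled", false);
  }
}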