Views and partitions and performance

Burak Gürsoy Tue, 11 Feb 2014 08:06:04 -0800

Hi,

Some background information:


We have a really large base table (~1P) — (lets call it raw_table)
partitioned like yyyy_mm_dd=2014-02-11/hh=12/flag=1 (or 
yyyy_mm_dd=2014-02-11/hh=12/flag=0)

I’m testing creating a view on that table to be able to have a consistent
user accessible interface instead of letting the users to use the actual 
table (we’ve changed the format of the table multiple times in the past
and we’ll possibly do so again).

I am creating the view like this[1] and I’ve also tested view partitions[2].

The version of Hive is 0.10.0

The issues I came across so far are:

1) I couldn’t figure out the actual purpose of the view partitions.
The document[3] lists several use cases and mentioned
“infer view partitions automatically based on the partitions of the underlying 
tables"
which leads to an open ticket[4] which in turn lists another one (HIVE-1941)
which is closed. However I fail to see how to create a derived view partition.
And with the definition I’ve used[2] they don’t seem to have the same effect.

2) You need to specify the partitions in the query to the view explicitly
to have the partition pruning come into effect. However, when I
say “select * from raw_table limit 10” it selects a subset somehow 
(explain and explain extended does not show any targeted partitions with
such queries — is this a bug?). But doing the same query on the view leads 
to a full table scan.

2.1) I see a similar behaviour with things like 
“select ... from myview tablesample (bucket 1 out of $bucketsize on $field)

I’m wondering if this is just the way the views work at the moment or is
there a workaround to improve, lets say: silly queries? As the interface 
seems inconsistent between the table and view access.

Thanks,
Burak



=========[1]=========

CREATE VIEW myview AS SELECT  `combined`.`column1`,
        `combined`.`column2`,
        `combined`.`yyyy_mm_dd`,
        `combined`.`hh`,
        `combined`.`column3`,
        `combined`.`column4`,
        `combined`.`column5`
FROM
(
    SELECT
        `raw`.`column1`,
        `raw`.`column2`,
        `raw`.`yyyy_mm_dd`,
        `raw`.`hh`,
        `raw`.`column3`,
        `raw`.`column4`,
        `raw`.`column5`
    FROM  `default`.`raw_table` `raw`
) `combined`

=========[2]=========

CREATE VIEW myview
PARTITIONED ON (
        yyyy_mm_dd,
        hh,
        type,
        persona,
        dc
)
AS ( ... )


=========[3]=========[

https://cwiki.apache.org/confluence/display/Hive/PartitionedViews

Views and partitions and performance

Reply via email to