[ 
https://issues.apache.org/jira/browse/HIVE-15026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carlos Martinez Moller updated HIVE-15026:
------------------------------------------
    Description: 
Note: I am trying to simplify a real case scenario we are having and 
simplifying the queries for the example. Hope they make sense and that the 
proposal I am doing can be understood. The real query is a lot more complex and 
long.

When performing a query of this type:

------------------------------------------------------------------
SELECT COLUMNA, COLUMNB, MAX (COLUMNC)
FROM TABLE_A
WHERE COLUMNA=1 AND COLUMND='Case 1'
UNION ALL
SELECT COLUMNA, COLUMNB, MAX (COLUMNC)
FROM TABLE_A
WHERE COLUMNA=10 AND COLUMNE='Case 2'
------------------------------------------------------------------

This creates Three Stages. First Stage is FULL SCAN of TABLE_A + Filter 
(COLUMNA=1/COLUMND='Case 1'),  Second Stage is FULL SCAN of TABLE_A again + 
Filter (COLUMNA=10/COLUMNE='Case 2'), and third stage is the UNION ALL.

TABLE_A has 2TB data of information.
But COLUMNA=1 and COLUMNA=10 filter all together only 2GB of information.

So I thought to use:

--------------------------------------------------------------
WITH TEMP_VIEW AS
(SELECT COLUMNA,COLUMNB,COLUMNC,COLUMND
FROM TABLE_A
WHERE COLUMNA=1 AND COLUMNA=10)
SELECT COLUMNA, COLUMNB, MAX (COLUMNC)
FROM TEMP_VIEW
WHERE COLUMNA=1 AND COLUMND='Case 1'
UNION ALL
SELECT COLUMNA, COLUMNB, MAX (COLUMNC)
FROM TEMP_VIEW
WHERE COLUMNA=10 AND COLUMNE='Case 2'
---------------------------------------------------------------

I thought that with this it would create 4 Stages:
- Stage 1: Full Scan of TABLE_A and generate intermediate data
- Stage 2: In the data of Stage 1 Filter (COLUMNA=1/COLUMND='Case 1')
- Stage 3: In the data of Stage 1 Filter (COLUMNA=10/COLUMNE='Case 2')
- Stage 4: Union ALL

With this instead of 4TB being read from disk, only 2TB+4GB (twice going 
through the view) would be read (In our case complexity is even bigger and we 
will be saving 20TB reading)

But it does the same than in the original query. It internally pushes the 
predicates of the "WITH" query in the two parts of the UNION.

It would be good to have a control on this, or for the optimizer to choose the 
best approach using histogram/statistics information.

For those knowing Oracle RDBMS this is equivalent to the MERGE/NO_MERGE and 
NEST behaviour:
http://www.dba-oracle.com/t_hint_no_merge.htm as an explanation...

Other approaches for my example could apply, as partitioning by COLUMNA of 
BUCKETING. But are not applicable in our case as COLUMNA is not commonly used 
when accessing this table.

The point of this JIRA is to add a functionality similar to the one of Oracle 
(not Merging the query, but generating an in-memory/disk temporary view) both 
for "WITH" clauses and VIEWS.

This is very very commonly used in Data Ware Houses managing big amounts of 
data and provides big performance benefits.





  was:
Note: I am trying to simplify a real case scenario we are having, but 
simplifying the queries for the example. Hope they make sense and that the 
problem itself can be explained. The real query is a lot more complex and long.

When performing a query of this type:

------------------------------------------------------------------
SELECT COLUMNA, COLUMNB, MAX (COLUMNC)
FROM TABLE_A
WHERE COLUMNA=1 AND COLUMND='Case 1'
UNION ALL
SELECT COLUMNA, COLUMNB, MAX (COLUMNC)
FROM TABLE_A
WHERE COLUMNA=10 AND COLUMNE='Case 2'
------------------------------------------------------------------

This creates Three Stages. First Stage is FULL SCAN of TABLE_A + Filter 
(COLUMNA=1/COLUMND='Case 1'),  Second Stage is FULL SCAN of TABLE_A again + 
Filter (COLUMNA=10/COLUMNE='Case 2'), and third stage is the UNION ALL.

TABLE_A has 2TB data of information.
But COLUMNA=1 and COLUMNA=10 filter all together only 2GB of information.

So I thought to use:

--------------------------------------------------------------
WITH TEMP_VIEW AS
(SELECT COLUMNA,COLUMNB,COLUMNC,COLUMND
FROM TABLE_A
WHERE COLUMNA=1 AND COLUMNA=10)
SELECT COLUMNA, COLUMNB, MAX (COLUMNC)
FROM TEMP_VIEW
WHERE COLUMNA=1 AND COLUMND='Case 1'
UNION ALL
SELECT COLUMNA, COLUMNB, MAX (COLUMNC)
FROM TEMP_VIEW
WHERE COLUMNA=10 AND COLUMNE='Case 2'
---------------------------------------------------------------

I thought that with this it would create 4 Stages:
- Stage 1: Full Scan of TABLE_A and generate intermediate data
- Stage 2: In the data of Stage 1 Filter (COLUMNA=1/COLUMND='Case 1')
- Stage 3: In the data of Stage 1 Filter (COLUMNA=10/COLUMNE='Case 2')
- Stage 4: Union ALL

With this instead of 4TB being read from disk, only 2TB+4GB (twice going 
through the view) would be read (In our case complexity is even bigger and we 
will be saving 20TB reading)

But it does the same than in the original query. It internally pushes the 
predicates of the "WITH" query in the two parts of the UNION.

It would be good to have a control on this, or for the optimizer to choose the 
best approach using histogram/statistics information.

For those knowing Oracle RDBMS this is equivalent to the MERGE/NO_MERGE and 
NEST behaviour:
http://www.dba-oracle.com/t_hint_no_merge.htm as an explanation...

Other approaches for my example could apply, as partitioning by COLUMNA of 
BUCKETING. But are not applicable in our case as COLUMNA is not commonly used 
when accessing this table.

The point of this JIRA is to add a functionality similar to the one of Oracle 
(not Merging the query, but generating an in-memory/disk temporary view) both 
for "WITH" clauses and VIEWS.

This is very very commonly used in Data Ware Houses managing big amounts of 
data and provides big performance benefits.






> Option to not merge the views
> -----------------------------
>
>                 Key: HIVE-15026
>                 URL: https://issues.apache.org/jira/browse/HIVE-15026
>             Project: Hive
>          Issue Type: Improvement
>          Components: Logical Optimizer, Physical Optimizer
>            Reporter: Carlos Martinez Moller
>
> Note: I am trying to simplify a real case scenario we are having and 
> simplifying the queries for the example. Hope they make sense and that the 
> proposal I am doing can be understood. The real query is a lot more complex 
> and long.
> When performing a query of this type:
> ------------------------------------------------------------------
> SELECT COLUMNA, COLUMNB, MAX (COLUMNC)
> FROM TABLE_A
> WHERE COLUMNA=1 AND COLUMND='Case 1'
> UNION ALL
> SELECT COLUMNA, COLUMNB, MAX (COLUMNC)
> FROM TABLE_A
> WHERE COLUMNA=10 AND COLUMNE='Case 2'
> ------------------------------------------------------------------
> This creates Three Stages. First Stage is FULL SCAN of TABLE_A + Filter 
> (COLUMNA=1/COLUMND='Case 1'),  Second Stage is FULL SCAN of TABLE_A again + 
> Filter (COLUMNA=10/COLUMNE='Case 2'), and third stage is the UNION ALL.
> TABLE_A has 2TB data of information.
> But COLUMNA=1 and COLUMNA=10 filter all together only 2GB of information.
> So I thought to use:
> --------------------------------------------------------------
> WITH TEMP_VIEW AS
> (SELECT COLUMNA,COLUMNB,COLUMNC,COLUMND
> FROM TABLE_A
> WHERE COLUMNA=1 AND COLUMNA=10)
> SELECT COLUMNA, COLUMNB, MAX (COLUMNC)
> FROM TEMP_VIEW
> WHERE COLUMNA=1 AND COLUMND='Case 1'
> UNION ALL
> SELECT COLUMNA, COLUMNB, MAX (COLUMNC)
> FROM TEMP_VIEW
> WHERE COLUMNA=10 AND COLUMNE='Case 2'
> ---------------------------------------------------------------
> I thought that with this it would create 4 Stages:
> - Stage 1: Full Scan of TABLE_A and generate intermediate data
> - Stage 2: In the data of Stage 1 Filter (COLUMNA=1/COLUMND='Case 1')
> - Stage 3: In the data of Stage 1 Filter (COLUMNA=10/COLUMNE='Case 2')
> - Stage 4: Union ALL
> With this instead of 4TB being read from disk, only 2TB+4GB (twice going 
> through the view) would be read (In our case complexity is even bigger and we 
> will be saving 20TB reading)
> But it does the same than in the original query. It internally pushes the 
> predicates of the "WITH" query in the two parts of the UNION.
> It would be good to have a control on this, or for the optimizer to choose 
> the best approach using histogram/statistics information.
> For those knowing Oracle RDBMS this is equivalent to the MERGE/NO_MERGE and 
> NEST behaviour:
> http://www.dba-oracle.com/t_hint_no_merge.htm as an explanation...
> Other approaches for my example could apply, as partitioning by COLUMNA of 
> BUCKETING. But are not applicable in our case as COLUMNA is not commonly used 
> when accessing this table.
> The point of this JIRA is to add a functionality similar to the one of Oracle 
> (not Merging the query, but generating an in-memory/disk temporary view) both 
> for "WITH" clauses and VIEWS.
> This is very very commonly used in Data Ware Houses managing big amounts of 
> data and provides big performance benefits.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to