[ https://issues.apache.org/jira/browse/HIVE-15026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592677#comment-15592677 ]
Xuefu Zhang commented on HIVE-15026: ------------------------------------ In general, Hive "expands" view/cte in the logical layer as they are not materialized. However, the problem isn't applicable for Hive on Spark though I know it's there for MapReduce. Hive on Spark caches the result if it determines that a temp result (such as the output of your common table expression) is reusable. If you're using other engines, you might consider create a temp table (so as to materialize the view/cte) and use it in your subsequent processing. > Option to not merge the views > ----------------------------- > > Key: HIVE-15026 > URL: https://issues.apache.org/jira/browse/HIVE-15026 > Project: Hive > Issue Type: Improvement > Components: Logical Optimizer, Physical Optimizer > Reporter: Carlos Martinez Moller > > Note: I am trying to simplify a real case scenario we are having and > simplifying the queries for the example. Hope they make sense and that the > proposal I am doing can be understood. The real query is a lot more complex > and long. > When performing a query of this type: > ------------------------------------------------------------------ > SELECT COLUMNA, COLUMNB, MAX (COLUMNC) > FROM TABLE_A > WHERE COLUMNA=1 AND COLUMND='Case 1' > UNION ALL > SELECT COLUMNA, COLUMNB, MAX (COLUMNC) > FROM TABLE_A > WHERE COLUMNA=10 AND COLUMNE='Case 2' > ------------------------------------------------------------------ > This creates Three Stages. First Stage is FULL SCAN of TABLE_A + Filter > (COLUMNA=1/COLUMND='Case 1'), Second Stage is FULL SCAN of TABLE_A again + > Filter (COLUMNA=10/COLUMNE='Case 2'), and third stage is the UNION ALL. > TABLE_A has 2TB data of information. > But COLUMNA=1 and COLUMNA=10 filter all together only 2GB of information. > So I thought to use: > -------------------------------------------------------------- > WITH TEMP_VIEW AS > (SELECT COLUMNA,COLUMNB,COLUMNC,COLUMND > FROM TABLE_A > WHERE COLUMNA=1 AND COLUMNA=10) > SELECT COLUMNA, COLUMNB, MAX (COLUMNC) > FROM TEMP_VIEW > WHERE COLUMNA=1 AND COLUMND='Case 1' > UNION ALL > SELECT COLUMNA, COLUMNB, MAX (COLUMNC) > FROM TEMP_VIEW > WHERE COLUMNA=10 AND COLUMNE='Case 2' > --------------------------------------------------------------- > I thought that with this it would create 4 Stages: > - Stage 1: Full Scan of TABLE_A and generate intermediate data > - Stage 2: In the data of Stage 1 Filter (COLUMNA=1/COLUMND='Case 1') > - Stage 3: In the data of Stage 1 Filter (COLUMNA=10/COLUMNE='Case 2') > - Stage 4: Union ALL > With this instead of 4TB being read from disk, only 2TB+4GB (twice going > through the view) would be read (In our case complexity is even bigger and we > will be saving 20TB reading) > But it does the same than in the original query. It internally pushes the > predicates of the "WITH" query in the two parts of the UNION. > It would be good to have a control on this, or for the optimizer to choose > the best approach using histogram/statistics information. > For those knowing Oracle RDBMS this is equivalent to the MERGE/NO_MERGE and > NEST behaviour: > http://www.dba-oracle.com/t_hint_no_merge.htm as an explanation... > Other approaches for my example could apply, as partitioning by COLUMNA of > BUCKETING. But are not applicable in our case as COLUMNA is not commonly used > when accessing this table. > The point of this JIRA is to add a functionality similar to the one of Oracle > (not Merging the query, but generating an in-memory/disk temporary view) both > for "WITH" clauses and VIEWS. > This is very very commonly used in Data Ware Houses managing big amounts of > data and provides big performance benefits. -- This message was sent by Atlassian JIRA (v6.3.4#6332)