Re: [PR] HIVE-28259: Common table expression detection and rewrites using CBO [hive]

via GitHub Sun, 07 Jul 2024 19:17:43 -0700


amansinha100 commented on code in PR #5249:
URL: https://github.com/apache/hive/pull/5249#discussion_r1667050773



##########
ql/src/test/results/clientpositive/perf/tpcds30tb/tez/cbo_query31.q.out:
##########
@@ -1,3 +1,27 @@
+CTE Suggestion 0:HiveProject(ss_addr_sk=[$5], ss_ext_sales_price=[$14], 
ss_sold_date_sk=[$22])
+  HiveFilter(condition=[AND(IS NOT NULL($22), IS NOT NULL($5))])
+    HiveTableScan(table=[[default, store_sales]], table:alias=[store_sales])
+
+CTE Suggestion 1:HiveProject(ws_bill_addr_sk=[$6], ws_ext_sales_price=[$22], 
ws_sold_date_sk=[$33])
+  HiveFilter(condition=[AND(IS NOT NULL($33), IS NOT NULL($6))])
+    HiveTableScan(table=[[default, web_sales]], table:alias=[web_sales])
+
+CTE Suggestion 2:HiveProject(ca_address_sk=[$0], ca_county=[$7])
+  HiveFilter(condition=[IS NOT NULL($7)])
+    HiveTableScan(table=[[default, customer_address]], 
table:alias=[customer_address])
+
+CTE Suggestion 3:HiveProject(d_date_sk=[$0])
+  HiveFilter(condition=[AND(=($10, 1), =($6, 2000))])
+    HiveTableScan(table=[[default, date_dim]], table:alias=[date_dim])
+
+CTE Suggestion 4:HiveProject(d_date_sk=[$0])
+  HiveFilter(condition=[AND(=($10, 3), =($6, 2000))])
+    HiveTableScan(table=[[default, date_dim]], table:alias=[date_dim])
+
+CTE Suggestion 5:HiveProject(d_date_sk=[$0])
+  HiveFilter(condition=[AND(=($10, 2), =($6, 2000))])
+    HiveTableScan(table=[[default, date_dim]], table:alias=[date_dim])
+

Review Comment:
   I suppose a more sophisticated suggester would combine CTEs 3, 4, 5 into a 
single one with an IN clause for the HiveFilter, i.e.  $10 IN (1, 2, 3) AND $6 
= 2000



##########
ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java:
##########
@@ -1737,6 +1747,14 @@ public RelNode apply(RelOptCluster cluster, RelOptSchema 
relOptSchema, SchemaPlu
       perfLogger.perfLogBegin(this.getClass().getName(), 
PerfLogger.POSTJOIN_ORDERING);
       calcitePlan = applyPostJoinOrderingTransform(calcitePlan, 
mdProvider.getMetadataProvider(), executorProvider);
       perfLogger.perfLogEnd(this.getClass().getName(), 
PerfLogger.POSTJOIN_ORDERING);
+      // Perform the CTE rewriting near the end of CBO transformations to 
avoid interference of the new HiveTableSpool 
+      // operator with other rules (especially those related to constant 
folding and branch pruning).

Review Comment:
   It would be useful to have a simple q test with branch pruning on both sides 
of a UNION ALL (both sides have CTE with FALSE predicate).. in that case the  
branches should be pruned first and the CTE part should be skipped altogether. 



##########
ql/src/test/queries/clientpositive/perf/cbo_query11.q:
##########
@@ -1,3 +1,4 @@
+set 
hive.optimize.cte.suggester.class=org.apache.hadoop.hive.ql.optimizer.calcite.CommonTableExpressionPrintSuggester;

Review Comment:
   In terms of the q file changes, this setting is replicated in all the 
cbo_queryxx files but would it make sense to just have a separate folder 
perf/cte with targeted set of tpcds tests where this config option is set ?  It 
is possible we may add other cte specific configurations (e.g some limit on the 
number of CTEs to consider to reduce planning time), and it may be better to 
have an isolated suite. 



##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/views/HiveMaterializedViewUtils.java:
##########
@@ -536,4 +545,33 @@ private static Map<String, SnapshotContext> 
getSnapshotOf(Hive db, Set<TableName
     }
     return snapshot;
   }
+
+  public static RelOptMaterialization createCTEMaterialization(String 
viewName, RelNode body, HiveConf conf) {
+    RelOptCluster cluster = body.getCluster();
+    List<ColumnInfo> columns = new ArrayList<>();
+    for (RelDataTypeField f : body.getRowType().getFieldList()) {
+      TypeInfo info = TypeConverter.convert(f.getType());
+      columns.add(new ColumnInfo(f.getName(), info, f.getType().isNullable(), 
viewName, false, false));
+    }
+    List<String> fullName = Arrays.asList("cte", viewName);
+    org.apache.hadoop.hive.metastore.api.Table metaTable = 
Table.getEmptyTable("cte", viewName);
+    metaTable.setTemporary(true);
+    try {
+      // Setting a location avoids a NPE when fetching statistics
+      
metaTable.getSd().setLocation(SessionState.generateTempTableLocation(conf));
+    } catch (MetaException e) {
+      throw new RuntimeException(e);
+    }
+    Table hiveTable = new Table(metaTable);
+    hiveTable.setMaterializedTable(true);
+    RelOptHiveTable optTable =
+        new RelOptHiveTable(null, cluster.getTypeFactory(), fullName, 
body.getRowType(), hiveTable, columns,
+            Collections.emptyList(), Collections.emptyList(), new HiveConf(), 
Hive.getThreadLocal(),
+            new QueryTables(true), new HashMap<>(), new HashMap<>(), new 
AtomicInteger());
+    optTable.setRowCount(cluster.getMetadataQuery().getRowCount(body));

Review Comment:
   Besides the rowcount, do we want to propagate other properties to the CTE 
such as distribution and collation ? If these are not expected to be used for 
the TableSpool, then it does not matter. It could be a future area of 
improvement. 



##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/CommonTableExpressionRegistry.java:
##########
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to you under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite;
+
+import org.apache.calcite.rel.RelNode;
+import org.apache.hadoop.hive.common.classification.InterfaceAudience;
+
+import java.util.HashMap;
+import java.util.Map;
+import java.util.stream.Stream;
+
+/**
+ * A registry of common table expressions for a given query.
+ * <p>The registry is only meant to hold common expressions for a single 
query.</p>
+ * <p>The class is not thread-safe.</p>
+ */
[email protected]
+public final class CommonTableExpressionRegistry {
+  /**
+   * A unique collection of common table expressions.
+   * <p>The expressions may contain internal planning concepts such as {@link 
org.apache.calcite.plan.hep.HepRelVertex}.
+   * </p>
+   */
+  private final Map<String, RelNode> ctes = new HashMap<>();

Review Comment:
   Currently, all CTEs are weighted equally.  A potentially future enhancement 
could be one where there is weight/rank associated with it which would feed 
into the optimizer decisions. This only matters if we want to  restrict the 
number of CTEs to be considered (for shorter compilation time). 



##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/TableScanToSpoolRule.java:
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to you under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import org.apache.calcite.plan.RelOptMaterialization;
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.prepare.RelOptTableImpl;
+import org.apache.calcite.rel.core.Spool;
+import org.apache.calcite.rel.core.TableScan;
+
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+
+import static 
org.apache.hadoop.hive.ql.optimizer.calcite.HiveRelFactories.HIVE_SPOOL_FACTORY;
+
+public class TableScanToSpoolRule extends RelOptRule {
+  /**
+   * Track created spools to avoid introducing more than one.
+   */
+  private final Set<String> spools = new HashSet<>();
+  private final int referenceThreshold;
+  private final Map<List<String>, Long> tableOccurrences;
+  public TableScanToSpoolRule(Map<List<String>, Long> tableOccurrences, int 
referenceThreshold) {
+    super(operand(TableScan.class, none()));
+    this.referenceThreshold = referenceThreshold;
+    this.tableOccurrences = tableOccurrences;
+    assert referenceThreshold > 0;

Review Comment:
   nit: should this assert for > 1 ?  Since CTE should only be applicable if 
there are more than 1 references. 



##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/CommonRelSubExprRegisterRule.java:
##########
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to you under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import org.apache.calcite.plan.CommonRelSubExprRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.plan.RelOptRuleOperand;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.core.Aggregate;
+import org.apache.calcite.rel.core.Filter;
+import org.apache.calcite.rel.core.Join;
+import org.apache.calcite.rel.core.Project;
+import org.apache.calcite.rel.metadata.RelMetadataQuery;
+import 
org.apache.hadoop.hive.ql.optimizer.calcite.CommonTableExpressionRegistry;
+
+import com.google.common.collect.Multimap;
+
+import java.util.function.Predicate;
+
+/**
+ * Rule for saving relational expressions that appear more than once in a 
query tree to the planner context.
+ */
+public final class CommonRelSubExprRegisterRule extends CommonRelSubExprRule {
+  public static final CommonRelSubExprRegisterRule JOIN = new 
CommonRelSubExprRegisterRule(operand(Join.class, any()));

Review Comment:
   Are we supporting any type of joins in the CTE ?  IIRC for MVs, there was 
some restriction around full outer join (need to confirm this). 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] HIVE-28259: Common table expression detection and rewrites using CBO [hive]

Reply via email to