amansinha100 commented on code in PR #5249:
URL: https://github.com/apache/hive/pull/5249#discussion_r1667050773
##########
ql/src/test/results/clientpositive/perf/tpcds30tb/tez/cbo_query31.q.out:
##########
@@ -1,3 +1,27 @@
+CTE Suggestion 0:HiveProject(ss_addr_sk=[$5], ss_ext_sales_price=[$14],
ss_sold_date_sk=[$22])
+ HiveFilter(condition=[AND(IS NOT NULL($22), IS NOT NULL($5))])
+ HiveTableScan(table=[[default, store_sales]], table:alias=[store_sales])
+
+CTE Suggestion 1:HiveProject(ws_bill_addr_sk=[$6], ws_ext_sales_price=[$22],
ws_sold_date_sk=[$33])
+ HiveFilter(condition=[AND(IS NOT NULL($33), IS NOT NULL($6))])
+ HiveTableScan(table=[[default, web_sales]], table:alias=[web_sales])
+
+CTE Suggestion 2:HiveProject(ca_address_sk=[$0], ca_county=[$7])
+ HiveFilter(condition=[IS NOT NULL($7)])
+ HiveTableScan(table=[[default, customer_address]],
table:alias=[customer_address])
+
+CTE Suggestion 3:HiveProject(d_date_sk=[$0])
+ HiveFilter(condition=[AND(=($10, 1), =($6, 2000))])
+ HiveTableScan(table=[[default, date_dim]], table:alias=[date_dim])
+
+CTE Suggestion 4:HiveProject(d_date_sk=[$0])
+ HiveFilter(condition=[AND(=($10, 3), =($6, 2000))])
+ HiveTableScan(table=[[default, date_dim]], table:alias=[date_dim])
+
+CTE Suggestion 5:HiveProject(d_date_sk=[$0])
+ HiveFilter(condition=[AND(=($10, 2), =($6, 2000))])
+ HiveTableScan(table=[[default, date_dim]], table:alias=[date_dim])
+
Review Comment:
I suppose a more sophisticated suggester would combine CTEs 3, 4, 5 into a
single one with an IN clause for the HiveFilter, i.e. $10 IN (1, 2, 3) AND $6
= 2000
##########
ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java:
##########
@@ -1737,6 +1747,14 @@ public RelNode apply(RelOptCluster cluster, RelOptSchema
relOptSchema, SchemaPlu
perfLogger.perfLogBegin(this.getClass().getName(),
PerfLogger.POSTJOIN_ORDERING);
calcitePlan = applyPostJoinOrderingTransform(calcitePlan,
mdProvider.getMetadataProvider(), executorProvider);
perfLogger.perfLogEnd(this.getClass().getName(),
PerfLogger.POSTJOIN_ORDERING);
+ // Perform the CTE rewriting near the end of CBO transformations to
avoid interference of the new HiveTableSpool
+ // operator with other rules (especially those related to constant
folding and branch pruning).
Review Comment:
It would be useful to have a simple q test with branch pruning on both sides
of a UNION ALL (both sides have CTE with FALSE predicate).. in that case the
branches should be pruned first and the CTE part should be skipped altogether.
##########
ql/src/test/queries/clientpositive/perf/cbo_query11.q:
##########
@@ -1,3 +1,4 @@
+set
hive.optimize.cte.suggester.class=org.apache.hadoop.hive.ql.optimizer.calcite.CommonTableExpressionPrintSuggester;
Review Comment:
In terms of the q file changes, this setting is replicated in all the
cbo_queryxx files but would it make sense to just have a separate folder
perf/cte with targeted set of tpcds tests where this config option is set ? It
is possible we may add other cte specific configurations (e.g some limit on the
number of CTEs to consider to reduce planning time), and it may be better to
have an isolated suite.
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/views/HiveMaterializedViewUtils.java:
##########
@@ -536,4 +545,33 @@ private static Map<String, SnapshotContext>
getSnapshotOf(Hive db, Set<TableName
}
return snapshot;
}
+
+ public static RelOptMaterialization createCTEMaterialization(String
viewName, RelNode body, HiveConf conf) {
+ RelOptCluster cluster = body.getCluster();
+ List<ColumnInfo> columns = new ArrayList<>();
+ for (RelDataTypeField f : body.getRowType().getFieldList()) {
+ TypeInfo info = TypeConverter.convert(f.getType());
+ columns.add(new ColumnInfo(f.getName(), info, f.getType().isNullable(),
viewName, false, false));
+ }
+ List<String> fullName = Arrays.asList("cte", viewName);
+ org.apache.hadoop.hive.metastore.api.Table metaTable =
Table.getEmptyTable("cte", viewName);
+ metaTable.setTemporary(true);
+ try {
+ // Setting a location avoids a NPE when fetching statistics
+
metaTable.getSd().setLocation(SessionState.generateTempTableLocation(conf));
+ } catch (MetaException e) {
+ throw new RuntimeException(e);
+ }
+ Table hiveTable = new Table(metaTable);
+ hiveTable.setMaterializedTable(true);
+ RelOptHiveTable optTable =
+ new RelOptHiveTable(null, cluster.getTypeFactory(), fullName,
body.getRowType(), hiveTable, columns,
+ Collections.emptyList(), Collections.emptyList(), new HiveConf(),
Hive.getThreadLocal(),
+ new QueryTables(true), new HashMap<>(), new HashMap<>(), new
AtomicInteger());
+ optTable.setRowCount(cluster.getMetadataQuery().getRowCount(body));
Review Comment:
Besides the rowcount, do we want to propagate other properties to the CTE
such as distribution and collation ? If these are not expected to be used for
the TableSpool, then it does not matter. It could be a future area of
improvement.
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/CommonTableExpressionRegistry.java:
##########
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to you under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite;
+
+import org.apache.calcite.rel.RelNode;
+import org.apache.hadoop.hive.common.classification.InterfaceAudience;
+
+import java.util.HashMap;
+import java.util.Map;
+import java.util.stream.Stream;
+
+/**
+ * A registry of common table expressions for a given query.
+ * <p>The registry is only meant to hold common expressions for a single
query.</p>
+ * <p>The class is not thread-safe.</p>
+ */
[email protected]
+public final class CommonTableExpressionRegistry {
+ /**
+ * A unique collection of common table expressions.
+ * <p>The expressions may contain internal planning concepts such as {@link
org.apache.calcite.plan.hep.HepRelVertex}.
+ * </p>
+ */
+ private final Map<String, RelNode> ctes = new HashMap<>();
Review Comment:
Currently, all CTEs are weighted equally. A potentially future enhancement
could be one where there is weight/rank associated with it which would feed
into the optimizer decisions. This only matters if we want to restrict the
number of CTEs to be considered (for shorter compilation time).
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/TableScanToSpoolRule.java:
##########
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to you under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import org.apache.calcite.plan.RelOptMaterialization;
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.prepare.RelOptTableImpl;
+import org.apache.calcite.rel.core.Spool;
+import org.apache.calcite.rel.core.TableScan;
+
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+
+import static
org.apache.hadoop.hive.ql.optimizer.calcite.HiveRelFactories.HIVE_SPOOL_FACTORY;
+
+public class TableScanToSpoolRule extends RelOptRule {
+ /**
+ * Track created spools to avoid introducing more than one.
+ */
+ private final Set<String> spools = new HashSet<>();
+ private final int referenceThreshold;
+ private final Map<List<String>, Long> tableOccurrences;
+ public TableScanToSpoolRule(Map<List<String>, Long> tableOccurrences, int
referenceThreshold) {
+ super(operand(TableScan.class, none()));
+ this.referenceThreshold = referenceThreshold;
+ this.tableOccurrences = tableOccurrences;
+ assert referenceThreshold > 0;
Review Comment:
nit: should this assert for > 1 ? Since CTE should only be applicable if
there are more than 1 references.
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/CommonRelSubExprRegisterRule.java:
##########
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to you under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import org.apache.calcite.plan.CommonRelSubExprRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.plan.RelOptRuleOperand;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.core.Aggregate;
+import org.apache.calcite.rel.core.Filter;
+import org.apache.calcite.rel.core.Join;
+import org.apache.calcite.rel.core.Project;
+import org.apache.calcite.rel.metadata.RelMetadataQuery;
+import
org.apache.hadoop.hive.ql.optimizer.calcite.CommonTableExpressionRegistry;
+
+import com.google.common.collect.Multimap;
+
+import java.util.function.Predicate;
+
+/**
+ * Rule for saving relational expressions that appear more than once in a
query tree to the planner context.
+ */
+public final class CommonRelSubExprRegisterRule extends CommonRelSubExprRule {
+ public static final CommonRelSubExprRegisterRule JOIN = new
CommonRelSubExprRegisterRule(operand(Join.class, any()));
Review Comment:
Are we supporting any type of joins in the CTE ? IIRC for MVs, there was
some restriction around full outer join (need to confirm this).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]