zabetak commented on code in PR #5249:
URL: https://github.com/apache/hive/pull/5249#discussion_r1685370053


##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/views/HiveMaterializedViewUtils.java:
##########
@@ -536,4 +545,33 @@ private static Map<String, SnapshotContext> getSnapshotOf(Hive db, Set<TableName
     }
     return snapshot;
   }
+
+  public static RelOptMaterialization createCTEMaterialization(String viewName, RelNode body, HiveConf conf) {
+    RelOptCluster cluster = body.getCluster();
+    List<ColumnInfo> columns = new ArrayList<>();
+    for (RelDataTypeField f : body.getRowType().getFieldList()) {
+      TypeInfo info = TypeConverter.convert(f.getType());
+      columns.add(new ColumnInfo(f.getName(), info, f.getType().isNullable(), viewName, false, false));
+    }
+    List<String> fullName = Arrays.asList("cte", viewName);
+    org.apache.hadoop.hive.metastore.api.Table metaTable = Table.getEmptyTable("cte", viewName);
+    metaTable.setTemporary(true);
+    try {
+      // Setting a location avoids a NPE when fetching statistics
+      metaTable.getSd().setLocation(SessionState.generateTempTableLocation(conf));
+    } catch (MetaException e) {
+      throw new RuntimeException(e);
+    }
+    Table hiveTable = new Table(metaTable);
+    hiveTable.setMaterializedTable(true);
+    RelOptHiveTable optTable =
+        new RelOptHiveTable(null, cluster.getTypeFactory(), fullName, body.getRowType(), hiveTable, columns,
+            Collections.emptyList(), Collections.emptyList(), new HiveConf(), Hive.getThreadLocal(),
+            new QueryTables(true), new HashMap<>(), new HashMap<>(), new AtomicInteger());
+    optTable.setRowCount(cluster.getMetadataQuery().getRowCount(body));

Review Comment:
   This is a very good question. The short answer is that at the moment there 
is probably no benefit in doing so. However, as this feature evolves (better 
cost model, implementation of `hive.cbo.returnpath.hiveop` for CTEs), it may be 
beneficial to consider traits. Below are longer answers to more specific parts 
of this question, covering the current state and what we could possibly do in 
the future.
   
   **Why are we explicitly setting the row count here?** 
   Setting the row count mainly helps to break ties across CTEs. Since the 
`RelOptHiveTable` that is created here does not (yet) have a physical 
equivalent, the optimizer would otherwise attempt to gather statistics 
following the default logic, which will probably trigger metastore (and 
possibly HDFS) calls, and the result will be similar to that of an empty table 
(~zero rows). This is problematic because it incurs unnecessary overhead and 
gives the same cost to every CTE.
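   
   To illustrate the tie-breaking point, here is a minimal, self-contained 
sketch. It is a toy model and not Hive's or Calcite's cost model: the class 
`CteTieBreakSketch`, the `cost` function (cost proportional to row count), the 
CTE names, and the row counts are all hypothetical. It only shows that when 
every candidate reports ~zero rows the candidates cannot be told apart, while 
estimates taken from each body make them distinguishable.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model only; none of this is Hive or Calcite API. */
final class CteTieBreakSketch {

  /** Hypothetical cost: proportional to the rows a CTE would produce. */
  static double cost(double rowCount) {
    return rowCount;
  }

  /** Returns the names of all candidates that share the minimum cost. */
  static List<String> cheapest(Map<String, Double> rowCounts) {
    double best = Double.POSITIVE_INFINITY;
    for (double rows : rowCounts.values()) {
      best = Math.min(best, cost(rows));
    }
    List<String> winners = new ArrayList<>();
    for (Map.Entry<String, Double> e : rowCounts.entrySet()) {
      if (cost(e.getValue()) == best) {
        winners.add(e.getKey());
      }
    }
    return winners;
  }

  public static void main(String[] args) {
    // Without an explicit row count, every CTE looks like an empty table.
    Map<String, Double> defaults = new LinkedHashMap<>();
    defaults.put("cte_a", 0.0);
    defaults.put("cte_b", 0.0);
    System.out.println(cheapest(defaults)); // prints [cte_a, cte_b]

    // With row counts estimated from each body, the tie disappears.
    Map<String, Double> estimated = new LinkedHashMap<>();
    estimated.put("cte_a", 1200.0);
    estimated.put("cte_b", 35.0);
    System.out.println(cheapest(estimated)); // prints [cte_b]
  }
}
```

   Which direction the real cost model prefers is outside the scope of this 
sketch; the only point is that identical row counts leave nothing to rank the 
candidates by.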
   
   **How is the row count estimated?**
   The row count is estimated by using `HiveDefaultRelMetadataProvider` and 
the logic in `HiveRelMdRowCount`, `HiveRelMdDistinctRowCount`, etc. If we 
wanted, we could plug in another metadata provider, but at the moment I don't 
find it necessary.
   
   **Should we propagate distribution and collation?**
   Trait propagation is not necessary at the moment. The distribution and 
collation traits are not used by the CTE rewriting transformation. In addition, 
the CTE rewriting phase is applied at the end of the CBO transformations, so 
even if the traits are lost, there are no subsequent steps to exploit them.
   
   **Can we guarantee trait propagation when using CTEs?**
   If the `body` of the view/CTE carries some traits, this part of the code 
leaves them intact. However, since the `body` is created by the suggester, the 
traits are not necessarily retained; the propagation behavior is strongly 
implementation dependent. Some suggesters may be able to preserve traits while 
others may destroy them completely, with the latter being more plausible.
   
   **Is the TableSpool using traits in some way?**
   No, because, as explained above, traits are not used in the CTE rewriting 
phase.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

