[jira] [Work logged] (HIVE-24241) Enable SharedWorkOptimizer to merge downstream operators after an optimization step

ASF GitHub Bot (Jira) Thu, 29 Oct 2020 21:10:32 -0700


     [ 
https://issues.apache.org/jira/browse/HIVE-24241?focusedWorklogId=506526&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-506526
 ]


ASF GitHub Bot logged work on HIVE-24241:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 30/Oct/20 04:09
            Start Date: 30/Oct/20 04:09
    Worklog Time Spent: 10m 
      Work Description: jcamachor commented on a change in pull request #1562:
URL: https://github.com/apache/hive/pull/1562#discussion_r514803487



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/OperatorGraph.java
##########
@@ -0,0 +1,231 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.optimizer;
+
+import java.io.File;
+import java.io.PrintWriter;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.LinkedHashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+
+import org.apache.hadoop.hive.ql.exec.AppMasterEventOperator;
+import org.apache.hadoop.hive.ql.exec.Operator;
+import org.apache.hadoop.hive.ql.exec.ReduceSinkOperator;
+import org.apache.hadoop.hive.ql.exec.TableScanOperator;
+import 
org.apache.hadoop.hive.ql.optimizer.calcite.rules.HivePointLookupOptimizerRule.DiGraph;
+import org.apache.hadoop.hive.ql.parse.ParseContext;
+import org.apache.hadoop.hive.ql.parse.SemiJoinBranchInfo;
+import org.apache.hadoop.hive.ql.plan.DynamicPruningEventDesc;
+
+import com.google.common.collect.Sets;
+
+public class OperatorGraph {
+
+  /**
+   * A directed graph extended with support to check dag property.
+   */
+  static class DagGraph<V, E> extends DiGraph<V, E> {

Review comment:
       In the meantime, maybe DiGraph could be made a top class.

##########
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HivePointLookupOptimizerRule.java
##########
@@ -189,115 +189,121 @@ public RexNode analyzeRexNode(RexBuilder rexBuilder, 
RexNode condition) {
     return newCondition;
   }
 
-  /**
-   * Transforms inequality candidates into [NOT] BETWEEN calls.
-   *
-   */
-  protected static class RexTransformIntoBetween extends RexShuttle {
-    private final RexBuilder rexBuilder;
+  public static class DiGraph<V, E> {

Review comment:
       Left the comment in other class but I was thinking that it may be a good 
idea to promote this to top class (at least until we replace it by any other 
library version as we were discussing).

##########
File path: 
ql/src/test/results/clientpositive/llap/dynamic_partition_pruning.q.out
##########
@@ -4317,7 +4301,7 @@ STAGE PLANS:
                     outputColumnNames: ds
                     Statistics: Num rows: 2000 Data size: 389248 Basic stats: 
COMPLETE Column stats: COMPLETE
                     Group By Operator
-                      aggregations: max(ds)
+                      aggregations: min(ds)

Review comment:
       Any idea why this is happening?

##########
File path: 
ql/src/test/results/clientpositive/llap/dynamic_partition_pruning.q.out
##########
@@ -4277,37 +4277,21 @@ STAGE PLANS:
                   alias: srcpart
                   filterExpr: ds is not null (type: boolean)
                   Statistics: Num rows: 2000 Data size: 389248 Basic stats: 
COMPLETE Column stats: COMPLETE
-                  Filter Operator
-                    predicate: ds is not null (type: boolean)

Review comment:
       Note that the filter operator is removed. We need to be careful here 
because not all input formats guarantee that the filter expression is being 
applied / does not return false positives. I would expect the Filter remains 
but only a single time?

##########
File path: ql/src/test/results/clientpositive/perf/tez/query95.q.out
##########
@@ -128,7 +128,7 @@ Stage-0
                                   Select Operator [SEL_235] (rows=144002668 
width=7)
                                     Output:["_col0","_col1"]
                                     Filter Operator [FIL_234] (rows=144002668 
width=7)
-                                      predicate:(ws_order_number is not null 
and (ws_order_number is not null or ws_order_number is not null))
+                                      predicate:ws_order_number is not null

Review comment:
       good!

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/OperatorGraph.java
##########
@@ -0,0 +1,231 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.optimizer;
+
+import java.io.File;
+import java.io.PrintWriter;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.LinkedHashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+
+import org.apache.hadoop.hive.ql.exec.AppMasterEventOperator;
+import org.apache.hadoop.hive.ql.exec.Operator;
+import org.apache.hadoop.hive.ql.exec.ReduceSinkOperator;
+import org.apache.hadoop.hive.ql.exec.TableScanOperator;
+import 
org.apache.hadoop.hive.ql.optimizer.calcite.rules.HivePointLookupOptimizerRule.DiGraph;
+import org.apache.hadoop.hive.ql.parse.ParseContext;
+import org.apache.hadoop.hive.ql.parse.SemiJoinBranchInfo;
+import org.apache.hadoop.hive.ql.plan.DynamicPruningEventDesc;
+
+import com.google.common.collect.Sets;
+
+public class OperatorGraph {

Review comment:
       Looks good. Could we add comments so it is clear what it is being 
represented, e.g., what is a cluster, etc. 

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/OperatorGraph.java
##########
@@ -0,0 +1,231 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.optimizer;
+
+import java.io.File;
+import java.io.PrintWriter;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.LinkedHashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+
+import org.apache.hadoop.hive.ql.exec.AppMasterEventOperator;
+import org.apache.hadoop.hive.ql.exec.Operator;
+import org.apache.hadoop.hive.ql.exec.ReduceSinkOperator;
+import org.apache.hadoop.hive.ql.exec.TableScanOperator;
+import 
org.apache.hadoop.hive.ql.optimizer.calcite.rules.HivePointLookupOptimizerRule.DiGraph;
+import org.apache.hadoop.hive.ql.parse.ParseContext;
+import org.apache.hadoop.hive.ql.parse.SemiJoinBranchInfo;
+import org.apache.hadoop.hive.ql.plan.DynamicPruningEventDesc;
+
+import com.google.common.collect.Sets;
+
+public class OperatorGraph {
+
+  /**
+   * A directed graph extended with support to check dag property.
+   */
+  static class DagGraph<V, E> extends DiGraph<V, E> {
+
+    static class DagNode<V, E> extends Node<V, E> {
+      int dagIdx = 0;
+      public DagNode(V v) {
+        super(v);
+      }
+
+      @Override
+      public void addEdge(Edge<V, E> edge) {
+        if (edge.s == this) {
+          DagNode<V, E> t = (DagNode<V, E>) edge.t;
+          ensureDagIdxAtLeast(t.dagIdx + 1);
+          if (t.dagIdx > dagIdx) {
+            throw new IllegalArgumentException("adding this edge would violate 
dag properties");
+          }
+        }
+        super.addEdge(edge);
+      }
+
+      void ensureDagIdxAtLeast(int min) {
+        if (dagIdx >= min) {
+          return;
+        }
+        dagIdx = min;
+        for (Edge<V, E> e : edges) {
+          if(e.t == this) {
+            DagNode<V, E> s = (DagNode<V, E>) e.s;
+            s.ensureDagIdxAtLeast(min + 1);
+          }
+        }
+      }
+    }
+
+    @Override
+    protected Node<V, E> newNode(V s) {
+      return new DagNode<V, E>(s);
+    }
+  }
+
+  DagGraph<Operator<?>, OpEdge> g;
+
+  enum EdgeType {
+    FLOW, SEMIJOIN, DPP, TEST,
+  }
+
+  static class OpEdge {
+
+    private final EdgeType et;
+    private final int index;
+
+    public OpEdge(EdgeType et) {
+      this(et, 0);
+    }
+
+    public OpEdge(EdgeType et, int index) {
+      this.et = et;
+      this.index = index;
+    }
+
+  }
+
+
+  Map<Operator<?>, Cluster> nodeCluster = new HashMap<>();
+
+  public class Cluster {
+
+    Set<Operator<?>> members = new LinkedHashSet<>();
+
+    public void merge(Cluster o) {
+      for (Operator<?> node : o.members) {
+        add(node);
+      }
+      o.members.clear();
+    }
+
+    public void add(Operator<?> curr) {
+      nodeCluster.put(curr, this);
+      members.add(curr);
+    }
+
+  }
+
+
+  public OperatorGraph(ParseContext pctx) {
+    g = new DagGraph<Operator<?>, OperatorGraph.OpEdge>();
+    Set<Operator<?>> visited = Sets.newIdentityHashSet();
+    Set<Operator<?>> seen = Sets.newIdentityHashSet();
+
+    seen.addAll(pctx.getTopOps().values());
+    while (!seen.isEmpty()) {
+      Operator<?> curr = seen.iterator().next();
+      seen.remove(curr);
+      if (visited.contains(curr)) {
+        continue;
+      }
+
+      visited.add(curr);
+
+      Cluster currentCluster = nodeCluster.get(curr);
+      if (currentCluster == null) {
+        currentCluster=new Cluster();
+        currentCluster.add(curr);
+      }
+      List<Operator<?>> parents = curr.getParentOperators();
+      for (int i = 0; i < parents.size(); i++) {
+        Operator<?> p = parents.get(i);
+        g.putEdgeValue(p, curr, new OpEdge(EdgeType.FLOW, i));
+        if (p instanceof ReduceSinkOperator) {
+          // ignore cluster of parent RS
+          continue;
+        }
+        Cluster cluster = nodeCluster.get(p);
+        if (cluster != null) {
+          currentCluster.merge(cluster);
+        } else {
+          currentCluster.add(p);
+        }
+      }
+
+      SemiJoinBranchInfo sji = pctx.getRsToSemiJoinBranchInfo().get(curr);
+      if (sji != null) {
+        g.putEdgeValue(curr, sji.getTsOp(), new OpEdge(EdgeType.SEMIJOIN));
+        seen.add(sji.getTsOp());
+      }
+      if (curr instanceof AppMasterEventOperator) {
+        DynamicPruningEventDesc dped = (DynamicPruningEventDesc) 
curr.getConf();
+        TableScanOperator ts = dped.getTableScan();
+        g.putEdgeValue(curr, ts, new OpEdge(EdgeType.DPP));
+        seen.add(ts);
+      }
+
+      List<Operator<?>> ccc = curr.getChildOperators();
+      for (Operator<?> operator : ccc) {
+        seen.add(operator);
+      }
+    }
+  }
+
+  public void toDot(File outFile) throws Exception {

Review comment:
       Can we maybe move it to a utility class and add a comment? You could 
maybe move all these classes to `optimizer.graph` package

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/OperatorGraph.java
##########
@@ -0,0 +1,231 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.optimizer;
+
+import java.io.File;
+import java.io.PrintWriter;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.LinkedHashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+
+import org.apache.hadoop.hive.ql.exec.AppMasterEventOperator;
+import org.apache.hadoop.hive.ql.exec.Operator;
+import org.apache.hadoop.hive.ql.exec.ReduceSinkOperator;
+import org.apache.hadoop.hive.ql.exec.TableScanOperator;
+import 
org.apache.hadoop.hive.ql.optimizer.calcite.rules.HivePointLookupOptimizerRule.DiGraph;
+import org.apache.hadoop.hive.ql.parse.ParseContext;
+import org.apache.hadoop.hive.ql.parse.SemiJoinBranchInfo;
+import org.apache.hadoop.hive.ql.plan.DynamicPruningEventDesc;
+
+import com.google.common.collect.Sets;
+
+public class OperatorGraph {
+
+  /**
+   * A directed graph extended with support to check dag property.
+   */
+  static class DagGraph<V, E> extends DiGraph<V, E> {

Review comment:
       Yes, I think it's a good idea to use an established graph library for 
this. I think we could create a follow-up for that and discuss the license too, 
to make sure there is no issue. Any advantage to using jgrapht over Guava graph 
implementation, given that guava is already a Hive dependency and quite widely 
used?

##########
File path: 
ql/src/test/results/clientpositive/llap/vectorized_dynamic_partition_pruning.q.out
##########
@@ -4816,34 +4816,18 @@ STAGE PLANS:
                   alias: srcpart
                   filterExpr: ds is not null (type: boolean)
                   Statistics: Num rows: 2000 Data size: 389248 Basic stats: 
COMPLETE Column stats: COMPLETE
-                  Filter Operator
-                    predicate: ds is not null (type: boolean)

Review comment:
       Filter (same as mentioned previously).

##########
File path: ql/src/test/results/clientpositive/perf/tez/constraints/query61.q.out
##########
@@ -273,6 +273,6 @@ Stage-0
                                     Select Operator [SEL_273] (rows=501694138 
width=122)
                                       
Output:["_col0","_col1","_col2","_col3","_col4"]
                                       Filter Operator [FIL_271] 
(rows=501694138 width=122)
-                                        predicate:(ss_sold_date_sk is not null 
and ss_customer_sk is not null and ss_store_sk is not null)
+                                        predicate:(ss_sold_date_sk is not null 
and ss_customer_sk is not null and ss_store_sk is not null and ss_sold_date_sk 
BETWEEN DynamicValue(RS_64_date_dim_d_date_sk_min) AND 
DynamicValue(RS_64_date_dim_d_date_sk_max) and in_bloom_filter(ss_sold_date_sk, 
DynamicValue(RS_64_date_dim_d_date_sk_bloom_filter)))

Review comment:
       new SJ?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 506526)
    Time Spent: 1h 50m  (was: 1h 40m)

> Enable SharedWorkOptimizer to merge downstream operators after an 
> optimization step
> -----------------------------------------------------------------------------------
>
>                 Key: HIVE-24241
>                 URL: https://issues.apache.org/jira/browse/HIVE-24241
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Zoltan Haindrich
>            Assignee: Zoltan Haindrich
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1h 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Work logged] (HIVE-24241) Enable SharedWorkOptimizer to merge downstream operators after an optimization step

Reply via email to