This is an automated email from the ASF dual-hosted git repository.

stigahuang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git


The following commit(s) were added to refs/heads/master by this push:
     new 901ce8ac4 IMPALA-12006: Improve cardinality estimation for joins 
involving multiple conjuncts
901ce8ac4 is described below

commit 901ce8ac4fd60daaa81ded2d6238163dab0c2a30
Author: Aman Sinha <[email protected]>
AuthorDate: Mon Apr 3 00:23:14 2023 -0700

    IMPALA-12006: Improve cardinality estimation for joins involving multiple 
conjuncts
    
    When an inner or outer join involves conjuncts such as the following:
     SELECT * FROM t1 inner join
      (SELECT a2, MAX(b2) as max_b2 FROM t2 GROUP BY a2) s2
      ON t1.a1 = s2.a2 AND t1.b1 = s2.max_b2
    the stats for the second conjunct involving the MAX expression gets
    added to the 'otherEqJoinStats' list. These stats were being used
    only when no other equijoin conjunct (involving base columns) was
    present but not for the above case, thus leading to over-estimation.
    
    The main change in this patch is an improvement to the cardinality
    estimate for such cases by considering the conjuncts in the
    'otherEqJoinStats list' in combination with the equijoin conjuncts.
    
    A second change is the addition of a new query option which is a
    floating point number between 0 to 1 inclusive:
      join_selectivity_correlation_factor
    It defaults to 0.0 which preserves the existing behavior of using
    the minimum selectivity of the conjuncts. Given multiple
    join conjuncts C1,C2..Cn having comparable selectivity, we sometimes
    see over-estimation by several orders of magnitude. Setting the above
    query option to a value higher than 0 first computes the product of the
    selectivities: sel(C1) * sel(C2) .. sel(Cn)  and then scales it by
    dividing by the join_selectivity_correlation_factor.
    
    Note that this setting applies to all the joins in the query, so it has
    its trade-offs. Another approach is to use per-join hints, but there
    are limitations to the hints approach too.
    
    Testing:
     - Added planner tests with a combination of outer and inner join with
       different values of join_selectivity_correlation_factor
     - Ran PlannerTest and TpcdsPlannerTest
    
    Change-Id: I845d778a58404af834f7501fc8157a5a4b4bcc35
    Reviewed-on: http://gerrit.cloudera.org:8080/19682
    Tested-by: Impala Public Jenkins <[email protected]>
    Reviewed-by: Kurt Deschler <[email protected]>
    Reviewed-by: Quanlong Huang <[email protected]>
---
 be/src/service/query-options.cc                    |   7 +
 be/src/service/query-options.h                     |   4 +-
 common/thrift/ImpalaService.thrift                 |   9 +
 common/thrift/Query.thrift                         |   3 +
 .../java/org/apache/impala/planner/JoinNode.java   |  69 +++++--
 .../queries/PlannerTest/card-outer-join.test       | 222 +++++++++++++++++++++
 6 files changed, 297 insertions(+), 17 deletions(-)

diff --git a/be/src/service/query-options.cc b/be/src/service/query-options.cc
index 3e98dc619..201962fdf 100644
--- a/be/src/service/query-options.cc
+++ b/be/src/service/query-options.cc
@@ -1076,6 +1076,13 @@ Status impala::SetQueryOption(const string& key, const 
string& value,
         query_options->__set_processing_cost_min_threads(min_num);
         break;
       }
+      case TImpalaQueryOptions::JOIN_SELECTIVITY_CORRELATION_FACTOR: {
+        double double_val = 0.0f;
+        RETURN_IF_ERROR(QueryOptionParser::ParseAndCheckInclusiveRange<double>(
+            option, value, 0.0, 1.0, &double_val));
+        query_options->__set_join_selectivity_correlation_factor(double_val);
+        break;
+      }
       default:
         if (IsRemovedQueryOption(key)) {
           LOG(WARNING) << "Ignoring attempt to set removed query option '" << 
key << "'";
diff --git a/be/src/service/query-options.h b/be/src/service/query-options.h
index 741888f84..2a721c9b2 100644
--- a/be/src/service/query-options.h
+++ b/be/src/service/query-options.h
@@ -50,7 +50,7 @@ typedef std::unordered_map<string, 
beeswax::TQueryOptionLevel::type>
 // time we add or remove a query option to/from the enum TImpalaQueryOptions.
 #define QUERY_OPTS_TABLE                                                       
          \
   DCHECK_EQ(_TImpalaQueryOptions_VALUES_TO_NAMES.size(),                       
          \
-      TImpalaQueryOptions::PROCESSING_COST_MIN_THREADS + 1);                   
          \
+      TImpalaQueryOptions::JOIN_SELECTIVITY_CORRELATION_FACTOR + 1);           
          \
   REMOVED_QUERY_OPT_FN(abort_on_default_limit_exceeded, 
ABORT_ON_DEFAULT_LIMIT_EXCEEDED) \
   QUERY_OPT_FN(abort_on_error, ABORT_ON_ERROR, TQueryOptionLevel::REGULAR)     
          \
   REMOVED_QUERY_OPT_FN(allow_unsupported_formats, ALLOW_UNSUPPORTED_FORMATS)   
          \
@@ -287,6 +287,8 @@ typedef std::unordered_map<string, 
beeswax::TQueryOptionLevel::type>
   QUERY_OPT_FN(                                                                
          \
       compute_processing_cost, COMPUTE_PROCESSING_COST, 
TQueryOptionLevel::ADVANCED)     \
   QUERY_OPT_FN(processing_cost_min_threads, PROCESSING_COST_MIN_THREADS,       
          \
+      TQueryOptionLevel::ADVANCED)                                             
          \
+  QUERY_OPT_FN(join_selectivity_correlation_factor, 
JOIN_SELECTIVITY_CORRELATION_FACTOR, \
       TQueryOptionLevel::ADVANCED);
 
 /// Enforce practical limits on some query options to avoid undesired query 
state.
diff --git a/common/thrift/ImpalaService.thrift 
b/common/thrift/ImpalaService.thrift
index b83708566..f4859fde1 100644
--- a/common/thrift/ImpalaService.thrift
+++ b/common/thrift/ImpalaService.thrift
@@ -774,6 +774,15 @@ enum TImpalaQueryOptions {
   // cost algorithm. It is recommend to not set it with value more than number 
of
   // physical cores in executor node. Valid values are in [1, 128]. Default to 
1.
   PROCESSING_COST_MIN_THREADS = 154;
+
+  // When calculating estimated join cardinality of 2 or more conjuncts
+  // e.g t1.a1 = t2.a2 AND t1.b1 = t2.b2, this selectivity correlation factor
+  // provides more control over the join cardinality estimation. The range is a
+  // double value between 0 to 1 inclusive. The default value of 0 preserves 
the
+  // existing behavior of using the minimum cardinality of the conjucts. 
Setting
+  // this to a value between 0 to 1 computes the combined selectivity by
+  // taking the product of the selectivities and dividing by this factor.
+  JOIN_SELECTIVITY_CORRELATION_FACTOR = 155;
 }
 
 // The summary of a DML statement.
diff --git a/common/thrift/Query.thrift b/common/thrift/Query.thrift
index b190f2f11..e04c20779 100644
--- a/common/thrift/Query.thrift
+++ b/common/thrift/Query.thrift
@@ -624,6 +624,9 @@ struct TQueryOptions {
 
   // See comment in ImpalaService.thrift
   155: optional i32 processing_cost_min_threads = 1;
+
+  // See comment in ImpalaService.thrift
+  156: optional double join_selectivity_correlation_factor = 0.0;
 }
 
 // Impala currently has three types of sessions: Beeswax, HiveServer2 and 
external
diff --git a/fe/src/main/java/org/apache/impala/planner/JoinNode.java 
b/fe/src/main/java/org/apache/impala/planner/JoinNode.java
index 02d39e991..005ff1e05 100644
--- a/fe/src/main/java/org/apache/impala/planner/JoinNode.java
+++ b/fe/src/main/java/org/apache/impala/planner/JoinNode.java
@@ -27,7 +27,6 @@ import org.apache.impala.analysis.AnalyticExpr;
 import org.apache.impala.analysis.Analyzer;
 import org.apache.impala.analysis.BinaryPredicate;
 import org.apache.impala.analysis.Expr;
-import org.apache.impala.analysis.FunctionCallExpr;
 import org.apache.impala.analysis.JoinOperator;
 import org.apache.impala.analysis.SlotDescriptor;
 import org.apache.impala.analysis.SlotRef;
@@ -254,15 +253,19 @@ public abstract class JoinNode extends PlanNode {
    * estimation is very similar to the generic estimator.
    *
    * Once the estimation method has been determined we compute the final 
cardinality
-   * based on the single most selective join predicate. We do not attempt to 
estimate
-   * the joint selectivity of multiple join predicates to avoid 
underestimation.
+   * based on the single most selective join predicate unless the query option
+   * join_selectivity_correlation_factor is set to a value higher than 0 (up 
to 1). In
+   * that case, we compute the combined selectivity by taking the product of 
the
+   * selectivities and dividing by the correlation factor. Otherwise, for the 
default
+   * value of 0 we do not attempt to estimate the joint selectivity of 
multiple join
+   * predicates to avoid underestimation.
    * The FK/PK detection logic is based on the assumption that most joins are 
FK/PK. We
    * only use the generic estimation method if we have high confidence that 
there is no
    * FK/PK relationship. In the absence of relevant stats, we assume FK/PK 
with a join
    * selectivity of 1.
    *
    * In some cases where a function is involved in the join predicate - e.g c 
= max(d),
-   * the RHS may have relevant stats. For instance if it is s scalar subquery, 
the RHS
+   * the RHS may have relevant stats. For instance if it is a scalar subquery, 
the RHS
    * NDV = 1. Whenever such stats are available, we classify them into an 
'other'
    * conjuncts list and leverage the available stats. We use the same 
estimation
    * formula as the generic estimator.
@@ -313,8 +316,9 @@ public abstract class JoinNode extends PlanNode {
     }
 
     if (eqJoinConjunctSlots.isEmpty()) {
-      if (!otherEqJoinStats.isEmpty() && joinOp_.isInnerJoin()) {
-        return getGenericJoinCardinality2(otherEqJoinStats, lhsCard, rhsCard);
+      if (!otherEqJoinStats.isEmpty()
+              && (joinOp_.isInnerJoin() || joinOp_.isOuterJoin())) {
+        return getGenericJoinCardinality2(otherEqJoinStats, lhsCard, rhsCard, 
analyzer);
       } else {
         // There are no eligible equi-join conjuncts. Optimistically assume 
FK/PK with a
         // join selectivity of 1.
@@ -326,7 +330,8 @@ public abstract class JoinNode extends PlanNode {
     if (fkPkEqJoinConjuncts_ != null) {
       return getFkPkJoinCardinality(fkPkEqJoinConjuncts_, lhsCard, rhsCard);
     } else {
-      return getGenericJoinCardinality(eqJoinConjunctSlots, lhsCard, rhsCard);
+      return getGenericJoinCardinality(eqJoinConjunctSlots, otherEqJoinStats, 
lhsCard,
+         rhsCard, analyzer);
     }
   }
 
@@ -397,25 +402,48 @@ public abstract class JoinNode extends PlanNode {
 
   /**
    * Returns the estimated join cardinality of a generic N:M inner or outer 
join based
-   * on the given list of equi-join conjunct slots and the join input 
cardinalities.
+   * on the given list of equi-join conjunct slots, other equi-join conjunct 
stats
+   * and the join input cardinalities.
    * The returned result is >= 0.
    * The list of join conjuncts must be non-empty and the cardinalities must 
be >= 0.
    */
   private long getGenericJoinCardinality(List<EqJoinConjunctScanSlots> 
eqJoinConjunctSlots,
-      long lhsCard, long rhsCard) {
+      List<NdvAndRowCountStats> otherEqJoinStats, long lhsCard, long rhsCard,
+      Analyzer analyzer) {
     Preconditions.checkState(joinOp_.isInnerJoin() || joinOp_.isOuterJoin());
     Preconditions.checkState(!eqJoinConjunctSlots.isEmpty());
 
-    long result = -1;
+    List<Long> joinCardList = new ArrayList<>();
+
+    // first collect all the join cardinalities
     for (EqJoinConjunctScanSlots slots: eqJoinConjunctSlots) {
-      long joinCard = getGenericJoinCardinalityInternal(slots.lhsNdv(), 
slots.rhsNdv(),
-          slots.lhsNumRows(), slots.rhsNumRows(), lhsCard, rhsCard);
+      joinCardList.add(getGenericJoinCardinalityInternal(slots.lhsNdv(), 
slots.rhsNdv(),
+          slots.lhsNumRows(), slots.rhsNumRows(), lhsCard, rhsCard));
+    }
+
+    if (!otherEqJoinStats.isEmpty()
+            && (joinOp_.isInnerJoin() || joinOp_.isOuterJoin())) {
+      joinCardList.add(getGenericJoinCardinality2(otherEqJoinStats, lhsCard,
+        rhsCard, analyzer));
+    }
+    long result = -1;
+    double corrfactor =
+      analyzer.getQueryOptions().getJoin_selectivity_correlation_factor();
+    double cumulativeSel = 1.0;
+
+    // Apply the selectivity formulas based on the query options or use default
+    for (Long joinCard : joinCardList) {
       if (result == -1) {
         result = joinCard;
       } else {
         result = Math.min(result, joinCard);
       }
+      if (corrfactor > 0) cumulativeSel *= (((double) 
joinCard/lhsCard)/rhsCard);
     }
+    if (corrfactor > 0) {
+      result = (long) Math.min(result, ((cumulativeSel * lhsCard) * 
rhsCard)/corrfactor);
+    }
+
     Preconditions.checkState(result >= 0);
     return result;
   }
@@ -445,15 +473,18 @@ public abstract class JoinNode extends PlanNode {
    * This function mirrors the logic for {@link 
JoinNode#getGenericJoinCardinality} except
    * that instead of the EqJoinConjunctScanSlots, it uses the {@link 
NdvAndRowCountStats}
    * to directly access stats that were pre-populated. Currently, this 
function is
-   * restricted to inner joins. In order to extend it to outer joins some more 
analysis is
-   * needed to ensure it works correctly for different types of outer joins.
+   * restricted to inner and outer joins.
+   * TODO: check if applicable for anti and semi joins
    */
   private long getGenericJoinCardinality2(List<NdvAndRowCountStats> statsList,
-      long lhsCard, long rhsCard) {
-    Preconditions.checkState(joinOp_.isInnerJoin());
+      long lhsCard, long rhsCard, Analyzer analyzer) {
+    Preconditions.checkState(joinOp_.isInnerJoin() || joinOp_.isOuterJoin());
     Preconditions.checkState(!statsList.isEmpty());
 
     long result = -1;
+    double corrfactor =
+      analyzer.getQueryOptions().getJoin_selectivity_correlation_factor();
+    double cumulativeSel = 1.0;
     for (NdvAndRowCountStats stats: statsList) {
       long joinCard = getGenericJoinCardinalityInternal(stats.lhsNdv(), 
stats.rhsNdv(),
           stats.lhsNumRows(), stats.rhsNumRows(), lhsCard, rhsCard);
@@ -462,7 +493,13 @@ public abstract class JoinNode extends PlanNode {
       } else {
         result = Math.min(result, joinCard);
       }
+      if (corrfactor > 0) cumulativeSel *= (((double) 
joinCard/lhsCard)/rhsCard);
     }
+
+    if (corrfactor > 0) {
+      result = (long) Math.min(result, ((cumulativeSel * lhsCard) * 
rhsCard)/corrfactor);
+    }
+
     Preconditions.checkState(result >= 0);
     return result;
   }
diff --git 
a/testdata/workloads/functional-planner/queries/PlannerTest/card-outer-join.test
 
b/testdata/workloads/functional-planner/queries/PlannerTest/card-outer-join.test
index d54e30733..27840b9eb 100644
--- 
a/testdata/workloads/functional-planner/queries/PlannerTest/card-outer-join.test
+++ 
b/testdata/workloads/functional-planner/queries/PlannerTest/card-outer-join.test
@@ -810,3 +810,225 @@ PLAN-ROOT SINK
    predicates: o.o_clerk < 'foo'
    row-size=47B cardinality=150.00K
 ====
+# IMPALA-12006 (1) Top join is outer join whose
+# child is inner join that produces duplicates on
+# the join key.
+# Actual runtime join cardinalities:
+# 05: HASH JOIN [LEFT OUTER JOIN] 2.01M
+# 04: HASH JOIN [INNER JOIN]  1.34K
+select * from
+tpch.orders a left outer join
+ (select a.o_clerk from tpch.orders a inner join
+  (select o_clerk, max(cast(o_orderdate as DATE)) max_date
+    from tpch.orders where o_orderdate < DATE '1998-01-01'
+    group by o_clerk) t1
+   on a.o_clerk = t1.o_clerk AND a.o_orderdate = max_date) t2
+on a.o_clerk = t2.o_clerk
+---- PLAN
+PLAN-ROOT SINK
+|
+05:HASH JOIN [LEFT OUTER JOIN]
+|  hash predicates: a.o_clerk = a.o_clerk
+|  row-size=251B cardinality=911.67M
+|
+|--04:HASH JOIN [INNER JOIN]
+|  |  hash predicates: a.o_orderdate = max(CAST(o_orderdate AS DATE)), 
a.o_clerk = o_clerk
+|  |  runtime filters: RF000 <- max(CAST(o_orderdate AS DATE)), RF001 <- 
o_clerk
+|  |  row-size=80B cardinality=611.43K
+|  |
+|  |--03:AGGREGATE [FINALIZE]
+|  |  |  output: max(CAST(o_orderdate AS DATE))
+|  |  |  group by: o_clerk
+|  |  |  row-size=31B cardinality=1.01K
+|  |  |
+|  |  02:SCAN HDFS [tpch.orders]
+|  |     HDFS partitions=1/1 files=1 size=162.56MB
+|  |     predicates: o_orderdate < DATE '1998-01-01'
+|  |     row-size=49B cardinality=150.00K
+|  |
+|  01:SCAN HDFS [tpch.orders a]
+|     HDFS partitions=1/1 files=1 size=162.56MB
+|     runtime filters: RF000 -> a.o_orderdate, RF001 -> a.o_clerk
+|     row-size=49B cardinality=1.50M
+|
+00:SCAN HDFS [tpch.orders a]
+   HDFS partitions=1/1 files=1 size=162.56MB
+   row-size=171B cardinality=1.50M
+====
+# IMPALA-12006 (2) Same as above but with the
+# join_selectivity_correlation_factor set to 1.0.
+# See (1) for actual runtime join cardinalities.
+select * from
+tpch.orders a left outer join
+ (select a.o_clerk from tpch.orders a inner join
+  (select o_clerk, max(cast(o_orderdate as DATE)) max_date
+    from tpch.orders where o_orderdate < DATE '1998-01-01'
+    group by o_clerk) t1
+   on a.o_clerk = t1.o_clerk AND a.o_orderdate = max_date) t2
+on a.o_clerk = t2.o_clerk
+---- QUERYOPTIONS
+join_selectivity_correlation_factor=1.0
+---- PLAN
+PLAN-ROOT SINK
+|
+05:HASH JOIN [LEFT OUTER JOIN]
+|  hash predicates: a.o_clerk = a.o_clerk
+|  row-size=251B cardinality=1.50M
+|
+|--04:HASH JOIN [INNER JOIN]
+|  |  hash predicates: a.o_orderdate = max(CAST(o_orderdate AS DATE)), 
a.o_clerk = o_clerk
+|  |  runtime filters: RF000 <- max(CAST(o_orderdate AS DATE)), RF001 <- 
o_clerk
+|  |  row-size=80B cardinality=607
+|  |
+|  |--03:AGGREGATE [FINALIZE]
+|  |  |  output: max(CAST(o_orderdate AS DATE))
+|  |  |  group by: o_clerk
+|  |  |  row-size=31B cardinality=1.01K
+|  |  |
+|  |  02:SCAN HDFS [tpch.orders]
+|  |     HDFS partitions=1/1 files=1 size=162.56MB
+|  |     predicates: o_orderdate < DATE '1998-01-01'
+|  |     row-size=49B cardinality=150.00K
+|  |
+|  01:SCAN HDFS [tpch.orders a]
+|     HDFS partitions=1/1 files=1 size=162.56MB
+|     runtime filters: RF000 -> a.o_orderdate, RF001 -> a.o_clerk
+|     row-size=49B cardinality=1.50M
+|
+00:SCAN HDFS [tpch.orders a]
+   HDFS partitions=1/1 files=1 size=162.56MB
+   row-size=171B cardinality=1.50M
+====
+# IMPALA-12006 (3) Fractional join_selectivity_correlation_factor
+# See (1) for actual runtime join cardinalitites.
+select * from
+tpch.orders a left outer join
+ (select a.o_clerk from tpch.orders a inner join
+  (select o_clerk, max(cast(o_orderdate as DATE)) max_date
+    from tpch.orders where o_orderdate < DATE '1998-01-01'
+    group by o_clerk) t1
+   on a.o_clerk = t1.o_clerk AND a.o_orderdate = max_date) t2
+on a.o_clerk = t2.o_clerk
+---- QUERYOPTIONS
+join_selectivity_correlation_factor=0.25
+---- PLAN
+PLAN-ROOT SINK
+|
+05:HASH JOIN [LEFT OUTER JOIN]
+|  hash predicates: a.o_clerk = a.o_clerk
+|  row-size=251B cardinality=3.62M
+|
+|--04:HASH JOIN [INNER JOIN]
+|  |  hash predicates: a.o_orderdate = max(CAST(o_orderdate AS DATE)), 
a.o_clerk = o_clerk
+|  |  runtime filters: RF000 <- max(CAST(o_orderdate AS DATE)), RF001 <- 
o_clerk
+|  |  row-size=80B cardinality=2.43K
+|  |
+|  |--03:AGGREGATE [FINALIZE]
+|  |  |  output: max(CAST(o_orderdate AS DATE))
+|  |  |  group by: o_clerk
+|  |  |  row-size=31B cardinality=1.01K
+|  |  |
+|  |  02:SCAN HDFS [tpch.orders]
+|  |     HDFS partitions=1/1 files=1 size=162.56MB
+|  |     predicates: o_orderdate < DATE '1998-01-01'
+|  |     row-size=49B cardinality=150.00K
+|  |
+|  01:SCAN HDFS [tpch.orders a]
+|     HDFS partitions=1/1 files=1 size=162.56MB
+|     runtime filters: RF000 -> a.o_orderdate, RF001 -> a.o_clerk
+|     row-size=49B cardinality=1.50M
+|
+00:SCAN HDFS [tpch.orders a]
+   HDFS partitions=1/1 files=1 size=162.56MB
+   row-size=171B cardinality=1.50M
+====
+# IMPALA-12006 (4) Similar to above but join is only on
+# the aggregate expressions, not on the base column.
+# Actual runtime cardinalities:
+# 05: HASH JOIN [LEFT OUTER JOIN] 172.52M
+# 04: HASH JOIN [INNER JOIN] 114.67K
+select * from
+tpch.orders a left outer join
+ (select a.o_clerk from tpch.orders a inner join
+  (select o_clerk, max(cast(o_orderdate as DATE)) max_date,
+    min(o_orderpriority) as min_priority
+    from tpch.orders where o_orderdate < DATE '1998-01-01'
+    group by o_clerk) t1
+   on a.o_orderdate = max_date and a.o_orderpriority = min_priority) t2
+on a.o_clerk = t2.o_clerk
+---- PLAN
+PLAN-ROOT SINK
+|
+05:HASH JOIN [LEFT OUTER JOIN]
+|  hash predicates: a.o_clerk = a.o_clerk
+|  row-size=283B cardinality=911.67M
+|
+|--04:HASH JOIN [INNER JOIN]
+|  |  hash predicates: a.o_orderdate = max(CAST(o_orderdate AS DATE)), 
a.o_orderpriority = min(o_orderpriority)
+|  |  runtime filters: RF000 <- max(CAST(o_orderdate AS DATE)), RF001 <- 
min(o_orderpriority)
+|  |  row-size=112B cardinality=611.43K
+|  |
+|  |--03:AGGREGATE [FINALIZE]
+|  |  |  output: max(CAST(o_orderdate AS DATE)), min(o_orderpriority)
+|  |  |  group by: o_clerk
+|  |  |  row-size=43B cardinality=1.01K
+|  |  |
+|  |  02:SCAN HDFS [tpch.orders]
+|  |     HDFS partitions=1/1 files=1 size=162.56MB
+|  |     predicates: o_orderdate < DATE '1998-01-01'
+|  |     row-size=69B cardinality=150.00K
+|  |
+|  01:SCAN HDFS [tpch.orders a]
+|     HDFS partitions=1/1 files=1 size=162.56MB
+|     runtime filters: RF000 -> a.o_orderdate, RF001 -> a.o_orderpriority
+|     row-size=69B cardinality=1.50M
+|
+00:SCAN HDFS [tpch.orders a]
+   HDFS partitions=1/1 files=1 size=162.56MB
+   row-size=171B cardinality=1.50M
+====
+# IMPALA-12006 (5) Similar to (4) above but with
+# join_selectivity_correlation_factor set to 1.0.
+# See (4) for actual runtime join cardinalities.
+select * from
+tpch.orders a left outer join
+ (select a.o_clerk from tpch.orders a inner join
+  (select o_clerk, max(cast(o_orderdate as DATE)) max_date,
+    min(o_orderpriority) as min_priority
+    from tpch.orders where o_orderdate < DATE '1998-01-01'
+    group by o_clerk) t1
+   on a.o_orderdate = max_date and a.o_orderpriority = min_priority) t2
+on a.o_clerk = t2.o_clerk
+---- QUERYOPTIONS
+join_selectivity_correlation_factor=1.0
+---- PLAN
+PLAN-ROOT SINK
+|
+05:HASH JOIN [LEFT OUTER JOIN]
+|  hash predicates: a.o_clerk = a.o_clerk
+|  row-size=283B cardinality=182.33M
+|
+|--04:HASH JOIN [INNER JOIN]
+|  |  hash predicates: a.o_orderdate = max(CAST(o_orderdate AS DATE)), 
a.o_orderpriority = min(o_orderpriority)
+|  |  runtime filters: RF000 <- max(CAST(o_orderdate AS DATE)), RF001 <- 
min(o_orderpriority)
+|  |  row-size=112B cardinality=122.28K
+|  |
+|  |--03:AGGREGATE [FINALIZE]
+|  |  |  output: max(CAST(o_orderdate AS DATE)), min(o_orderpriority)
+|  |  |  group by: o_clerk
+|  |  |  row-size=43B cardinality=1.01K
+|  |  |
+|  |  02:SCAN HDFS [tpch.orders]
+|  |     HDFS partitions=1/1 files=1 size=162.56MB
+|  |     predicates: o_orderdate < DATE '1998-01-01'
+|  |     row-size=69B cardinality=150.00K
+|  |
+|  01:SCAN HDFS [tpch.orders a]
+|     HDFS partitions=1/1 files=1 size=162.56MB
+|     runtime filters: RF000 -> a.o_orderdate, RF001 -> a.o_orderpriority
+|     row-size=69B cardinality=1.50M
+|
+00:SCAN HDFS [tpch.orders a]
+   HDFS partitions=1/1 files=1 size=162.56MB
+   row-size=171B cardinality=1.50M
+====

Reply via email to