konstantinb commented on code in PR #6244:
URL: https://github.com/apache/hive/pull/6244#discussion_r2771730406


##########
ql/src/java/org/apache/hadoop/hive/ql/stats/estimator/PessimisticStatCombiner.java:
##########
@@ -41,9 +41,15 @@ public void add(ColStatistics stat) {
       if (stat.getAvgColLen() > result.getAvgColLen()) {
         result.setAvgColLen(stat.getAvgColLen());
       }
-      if (stat.getCountDistint() > result.getCountDistint()) {
-        result.setCountDistint(stat.getCountDistint());
-      }
+
+      // NDVs can only be accurately combined if full information about columns, query branches and
+      // their relationships is available. Without that info, there is only one "truly conservative"
+      // value of NDV which is 0, which means that the NDV is unknown. It forces the optimizer
+      // to make the most conservative decisions possible, which is the exact goal of
+      // PessimisticStatCombiner. It does inflate statistics in multiple cases, but at the same time it
+      // also ensures that the query execution does not "blow up" due to too optimistic stats estimates
+      result.setCountDistint(0L);

Review Comment:
   Edit: per the PR feedback, this has been refined to set the NDV to
   "Unknown" (0) only if either of the values being combined is itself
   "Unknown", which results in much better estimates.
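
   The refined rule can be illustrated with a minimal, hypothetical sketch. It is not the code from the PR: the class `NdvCombineSketch`, the method `combineNdv`, and the constant `UNKNOWN_NDV` are made-up names, and falling back to the larger value when both NDVs are known is an assumption carried over from the lines the diff removes.

```java
/** Hypothetical illustration only; not the actual PessimisticStatCombiner code. */
final class NdvCombineSketch {

    /** The diff comment above treats an NDV of 0 as "unknown". */
    static final long UNKNOWN_NDV = 0L;

    private NdvCombineSketch() {
    }

    /**
     * Combines two NDV values pessimistically.
     * If either input is unknown, the result is unknown as well.
     */
    static long combineNdv(long current, long incoming) {
        if (current == UNKNOWN_NDV || incoming == UNKNOWN_NDV) {
            return UNKNOWN_NDV;
        }
        // Assumption: when both values are known, keep the larger one,
        // as the removed lines of the diff did.
        return Math.max(current, incoming);
    }
}
```

   Under these assumptions, `combineNdv(10, 25)` yields 25, while `combineNdv(10, 0)` yields 0 (unknown).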



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

