okumin commented on code in PR #6244:
URL: https://github.com/apache/hive/pull/6244#discussion_r2767157356


##########
ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFWhen.java:
##########
@@ -144,6 +148,53 @@ static class WhenStatEstimator implements StatEstimator {
 
     @Override
     public Optional<ColStatistics> estimate(List<ColStatistics> argStats) {
+      return estimate(argStats, null);
+    }
+
+    @Override
+    public Optional<ColStatistics> estimate(List<ColStatistics> argStats, List<ExprNodeDesc> argExprs) {

Review Comment:
   I'd like to confirm my understanding. Please let me know if I'm overlooking something.
   Let's assume the number of distinct values (NDV) of `col_2` is 2, that of `col_100` is 100, and that of `col_999` is 999.
   
   The true NDV of the following expression is 3. The original implementation returns 1, and this implementation returns 3.
   
   ```sql
   CASE
     WHEN category BETWEEN 0 AND 4 THEN 'CODE_00'
     WHEN category BETWEEN 5 AND 9 THEN 'CODE_01'
     ELSE 'CODE_ELSE'
   END
   ```
   
   The true NDV of this expression is 2, because two branches share `'CODE_01'`. The original implementation returns 1, and this implementation returns 2.
   
   ```sql
   CASE
     WHEN category BETWEEN 0 AND 4 THEN 'CODE_00'
     WHEN category BETWEEN 5 AND 9 THEN 'CODE_01'
     ELSE 'CODE_01'
   END
   ```
   
   The true NDV of this expression is 100, 101, or 102, depending on whether `'CODE_00'` and `'CODE_01'` also occur in `col_100`. The original implementation returns 100, and this implementation returns 100.
   
   ```sql
   CASE
     WHEN category BETWEEN 0 AND 4 THEN 'CODE_00'
     WHEN category BETWEEN 5 AND 9 THEN 'CODE_01'
     ELSE col_100
   END
   ```
   
   The true NDV of this expression is between 999 and 1100, depending on how much `'CODE_00'`, `col_999`, and `col_100` overlap. The original implementation returns 999, and this implementation returns 999.
   
   ```sql
   CASE
     WHEN category BETWEEN 0 AND 4 THEN 'CODE_00'
     WHEN category BETWEEN 5 AND 9 THEN col_999
     ELSE col_100
   END
   ```
   
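   The true NDV of this expression is between 6 and 8, depending on whether the values of `col_2` coincide with the six constants. The original implementation returns 2, and this implementation returns 2.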
   
   ```sql
   CASE
     WHEN category BETWEEN 0 AND 4 THEN 'CODE_00'
     WHEN category BETWEEN 5 AND 9 THEN 'CODE_01'
     WHEN category BETWEEN 10 AND 14 THEN 'CODE_02'
     WHEN category BETWEEN 15 AND 19 THEN 'CODE_03'
     WHEN category BETWEEN 20 AND 24 THEN 'CODE_04'
     WHEN category BETWEEN 25 AND 29 THEN 'CODE_05'
     ELSE col_2
   END
   ```
   
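   In each case above, the patched estimate is at least as close to the true NDV as the original one, so I'd say the current patch does not make the estimation worse in any case.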
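
For reference, the arithmetic behind these numbers can be sketched in plain Java. This is only a model of the behavior described above, not the code in the PR: `CaseNdvSketch`, `Branch`, `originalEstimate`, `patchedEstimate`, and `trueNdvBounds` are hypothetical names, and the sketch assumes that the original estimator takes the largest branch NDV, that the patch counts distinct constants only when every branch is a constant, and that every branch is reachable and every value actually appears.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CaseNdvSketch {

  /** One THEN/ELSE branch: a string constant, or a column with a known NDV (hypothetical model). */
  static final class Branch {
    final String constant; // null when the branch is a column reference
    final long ndv;        // 1 for a constant, the column's NDV otherwise

    private Branch(String constant, long ndv) {
      this.constant = constant;
      this.ndv = ndv;
    }

    static Branch constant(String value) { return new Branch(value, 1); }
    static Branch column(long ndv) { return new Branch(null, ndv); }
    boolean isConstant() { return constant != null; }
  }

  /** Assumed original behavior: the largest NDV among the branches. */
  static long originalEstimate(List<Branch> branches) {
    long max = 0;
    for (Branch b : branches) {
      max = Math.max(max, b.ndv);
    }
    return max;
  }

  /** Assumed patched behavior: distinct constants if all branches are constants, else the original. */
  static long patchedEstimate(List<Branch> branches) {
    Set<String> constants = new HashSet<>();
    for (Branch b : branches) {
      if (!b.isConstant()) {
        return originalEstimate(branches);
      }
      constants.add(b.constant);
    }
    return constants.size();
  }

  /** Returns {lower, upper} for the true NDV under the assumptions stated in the lead-in. */
  static long[] trueNdvBounds(List<Branch> branches) {
    Set<String> constants = new HashSet<>();
    long maxColumnNdv = 0;
    long sumColumnNdv = 0;
    for (Branch b : branches) {
      if (b.isConstant()) {
        constants.add(b.constant);
      } else {
        maxColumnNdv = Math.max(maxColumnNdv, b.ndv);
        sumColumnNdv += b.ndv;
      }
    }
    // Lower bound: full overlap between branches. Upper bound: no overlap at all.
    long lower = Math.max(constants.size(), maxColumnNdv);
    long upper = constants.size() + sumColumnNdv;
    return new long[] {lower, upper};
  }

  public static void main(String[] args) {
    // The last example: six distinct constants plus ELSE col_2.
    List<Branch> branches = Arrays.asList(
        Branch.constant("CODE_00"), Branch.constant("CODE_01"), Branch.constant("CODE_02"),
        Branch.constant("CODE_03"), Branch.constant("CODE_04"), Branch.constant("CODE_05"),
        Branch.column(2));
    System.out.println("original=" + originalEstimate(branches)
        + " patched=" + patchedEstimate(branches)
        + " true=" + Arrays.toString(trueNdvBounds(branches)));
  }
}
```

Under this model the patched estimate never moves further from the true range than the original one: it differs only for all-constant branches, where it matches the true NDV exactly.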



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

