Re: Review Request 68109: HIVE-20260 NDV of a column shouldn't be scaled when row count is changed by filter on another column

Ashutosh Chauhan Mon, 30 Jul 2018 11:38:52 -0700

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68109/#review206607
-----------------------------------------------------------





ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
Line 354 (original), 355 (patched)
<https://reviews.apache.org/r/68109/#comment289636>

    Add a comment here:
    We assume columns are uncorrelated. That is, filters on different columns
    will filter out different rows. So we scale down the NDV of a column only
    when the row count is decreased by its own filter. Under the correlated
    assumption, we would have scaled down the NDV of every column for every
    filter condition. We don't do that.
    This makes our estimate more conservative than it needs to be, which is
    good: when we are wrong we overestimate, but we avoid the OOM that the
    other assumption could have caused. In the future, we need to capture the
    correlatedness of columns in metadata so that we can account for it.
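
    A minimal sketch of the rule described in that comment, for illustration
    only. It is not the code from the patch: the class NdvScalingSketch, the
    method rescale, and the linear scaling by the surviving row fraction are
    all assumptions made for this example. Capping every NDV at the new row
    count is likewise an assumption, based on the fact that a column cannot
    have more distinct values than rows.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration; not taken from StatsRulesProcFactory. */
public class NdvScalingSketch {

    /**
     * Returns the estimated NDV per column after a filter reduces the row
     * count from oldNumRows to newNumRows. Only columns named in
     * filteredColumns (those the filter predicate refers to) are scaled.
     */
    public static Map<String, Long> rescale(Map<String, Long> ndvByColumn,
                                            Set<String> filteredColumns,
                                            long oldNumRows, long newNumRows) {
        // Fraction of rows that survive the filter.
        double ratio = oldNumRows > 0 ? (double) newNumRows / oldNumRows : 0.0;
        Map<String, Long> result = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : ndvByColumn.entrySet()) {
            long newNdv = e.getValue();
            if (filteredColumns.contains(e.getKey()) && ratio < 1.0) {
                // The column has its own filter: scale its NDV down
                // (linear scaling is an assumption of this sketch).
                newNdv = (long) Math.ceil(newNdv * ratio);
            }
            // Uncorrelated assumption: other columns keep their NDV, but no
            // column can have more distinct values than there are rows.
            newNdv = Math.min(newNdv, newNumRows);
            result.put(e.getKey(), newNdv);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> ndv = new LinkedHashMap<>();
        ndv.put("a", 100L);
        ndv.put("b", 50L);
        ndv.put("c", 400L);
        // A filter on column "a" keeps 250 of 1000 rows.
        System.out.println(rescale(ndv, Collections.singleton("a"), 1000L, 250L));
    }
}
```

    In this example, a filter on column a keeps 250 of 1000 rows. The NDV of
    a is scaled from 100 to 25, b keeps its NDV of 50, and c is capped from
    400 to 250 only because it cannot exceed the new row count.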



ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
Line 2633 (original), 2647 (patched)
<https://reviews.apache.org/r/68109/#comment289635>

    Add assert newNDV <= newNumRows.


- Ashutosh Chauhan


On July 30, 2018, 4:17 p.m., Zoltan Haindrich wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68109/
> -----------------------------------------------------------
> 
> (Updated July 30, 2018, 4:17 p.m.)
> 
> 
> Review request for hive and Ashutosh Chauhan.
> 
> 
> Bugs: HIVE-20260
>     https://issues.apache.org/jira/browse/HIVE-20260
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> * keep track of used columns, and only rescale affected columns
> * much more conservative than the old logic - possibly too much...
> * wip patch
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/AnnotateStatsProcCtx.java 47ee949fbcfa9391c640719a57fab39279c009db 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java 3c2b0854269d5426153958096a8b5b5ad3612c0f 
>   ql/src/test/queries/clientpositive/stat_estimate_drill.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/stat_estimate_related_col.q 52da2f759a009daa372a53446e2f0fd4a88152be 
>   ql/src/test/results/clientpositive/stat_estimate_drill.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/stat_estimate_related_col.q.out 669adafda3a45f7846face3d99817cd1b9cb3664 
> 
> 
> Diff: https://reviews.apache.org/r/68109/diff/1/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Zoltan Haindrich
> 
>
