Re: Review Request 68109: HIVE-20260 NDV of a column shouldn't be scaled when row count is changed by filter on another column

Zoltan Haindrich Tue, 31 Jul 2018 11:18:43 -0700


> On July 30, 2018, 6:38 p.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
> > Line 354 (original), 355 (patched)
> > <https://reviews.apache.org/r/68109/diff/1/?file=2065346#file2065346line357>
> >
> >     add a comment here:
> >     We assume columns are uncorrelated. That is, filters on different 
> > columns will result in filtering out different rows. So, we scale down the 
> > ndv of a column only when row count is decreased by its own filter. Under 
> > the correlated assumption, we would have scaled down ndv for every column 
> > for every filter condition. We don't do that. 
> >     This makes our estimate more conservative than it needs to be, which is 
> > good: we overestimate when we are wrong, but we avoid the OOM we would 
> > risk had we chosen the other assumption. In the future, we need to capture 
> > correlatedness of columns in metadata so that we can account for that.


Added a comment about it.
Yes, I agree that capturing correlations between different columns would be 
good, but there are around `|columns|**2` of them... I think Calcite has some 
tools for this.
However, I feel that the current calculation is too numRows-centric, which 
makes it a little hard to keep track of the columns and to provide correct 
estimation logic for them.
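
For readers following the thread, the idea in the comment above can be sketched roughly as follows. This is a minimal illustration, not the code from StatsRulesProcFactory: the class name, the method name, the map-based representation, and the linear scaling by the row-count ratio are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; names and the linear scaling rule are assumptions,
// not the actual Hive implementation.
final class NdvScalingSketch {

    /**
     * Returns per-column NDVs after a filter reduced the row count from
     * oldNumRows to newNumRows. Only columns referenced by the filter are
     * scaled; every other column keeps its NDV (uncorrelated assumption).
     */
    static Map<String, Long> rescale(Map<String, Long> ndvByColumn,
                                     Set<String> filterColumns,
                                     long oldNumRows,
                                     long newNumRows) {
        // Never scale up, and avoid dividing by zero on empty inputs.
        double ratio = oldNumRows <= 0
                ? 1.0
                : Math.min(1.0, (double) newNumRows / oldNumRows);
        Map<String, Long> result = new HashMap<>();
        for (Map.Entry<String, Long> entry : ndvByColumn.entrySet()) {
            long ndv = entry.getValue();
            if (ndv > 0 && filterColumns.contains(entry.getKey())) {
                // Scale proportionally, but keep at least one distinct value.
                ndv = Math.max(1L, (long) Math.ceil(ndv * ratio));
            }
            result.put(entry.getKey(), ndv);
        }
        return result;
    }
}
```

With a filter on column `a` only, a drop from 1000 to 250 rows would scale the NDV of `a` by a quarter and leave the NDV of every other column untouched.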


> On July 30, 2018, 6:38 p.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
> > Line 2633 (original), 2647 (patched)
> > <https://reviews.apache.org/r/68109/diff/1/?file=2065346#file2065346line2653>
> >
> >     Add assert newNDV <= newNumRows.

Actually, that assert would fail; I added code to clamp NDV to maxRows instead.
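
A plausible reason the assert cannot hold: a column that is not referenced by the filter keeps its old NDV while the row count shrinks, so its NDV can end up larger than the new row count. A minimal sketch of the clamp, assuming a hypothetical helper name and plain `long` inputs (the patch itself may do this differently):

```java
// Hypothetical sketch of the clamp; not the code from the patch.
final class NdvClampSketch {

    /** Caps the NDV at the row count, since distinct values cannot exceed rows. */
    static long clampNdv(long ndv, long numRows) {
        return Math.max(0L, Math.min(ndv, numRows));
    }
}
```

For example, an unscaled NDV of 50 on a relation estimated at 10 rows would be reported as 10.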


- Zoltan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68109/#review206607
-----------------------------------------------------------


On July 30, 2018, 4:17 p.m., Zoltan Haindrich wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68109/
> -----------------------------------------------------------
> 
> (Updated July 30, 2018, 4:17 p.m.)
> 
> 
> Review request for hive and Ashutosh Chauhan.
> 
> 
> Bugs: HIVE-20260
>     https://issues.apache.org/jira/browse/HIVE-20260
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> * keep track of the used columns, and only rescale the affected columns
> * much more conservative than the old logic - possibly too much...
> * wip patch
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/AnnotateStatsProcCtx.java 47ee949fbcfa9391c640719a57fab39279c009db 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java 3c2b0854269d5426153958096a8b5b5ad3612c0f 
>   ql/src/test/queries/clientpositive/stat_estimate_drill.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/stat_estimate_related_col.q 52da2f759a009daa372a53446e2f0fd4a88152be 
>   ql/src/test/results/clientpositive/stat_estimate_drill.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/stat_estimate_related_col.q.out 669adafda3a45f7846face3d99817cd1b9cb3664 
> 
> 
> Diff: https://reviews.apache.org/r/68109/diff/1/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Zoltan Haindrich
> 
>
