[ 
https://issues.apache.org/jira/browse/HIVE-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562680#comment-16562680
 ] 

Ashutosh Chauhan commented on HIVE-20260:
-----------------------------------------

I am not sure this mechanism of {{affectedColumns}} will help. IIUC, effect of 
this is that column stats will get updated only for columns involved in filter 
expression. Columns which are part of filter operator but are absent from 
expressions will have their column stats unchanged. If so, that is not what we 
want. What we want is to update stats for all columns of filter operator which 
is the case before this patch but change the logic of how they are updated. 
Before the patch, every condition of filter results in scale down of col stats. 
That is we assumed each filter condition independently filters out diff rows. 
However, If instead we assume that diff filter conditions filter overlapping 
rows than we scale down col stats of column involved in condition only with row 
count decreased by it. Columns not involved in condition we can scale down col 
stats by max decrease of one of conditions.
Perhaps, an example will help. Lets say we have :
{code}
create table t1 (a int, b int, c int);
-- insert 500 rows. ndv(a) = 100 ndv(b) = 50 ndv(c) = 200
select a,b,c from t1 where a = 20 and b = 30;
{code}

Here, when we process a = 20; 
rowcount = 500 / 2 = 250. ndv(a) = 100 * (250/500) = 50. ndv(b) = 50 ndv(c) = 
200. b and c's ndv unchanged.
Then we process b = 30.
rowcount = 250/2 = 125. ndv(b) = 50 * (125/250) = 25. ndv(a) = 50 ndv(c) = 200. 
a and c's ndv unchanged.

For b and c we are done since we updated their column stats. For c (columns not 
included in filter condition) we updated with largest factor change brought. 
Here that means 200 * (1/2) = 100

Logic before this patch would have resulted in change of (125/500) = 1/4 ndv 
for every column. 

Apart from above, second issue is this scaling happen twice: once when filter 
expression is processed [1] and then when operator stats are updated[2] . That 
looks incorrect we should perhaps remove one of these calls.
[1] : 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L355
[2] : 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L299

> NDV of a column shouldn't be scaled when row count is changed by filter on 
> another column
> -----------------------------------------------------------------------------------------
>
>                 Key: HIVE-20260
>                 URL: https://issues.apache.org/jira/browse/HIVE-20260
>             Project: Hive
>          Issue Type: Improvement
>          Components: Statistics
>            Reporter: Ashutosh Chauhan
>            Assignee: Zoltan Haindrich
>            Priority: Major
>         Attachments: HIVE-20260.01wip01.patch, HIVE-20260.01wip02.patch
>
>
> HIVE-17465 introduced progressive scaling of rowcounts in presence of 
> multiple filters. HIVE-19500 improved on that by also scaling col stats (NDV) 
> in such scenario. However, it should pay attention to column used in filter 
> expression and not scale for all filters. eg.,
> consider filter a = 1 and b = 2 ndv of column b should not be scaled down by 
> row count changes caused by a = 1
> Other way to say this that ndv of a particular column should be updated at 
> the end of computation of row count for that operator.
> Here are the possible cases where our estimates can be accurate (or close to)
> {code}
> case 1 - (d_year = 2001 and d_moy=1)
> case 2 - (d_year = 2001 and d_year IN (2001, 2002))
> case 3 - (d_year = 2001 and d_moy = 1 and d_dom = 1)
> case 4 - (d_date IN ('1999-01-02', '1999-01-02'))
> case 5 - (d_date = '1999-01-01')
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to