[jira] [Commented] (HIVE-20660) Group by statistics estimation could be improved by bounding the total number of rows to source table

Hive QA (JIRA) Wed, 10 Oct 2018 15:00:21 -0700


    [ https://issues.apache.org/jira/browse/HIVE-20660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16645596#comment-16645596 ]


Hive QA commented on HIVE-20660:
--------------------------------



Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12943062/HIVE-20660.6.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 4 failed/errored test(s), 15065 tests executed
*Failed tests:*
{noformat}
TestAutoPurgeTables - did not produce a TEST-*.xml file (likely timed out) (batchId=244)
TestLocationQueries - did not produce a TEST-*.xml file (likely timed out) (batchId=244)
TestMiniDruidCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=194)
        [druidmini_dynamic_partition.q,druidmini_test_ts.q,druidmini_expressions.q,druidmini_test_alter.q,druidmini_test_insert.q]
org.apache.hadoop.hive.ql.exec.spark.TestSparkSessionTimeout.testMultiSessionSparkSessionTimeout (batchId=246)
{noformat}

Test results: 
https://builds.apache.org/job/PreCommit-HIVE-Build/14355/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/14355/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-14355/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 4 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12943062 - PreCommit-HIVE-Build

> Group by statistics estimation could be improved by bounding the total number 
> of rows to source table
> -----------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-20660
>                 URL: https://issues.apache.org/jira/browse/HIVE-20660
>             Project: Hive
>          Issue Type: Improvement
>          Components: Statistics
>    Affects Versions: 4.0.0
>            Reporter: Vineet Garg
>            Assignee: Vineet Garg
>            Priority: Major
>         Attachments: HIVE-20660.1.patch, HIVE-20660.2.patch, 
> HIVE-20660.3.patch, HIVE-20660.4.patch, HIVE-20660.5.patch, HIVE-20660.6.patch
>
>
> Currently, the row count for a group by is estimated by taking the product 
> of the NDVs of all the keys and bounding it by the number of rows of its 
> input. This bound could be improved by using the source table instead of the 
> immediate input. The insight is that the cardinality/NDVs of a table cannot 
> go beyond the original (outer joins will only add NULLs, thereby increasing 
> the cardinality by 1).
> Note that the assumption here is that the group by keys all belong to the 
> same source table/input.
> This will improve the estimate in situations where a group by is executed 
> after joins, where Hive could end up overestimating the number of rows.
> *Reproducer*
> {code:sql}
> set hive.stats.fetch.column.stats=true;
> create table t1(i int, j int);
> alter table t1 update statistics set('numRows'='10000', 
> 'rawDataSize'='18000');
> alter table t1 update statistics for column i 
> set('numDVs'='2500','numNulls'='50','highValue'='1000','lowValue'='0');
> alter table t1 update statistics for column j 
> set('numDVs'='500','numNulls'='30','highValue'='100','lowValue'='50');
> create table t2(i2 int, j2 int);
> alter table t2 update statistics set('numRows'='100000000', 
> 'rawDataSize'='10000');
> alter table t2 update statistics for column i2 
> set('numDVs'='10000000','numNulls'='0','highValue'='8000','lowValue'='0');
> alter table t2 update statistics for column j2 
> set('numDVs'='10','numNulls'='0','highValue'='800','lowValue'='-1');
> explain select count (1) from t1,t2
>       where t1.j=t2.i2 
> group by t1.i, t1.j;
> {code}
> {code:sql}
> Reducer 2
>             Reduce Operator Tree:
>               Merge Join Operator
>                 condition map:
>                      Inner Join 0 to 1
>                 keys:
>                   0 _col1 (type: int)
>                   1 _col0 (type: int)
>                 outputColumnNames: _col0, _col1
>                 Statistics: Num rows: 99700 Data size: 797288 Basic stats: COMPLETE Column stats: COMPLETE
>                 Group By Operator
>                   aggregations: count()
>                   keys: _col0 (type: int), _col1 (type: int)
>                   mode: hash
>                   outputColumnNames: _col0, _col1, _col2
>                   Statistics: Num rows: 49850 Data size: 797448 Basic stats: COMPLETE Column stats: COMPLETE <==========
>                   Reduce Output Operator
>                     key expressions: _col0 (type: int), _col1 (type: int)
>                     sort order: ++
>                     Map-reduce partition columns: _col0 (type: int), _col1 (type: int)
>                     Statistics: Num rows: 49850 Data size: 797448 Basic stats: COMPLETE Column stats: COMPLETE
>                     value expressions: _col2 (type: bigint)
>   .....................
> {code}
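
The bound described in the issue can be sketched as follows. This is a minimal illustration, not Hive's implementation: the function names `current_estimate` and `proposed_estimate` are hypothetical, and the sketch ignores the outer-join NULL adjustment and any further adjustments Hive applies (the plan above reports 49850 rows for the hash-mode Group By, which is half of the 99700 join rows; the sketch does not model that). The inputs are taken from the reproducer: NDVs of 2500 and 500 for t1.i and t1.j, 99700 rows out of the join, and 10000 rows in t1.

```python
from math import prod


def current_estimate(key_ndvs, input_rows):
    """Product of the key NDVs, bounded by the rows of the immediate input."""
    return min(prod(key_ndvs), input_rows)


def proposed_estimate(key_ndvs, input_rows, source_rows):
    """Same estimate, additionally bounded by the rows of the source table.

    Assumes all group by keys come from the same source table, as the
    issue description requires.
    """
    return min(current_estimate(key_ndvs, input_rows), source_rows)


# Values from the reproducer (t1.i and t1.j are the group by keys).
key_ndvs = [2500, 500]   # numDVs of t1.i and t1.j
join_rows = 99700        # Num rows reported for the Merge Join Operator
t1_rows = 10000          # numRows set on t1

print(current_estimate(key_ndvs, join_rows))            # 99700
print(proposed_estimate(key_ndvs, join_rows, t1_rows))  # 10000
```

Under these assumptions the NDV product (1,250,000) is far above both bounds, so the estimate is decided entirely by which row count is used as the cap.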



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
