[jira] [Commented] (HIVE-20660) Group by statistics estimation could be improved by bounding the total number of rows to source table

Hive QA (JIRA) Thu, 04 Oct 2018 06:10:10 -0700


    [ https://issues.apache.org/jira/browse/HIVE-20660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16638217#comment-16638217 ]


Hive QA commented on HIVE-20660:
--------------------------------



Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12942300/HIVE-20660.3.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 21 failed/errored test(s), 15010 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[stat_estimate_related_col] (batchId=43)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udaf_histogram_numeric] (batchId=14)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[dynamic_partition_pruning_2] (batchId=154)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[dynamic_semijoin_user_level] (batchId=155)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[acid_no_buckets] (batchId=176)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[dpp] (batchId=168)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[dynamic_partition_pruning] (batchId=165)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[dynamic_semijoin_reduction] (batchId=171)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[groupby_groupingset_bug] (batchId=181)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[llap_partitioned] (batchId=163)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[materialized_view_rewrite_no_join_opt_2] (batchId=179)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[materialized_view_rewrite_part_1] (batchId=162)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[materialized_view_rewrite_part_2] (batchId=158)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[ptf] (batchId=162)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[reopt_dpp] (batchId=177)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[tez_input_counters] (batchId=179)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vectorization_part] (batchId=165)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vectorization_part_varchar] (batchId=169)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vectorized_dynamic_partition_pruning] (batchId=166)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vectorized_ptf] (batchId=175)
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[vector_join_part_col_char] (batchId=108)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/14225/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/14225/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-14225/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 21 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12942300 - PreCommit-HIVE-Build

> Group by statistics estimation could be improved by bounding the total number 
> of rows to source table
> -----------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-20660
>                 URL: https://issues.apache.org/jira/browse/HIVE-20660
>             Project: Hive
>          Issue Type: Improvement
>          Components: Statistics
>    Affects Versions: 4.0.0
>            Reporter: Vineet Garg
>            Assignee: Vineet Garg
>            Priority: Major
>         Attachments: HIVE-20660.1.patch, HIVE-20660.2.patch, 
> HIVE-20660.3.patch
>
>
> Currently, the row count for a group by is estimated by taking the product
> of the NDVs of all the keys and bounding it by the number of rows of its
> input. This bound could be tightened by using the source table instead of
> the immediate input. The insight is that the cardinality/NDVs of a table
> cannot grow beyond those of the original (outer joins only add NULLs,
> thereby increasing the cardinality by 1).
> Note that this assumes all the group by keys belong to the same source
> table/input.
> This will improve the estimate in situations where a group by is executed
> after joins, where Hive could end up overestimating the number of rows.
> *Reproducer*
> {code:sql}
> set hive.stats.fetch.column.stats=true;
> create table t1(i int, j int);
> alter table t1 update statistics set('numRows'='10000', 'rawDataSize'='18000');
> alter table t1 update statistics for column i set('numDVs'='2500','numNulls'='50','highValue'='1000','lowValue'='0');
> alter table t1 update statistics for column j set('numDVs'='500','numNulls'='30','highValue'='100','lowValue'='50');
> create table t2(i2 int, j2 int);
> alter table t2 update statistics set('numRows'='100000000', 'rawDataSize'='10000');
> alter table t2 update statistics for column i2 set('numDVs'='10000000','numNulls'='0','highValue'='8000','lowValue'='0');
> alter table t2 update statistics for column j2 set('numDVs'='10','numNulls'='0','highValue'='800','lowValue'='-1');
> explain select count(1) from t1, t2
>       where t1.j = t2.i2
> group by t1.i, t1.j;
> {code}
> {code:sql}
> Reducer 2
>             Reduce Operator Tree:
>               Merge Join Operator
>                 condition map:
>                      Inner Join 0 to 1
>                 keys:
>                   0 _col1 (type: int)
>                   1 _col0 (type: int)
>                 outputColumnNames: _col0, _col1
>                 Statistics: Num rows: 99700 Data size: 797288 Basic stats: COMPLETE Column stats: COMPLETE
>                 Group By Operator
>                   aggregations: count()
>                   keys: _col0 (type: int), _col1 (type: int)
>                   mode: hash
>                   outputColumnNames: _col0, _col1, _col2
>                   Statistics: Num rows: 49850 Data size: 797448 Basic stats: COMPLETE Column stats: COMPLETE <==========
>                   Reduce Output Operator
>                     key expressions: _col0 (type: int), _col1 (type: int)
>                     sort order: ++
>                     Map-reduce partition columns: _col0 (type: int), _col1 (type: int)
>                     Statistics: Num rows: 49850 Data size: 797448 Basic stats: COMPLETE Column stats: COMPLETE
>                     value expressions: _col2 (type: bigint)
>   .....................
> {code}
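
The bound described in the issue can be sketched as follows. This is a minimal Python illustration, not Hive's implementation: the function name estimate_group_by_rows and its parameters are hypothetical, and it models only the NDV product and the two row-count bounds. The plan above reports 49850 rows for the Group By Operator (half of the join's 99700), so Hive evidently applies further adjustments in hash mode that this sketch leaves out.

```python
from math import prod


def estimate_group_by_rows(key_ndvs, input_rows, source_rows=None, outer_join=False):
    """Simplified group-by row estimate (hypothetical helper, not a Hive API).

    key_ndvs    -- NDVs of the group-by keys, all assumed to come from one source table
    input_rows  -- estimated rows of the operator feeding the group by
    source_rows -- row count of the source table, if known
    outer_join  -- True if an outer join may have added a NULL group
    """
    # Current behaviour per the issue: product of NDVs, bounded by the input rows.
    estimate = min(prod(key_ndvs), input_rows)

    # Proposed tightening: distinct key combinations cannot exceed the
    # source table's row count (plus one NULL group after an outer join).
    if source_rows is not None:
        bound = source_rows + (1 if outer_join else 0)
        estimate = min(estimate, bound)
    return estimate


# Statistics taken from the reproducer: NDV(t1.i)=2500, NDV(t1.j)=500,
# join output estimated at 99700 rows, t1 has 10000 rows.
current = estimate_group_by_rows([2500, 500], input_rows=99700)
proposed = estimate_group_by_rows([2500, 500], input_rows=99700, source_rows=10000)
print(current, proposed)
```

With the reproducer's statistics, the NDV product is 2500 * 500 = 1,250,000, the join output is estimated at 99,700 rows, and t1 holds 10,000 rows, so bounding by the source table lowers the estimate from 99,700 to 10,000 in this simplified model.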



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
