[ https://issues.apache.org/jira/browse/HIVE-23485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Stamatis Zampetakis updated HIVE-23485: --------------------------------------- Fix Version/s: 4.0.0 Release Note: Fixed in [47bc287f9dbc22f425ffb1968c393dc842145082|https://github.com/apache/hive/commit/47bc287f9dbc22f425ffb1968c393dc842145082]. Thanks for the review @kgyrtkirk! Resolution: Fixed Status: Resolved (was: Patch Available) > Bound GroupByOperator stats using largest NDV among columns > ----------------------------------------------------------- > > Key: HIVE-23485 > URL: https://issues.apache.org/jira/browse/HIVE-23485 > Project: Hive > Issue Type: Improvement > Reporter: Stamatis Zampetakis > Assignee: Stamatis Zampetakis > Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-23485.01.patch, HIVE-23485.02.patch > > Time Spent: 50m > Remaining Estimate: 0h > > Consider the following SQL query: > {code:sql} > select id, name from person group by id, name; > {code} > and assume that the person table contains the following tuples: > {code:sql} > insert into person values (0, 'A') ; > insert into person values (1, 'A') ; > insert into person values (2, 'B') ; > insert into person values (3, 'B') ; > insert into person values (4, 'B') ; > insert into person values (5, 'C') ; > {code} > If we know the number of distinct values (NDV) for all columns in the group > by clause then we can infer a lower bound for the total number of rows by > taking the maximun NDV of the involved columns. > Currently the query in the scenario above has the following plan: > {noformat} > Vertex dependency in root stage > Reducer 2 <- Map 1 (SIMPLE_EDGE) > Stage-0 > Fetch Operator > limit:-1 > Stage-1 > Reducer 2 vectorized > File Output Operator [FS_11] > Group By Operator [GBY_10] (rows=3 width=92) > Output:["_col0","_col1"],keys:KEY._col0, KEY._col1 > <-Map 1 [SIMPLE_EDGE] vectorized > SHUFFLE [RS_9] > PartitionCols:_col0, _col1 > Group By Operator [GBY_8] (rows=3 width=92) > Output:["_col0","_col1"],keys:id, name > Select Operator [SEL_7] (rows=6 width=92) > Output:["id","name"] > TableScan [TS_0] (rows=6 width=92) > > default@person,person,Tbl:COMPLETE,Col:COMPLETE,Output:["id","name"]{noformat} > Observe that the stats for group by report 3 rows but given that the ID > attribute is part of the aggregation the rows cannot be less than 6. -- This message was sent by Atlassian Jira (v8.3.4#803005)