Gopal V created HIVE-7156: ----------------------------- Summary: Group-By operator stat-annotation only uses distinct approx to generate rollups Key: HIVE-7156 URL: https://issues.apache.org/jira/browse/HIVE-7156 Project: Hive Issue Type: Bug Reporter: Gopal V
The stats annotation for a group-by only annotates the reduce-side row-count with the distinct values. The map-side gets the row-count as the rows output instead of distinct * parallelism, while the reducer side gets the correct parallelism. {code} hive> explain select distinct L_SHIPDATE from lineitem; Vertices: Map 1 Map Operator Tree: TableScan alias: lineitem Statistics: Num rows: 5999989709 Data size: 4745677733354 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: l_shipdate (type: string) outputColumnNames: l_shipdate Statistics: Num rows: 5999989709 Data size: 4745677733354 Basic stats: COMPLETE Column stats: COMPLETE Group By Operator keys: l_shipdate (type: string) mode: hash outputColumnNames: _col0 Statistics: Num rows: 5999989709 Data size: 563999032646 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator key expressions: _col0 (type: string) sort order: + Map-reduce partition columns: _col0 (type: string) Statistics: Num rows: 5999989709 Data size: 563999032646 Basic stats: COMPLETE Column stats: COMPLETE Execution mode: vectorized Reducer 2 Reduce Operator Tree: Group By Operator keys: KEY._col0 (type: string) mode: mergepartial outputColumnNames: _col0 Statistics: Num rows: 1955 Data size: 183770 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: _col0 (type: string) outputColumnNames: _col0 Statistics: Num rows: 1955 Data size: 183770 Basic stats: COMPLETE Column stats: COMPLETE {code} -- This message was sent by Atlassian JIRA (v6.2#6252)