[ https://issues.apache.org/jira/browse/HIVE-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sergey Shelukhin updated HIVE-11794:
------------------------------------
    Description:
The code in Vectorizer is as such:
{noformat}
    boolean isMergePartial = (desc.getMode() != GroupByDesc.Mode.HASH);
{noformat}
then, if it's reduce side:
{noformat}
if (isMergePartial) {
  // Reduce Merge-Partial GROUP BY.
  // A merge-partial GROUP BY is fed by grouping by keys from reduce-shuffle. It is the
  // first (or root) operator for its reduce task.
  ....
} else {
  // Reduce Hash GROUP BY or global aggregation.
  ...
{noformat}
In fact, this logic is missing the COMPLETE mode. Both from the comment:
{noformat}
COMPLETE: complete 1-phase aggregation: iterate, terminate
...
HASH: For non-distinct the same as PARTIAL1 but use hash-table-based aggregation
...
PARTIAL1: partial aggregation - first phase: iterate, terminatePartial
{noformat}
and from the explain plan like this (the query has multiple stages of aggregations over a union; the mapper does a partial hash aggregation for each side of the union, which is then followed by mergepartial, and 2nd stage as complete):
{noformat}
Map Operator Tree:
...
  Group By Operator
    keys: _col0 (type: int), _col1 (type: int), _col2 (type: int), _col3 (type: int), _col4 (type: int), _col5 (type: bigint), _col6 (type: bigint), _col7 (type: bigint), _col8 (type: bigint), _col9 (type: bigint), _col10 (type: bigint), _col11 (type: bigint), _col12 (type: bigint)
    mode: hash
    outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
    Statistics: Num rows: 273117 Data size: 22941828 Basic stats: COMPLETE Column stats: PARTIAL
    Reduce Output Operator
...
feeding into
Reduce Operator Tree:
  Group By Operator
    keys: KEY._col0 (type: int), KEY._col1 (type: int), KEY._col2 (type: int), KEY._col3 (type: int), KEY._col4 (type: int), KEY._col5 (type: bigint), KEY._col6 (type: bigint), KEY._col7 (type: bigint), KEY._col8 (type: bigint), KEY._col9 (type: bigint), KEY._col10 (type: bigint), KEY._col11 (type: bigint), KEY._col12 (type: bigint)
    mode: mergepartial
    outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
    Group By Operator
      aggregations: sum(_col5), sum(_col6), sum(_col7), sum(_col8), sum(_col9), sum(_col10), sum(_col11), sum(_col12)
      keys: _col0 (type: int), _col1 (type: int), _col2 (type: int), _col3 (type: int), _col4 (type: int)
      mode: complete
      outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
{noformat}
It seems like COMPLETE is actually the global aggregation, and HASH isn't (or may not be).
So, it seems like reduce-side COMPLETE should be handled on the else-path of the above if. For map-side, it doesn't check mode at all as far as I can see.
Not sure if additional code changes are necessary after that, it may just work.

  was:
The code in Vectorizer is as such:
{noformat}
    boolean isMergePartial = (desc.getMode() != GroupByDesc.Mode.HASH);
{noformat}
then, if it's reduce side:
{noformat}
if (isMergePartial) {
  // Reduce Merge-Partial GROUP BY.
  // A merge-partial GROUP BY is fed by grouping by keys from reduce-shuffle. It is the
  // first (or root) operator for its reduce task.
  ....
} else {
  // Reduce Hash GROUP BY or global aggregation.
  ...
{noformat}
In fact, this logic is missing the COMPLETE mode. Both from the comment:
{noformat}
COMPLETE: complete 1-phase aggregation: iterate, terminate
...
HASH: For non-distinct the same as PARTIAL1 but use hash-table-based aggregation
...
PARTIAL1: partial aggregation - first phase: iterate, terminatePartial
{noformat}
and from the explain plan like this (the query has multiple stages of aggregations over a union; the mapper does a partial hash aggregation for each side of the union, which is then followed by mergepartial, and 2nd stage as complete):
{noformat}
Reduce Operator Tree:
  Group By Operator
    keys: KEY._col0 (type: int), KEY._col1 (type: int), KEY._col2 (type: int), KEY._col3 (type: int), KEY._col4 (type: int), KEY._col5 (type: bigint), KEY._col6 (type: bigint), KEY._col7 (type: bigint), KEY._col8 (type: bigint), KEY._col9 (type: bigint), KEY._col10 (type: bigint), KEY._col11 (type: bigint), KEY._col12 (type: bigint)
    mode: mergepartial
    outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
    Group By Operator
      aggregations: sum(_col5), sum(_col6), sum(_col7), sum(_col8), sum(_col9), sum(_col10), sum(_col11), sum(_col12)
      keys: _col0 (type: int), _col1 (type: int), _col2 (type: int), _col3 (type: int), _col4 (type: int)
      mode: complete
      outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
{noformat}
It seems like COMPLETE is actually the global aggregation, and HASH isn't (or may not be).
So, it seems like reduce-side COMPLETE should be handled on the else-path of the above if. For map-side, it doesn't check mode at all as far as I can see.
Not sure if additional code changes are necessary after that, it may just work.
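To make the branch selection described above concrete, here is a small standalone sketch. It is not the Vectorizer code: the Mode enum is a local stand-in whose constants are assumed to mirror GroupByDesc.Mode, and isMergePartialProposed is a hypothetical reading of "handle COMPLETE on the else-path", not a tested fix. It only shows which reduce-side branch each mode would select under the quoted check versus that reading.

```java
public class GroupByModeSketch {

    // Local stand-in for GroupByDesc.Mode; the constant list is an assumption.
    enum Mode { COMPLETE, PARTIAL1, PARTIAL2, PARTIALS, FINAL, HASH, MERGEPARTIAL }

    // The check quoted in the description: everything except HASH is
    // treated as merge-partial, which includes COMPLETE.
    static boolean isMergePartialCurrent(Mode mode) {
        return mode != Mode.HASH;
    }

    // Hypothetical variant: COMPLETE is also excluded, so it would take
    // the else-path ("Reduce Hash GROUP BY or global aggregation").
    static boolean isMergePartialProposed(Mode mode) {
        return mode != Mode.HASH && mode != Mode.COMPLETE;
    }

    public static void main(String[] args) {
        // Print the branch decision for every mode under both checks.
        for (Mode m : Mode.values()) {
            System.out.printf("%-12s current=%-5b proposed=%b%n",
                    m, isMergePartialCurrent(m), isMergePartialProposed(m));
        }
    }
}
```

Under the quoted check, COMPLETE and MERGEPARTIAL are indistinguishable; under the hypothetical variant only COMPLETE changes branch. Whether the else-path actually handles COMPLETE correctly is the open question in this issue.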
> GBY vectorization appears to process COMPLETE reduce-side GBY incorrectly
> -------------------------------------------------------------------------
>
>                 Key: HIVE-11794
>                 URL: https://issues.apache.org/jira/browse/HIVE-11794
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Matt McCline