Sergey Shelukhin created HIVE-11794:
---------------------------------------
Summary: GBY vectorization appears to process COMPLETE reduce-side GBY incorrectly
Key: HIVE-11794
URL: https://issues.apache.org/jira/browse/HIVE-11794
Project: Hive
Issue Type: Bug
Reporter: Sergey Shelukhin
Assignee: Matt McCline
The code in Vectorizer is as follows:
{noformat}
boolean isMergePartial = (desc.getMode() != GroupByDesc.Mode.HASH);
{noformat}
Then, on the reduce side:
{noformat}
if (isMergePartial) {
  // Reduce Merge-Partial GROUP BY.
  // A merge-partial GROUP BY is fed by grouping by keys from reduce-shuffle. It is the
  // first (or root) operator for its reduce task.
  ....
} else {
  // Reduce Hash GROUP BY or global aggregation.
  ...
{noformat}
In fact, these comments do not mention the COMPLETE mode. Both from the comment describing the modes:
{noformat}
COMPLETE: complete 1-phase aggregation: iterate, terminate
...
HASH: For non-distinct the same as PARTIAL1 but use hash-table-based aggregation
...
PARTIAL1: partial aggregation - first phase: iterate, terminatePartial
{noformat}
and from the explain plan like this (the query has multiple stages of
aggregations over a union; the mapper does a partial hash aggregation for each
side of the union, which is then followed by mergepartial, and 2nd stage as
complete):
{noformat}
Reduce Operator Tree:
Group By Operator
keys: KEY._col0 (type: int), KEY._col1 (type: int), KEY._col2 (type: int),
KEY._col3 (type: int), KEY._col4 (type: int), KEY._col5 (type: bigint),
KEY._col6 (type: bigint), KEY._col7 (type: bigint), KEY._col8 (type: bigint),
KEY._col9 (type: bigint), KEY._col10 (type: bigint), KEY._col11 (type: bigint),
KEY._col12 (type: bigint)
mode: mergepartial
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7,
_col8, _col9, _col10, _col11, _col12
Group By Operator
aggregations: sum(_col5), sum(_col6), sum(_col7), sum(_col8), sum(_col9),
sum(_col10), sum(_col11), sum(_col12)
keys: _col0 (type: int), _col1 (type: int), _col2 (type: int), _col3
(type: int), _col4 (type: int)
mode: complete
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6,
_col7, _col8, _col9, _col10, _col11, _col12
{noformat}
it seems like COMPLETE is actually the global aggregation, and HASH isn't (or
may not be).
So, it seems that reduce-side COMPLETE should be handled on the else-path of
the above if. On the map side, the code doesn't check the mode at all, as far as I can see.
I'm not sure whether additional code changes are necessary after that; it may just work.
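
To make the suggestion concrete, here is a minimal sketch of the mode check with COMPLETE routed to the else-path. The Mode enum below is a hypothetical stand-in for GroupByDesc.Mode that lists only the modes mentioned in this issue, and isMergePartialSuggested is an illustrative name, not existing Vectorizer code. Whether this change alone is sufficient is untested.

```java
public class GroupByModeSketch {

    // Hypothetical stand-in for GroupByDesc.Mode; only the modes named in this issue.
    public enum Mode { COMPLETE, HASH, PARTIAL1, MERGEPARTIAL }

    // The check as quoted from Vectorizer: everything except HASH is merge-partial.
    public static boolean isMergePartialCurrent(Mode mode) {
        return mode != Mode.HASH;
    }

    // Assumed alternative: COMPLETE also takes the else-path
    // ("Reduce Hash GROUP BY or global aggregation").
    public static boolean isMergePartialSuggested(Mode mode) {
        return mode != Mode.HASH && mode != Mode.COMPLETE;
    }

    public static void main(String[] args) {
        // Print how each mode is classified by the two checks.
        for (Mode mode : Mode.values()) {
            System.out.println(mode
                + " current=" + isMergePartialCurrent(mode)
                + " suggested=" + isMergePartialSuggested(mode));
        }
    }
}
```

With the current check, COMPLETE is classified as merge-partial; with the suggested one, only COMPLETE changes paths and the other modes are classified as before.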