[jira] [Updated] (HIVE-11794) GBY vectorization appears to process COMPLETE reduce-side GBY incorrectly

Sergey Shelukhin (JIRA) Thu, 10 Sep 2015 17:52:10 -0700

     [ https://issues.apache.org/jira/browse/HIVE-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Sergey Shelukhin updated HIVE-11794:
------------------------------------
    Description: 
The code in Vectorizer is as follows:
{noformat}
    boolean isMergePartial = (desc.getMode() != GroupByDesc.Mode.HASH);
{noformat}
Then, on the reduce side:
{noformat}
    if (isMergePartial) {
        // Reduce Merge-Partial GROUP BY.
        // A merge-partial GROUP BY is fed by grouping by keys from reduce-shuffle.
        // It is the first (or root) operator for its reduce task.
....
      } else {
        // Reduce Hash GROUP BY or global aggregation.
...
{noformat}
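
To make the effect of that predicate concrete, here is a minimal, self-contained sketch. The nested Mode enum is a hypothetical stand-in for GroupByDesc.Mode that lists only the modes mentioned in this report, and the class name is invented for illustration; only the comparison against HASH is taken from the line quoted above.

```java
import java.util.EnumMap;
import java.util.Map;

class MergePartialPredicateSketch {
    // Hypothetical stand-in for GroupByDesc.Mode; only the modes named in
    // this report are listed here.
    enum Mode { COMPLETE, PARTIAL1, HASH, MERGEPARTIAL }

    // Same shape as the quoted Vectorizer line: every mode other than HASH
    // is treated as merge-partial.
    static boolean isMergePartial(Mode mode) {
        return mode != Mode.HASH;
    }

    // Evaluates the predicate for every mode in the stand-in enum.
    static Map<Mode, Boolean> classifyAll() {
        Map<Mode, Boolean> result = new EnumMap<>(Mode.class);
        for (Mode mode : Mode.values()) {
            result.put(mode, isMergePartial(mode));
        }
        return result;
    }

    public static void main(String[] args) {
        classifyAll().forEach((mode, mergePartial) ->
            System.out.println(mode + " -> "
                + (mergePartial ? "merge-partial branch" : "else branch")));
    }
}
```

With this predicate, HASH is the only mode that reaches the else branch; COMPLETE is sent down the merge-partial branch together with MERGEPARTIAL.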

In fact, this logic does not account for the COMPLETE mode. This follows both from the comment documenting the modes:
{noformat}
 COMPLETE: complete 1-phase aggregation: iterate, terminate
...
HASH: For non-distinct the same as PARTIAL1 but use hash-table-based aggregation
...
PARTIAL1: partial aggregation - first phase: iterate, terminatePartial
{noformat}

and from an explain plan like the following (the query has multiple stages of 
aggregation over a union; the mapper does a partial hash aggregation for each 
side of the union, which is followed by a mergepartial and then a second stage 
in complete mode):
{noformat}
Map Operator Tree:
...
        Group By Operator
          keys: _col0 (type: int), _col1 (type: int), _col2 (type: int), _col3 (type: int), _col4 (type: int), _col5 (type: bigint), _col6 (type: bigint), _col7 (type: bigint), _col8 (type: bigint), _col9 (type: bigint), _col10 (type: bigint), _col11 (type: bigint), _col12 (type: bigint)
          mode: hash
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
          Statistics: Num rows: 273117 Data size: 22941828 Basic stats: COMPLETE Column stats: PARTIAL
          Reduce Output Operator
...
feeding into

Reduce Operator Tree:
  Group By Operator
    keys: KEY._col0 (type: int), KEY._col1 (type: int), KEY._col2 (type: int), KEY._col3 (type: int), KEY._col4 (type: int), KEY._col5 (type: bigint), KEY._col6 (type: bigint), KEY._col7 (type: bigint), KEY._col8 (type: bigint), KEY._col9 (type: bigint), KEY._col10 (type: bigint), KEY._col11 (type: bigint), KEY._col12 (type: bigint)
    mode: mergepartial
    outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
    Group By Operator
      aggregations: sum(_col5), sum(_col6), sum(_col7), sum(_col8), sum(_col9), sum(_col10), sum(_col11), sum(_col12)
      keys: _col0 (type: int), _col1 (type: int), _col2 (type: int), _col3 (type: int), _col4 (type: int)
      mode: complete
      outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
{noformat}

It seems that COMPLETE is actually the global aggregation, and HASH isn't (or 
may not be).
So reduce-side COMPLETE should apparently be handled on the else path of the 
above if. On the map side, the Vectorizer does not check the mode at all, as 
far as I can see.
I am not sure whether additional code changes are necessary after that; it may 
just work.
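
As a rough illustration of that suggestion (not a tested patch), the sketch below routes COMPLETE to the else path alongside HASH. The Mode and Branch enums, the class name, and suggestedBranch are all hypothetical names made up for this example; Mode stands in for GroupByDesc.Mode and lists only the modes mentioned in this report.

```java
class ReduceGroupByRoutingSketch {
    // Hypothetical stand-in for GroupByDesc.Mode (subset named in this report).
    enum Mode { COMPLETE, PARTIAL1, HASH, MERGEPARTIAL }

    // Hypothetical labels for the two reduce-side paths of the if/else.
    enum Branch { MERGE_PARTIAL, HASH_OR_GLOBAL }

    // Assumption: the only change is that COMPLETE no longer counts as
    // merge-partial. All other modes keep their existing classification.
    static Branch suggestedBranch(Mode mode) {
        boolean isMergePartial = mode != Mode.HASH && mode != Mode.COMPLETE;
        return isMergePartial ? Branch.MERGE_PARTIAL : Branch.HASH_OR_GLOBAL;
    }

    public static void main(String[] args) {
        for (Mode mode : Mode.values()) {
            System.out.println(mode + " -> " + suggestedBranch(mode));
        }
    }
}
```

Whether the else path then handles COMPLETE correctly without further changes is exactly the open question above; the sketch only shows the routing.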

  was:
The code in Vectorizer is as follows:
{noformat}
    boolean isMergePartial = (desc.getMode() != GroupByDesc.Mode.HASH);
{noformat}
Then, on the reduce side:
{noformat}
    if (isMergePartial) {
        // Reduce Merge-Partial GROUP BY.
        // A merge-partial GROUP BY is fed by grouping by keys from reduce-shuffle.
        // It is the first (or root) operator for its reduce task.
....
      } else {
        // Reduce Hash GROUP BY or global aggregation.
...
{noformat}

In fact, this logic does not account for the COMPLETE mode. This follows both from the comment documenting the modes:
{noformat}
 COMPLETE: complete 1-phase aggregation: iterate, terminate
...
HASH: For non-distinct the same as PARTIAL1 but use hash-table-based aggregation
...
PARTIAL1: partial aggregation - first phase: iterate, terminatePartial
{noformat}

and from an explain plan like the following (the query has multiple stages of 
aggregation over a union; the mapper does a partial hash aggregation for each 
side of the union, which is followed by a mergepartial and then a second stage 
in complete mode):
{noformat}
Reduce Operator Tree:
  Group By Operator
    keys: KEY._col0 (type: int), KEY._col1 (type: int), KEY._col2 (type: int), KEY._col3 (type: int), KEY._col4 (type: int), KEY._col5 (type: bigint), KEY._col6 (type: bigint), KEY._col7 (type: bigint), KEY._col8 (type: bigint), KEY._col9 (type: bigint), KEY._col10 (type: bigint), KEY._col11 (type: bigint), KEY._col12 (type: bigint)
    mode: mergepartial
    outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
    Group By Operator
      aggregations: sum(_col5), sum(_col6), sum(_col7), sum(_col8), sum(_col9), sum(_col10), sum(_col11), sum(_col12)
      keys: _col0 (type: int), _col1 (type: int), _col2 (type: int), _col3 (type: int), _col4 (type: int)
      mode: complete
      outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
{noformat}

It seems that COMPLETE is actually the global aggregation, and HASH isn't (or 
may not be).
So reduce-side COMPLETE should apparently be handled on the else path of the 
above if. On the map side, the Vectorizer does not check the mode at all, as 
far as I can see.
I am not sure whether additional code changes are necessary after that; it may 
just work.


> GBY vectorization appears to process COMPLETE reduce-side GBY incorrectly
> -------------------------------------------------------------------------
>
>                 Key: HIVE-11794
>                 URL: https://issues.apache.org/jira/browse/HIVE-11794
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Matt McCline


