[jira] [Commented] (HIVE-6120) Add GroupBy optimization to eliminate un-needed partial distinct aggregations

Hive QA (JIRA) Sun, 29 Dec 2013 07:42:31 -0800

    [ 
https://issues.apache.org/jira/browse/HIVE-6120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858362#comment-13858362
 ]


Hive QA commented on HIVE-6120:
-------------------------------



{color:red}Overall{color}: -1, at least one test failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12620770/HIVE-6120.1.patch

{color:red}ERROR:{color} -1 due to 3 failed/errored test(s), 4818 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_sort_8
org.apache.hadoop.hive.ql.parse.TestParse.testParse_groupby2
org.apache.hadoop.hive.ql.parse.TestParse.testParse_groupby3
{noformat}

Test results: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/764/testReport
Console output: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/764/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 3 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12620770

> Add GroupBy optimization to eliminate un-needed partial distinct aggregations
> -----------------------------------------------------------------------------
>
>                 Key: HIVE-6120
>                 URL: https://issues.apache.org/jira/browse/HIVE-6120
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Sun Rui
>            Assignee: Sun Rui
>         Attachments: HIVE-6120.1.patch
>
>
> In most cases, partial distinct aggregation is not needed in the map-side 
> GroupBy. The exception is sorted, bucketized tables, where the mappers can 
> perform partial distinct aggregation in some scenarios, as GroupByOptimizer 
> already does.
> Currently, the map-side GroupBy computes partial distinct aggregations and 
> the following ReduceSink operator shuffles the partial results, even in 
> cases where they are not needed. This wastes CPU cycles, memory, and network 
> bandwidth.
> This optimization eliminates the unneeded partial distinct aggregations, 
> which improves performance and reduces memory usage.
> For example,
> EXPLAIN SELECT key, count(DISTINCT value) FROM src GROUP BY key;
> Before optimization:
> {noformat}
>               Group By Operator
>                 aggregations:
>                       expr: count(DISTINCT value)
>                 bucketGroup: false
>                 keys:
>                       expr: key
>                       type: int
>                       expr: value
>                       type: string
>                 mode: hash
>                 outputColumnNames: _col0, _col1, _col2
>                 Reduce Output Operator
>                   key expressions:
>                         expr: _col0
>                         type: int
>                         expr: _col1
>                         type: string
>                   sort order: ++
>                   Map-reduce partition columns:
>                         expr: _col0
>                         type: int
>                   tag: -1
>                   value expressions:
>                         expr: _col2
>                         type: bigint
> {noformat}
> After optimization:
> {noformat}
>               Group By Operator
>                 bucketGroup: false
>                 keys:
>                       expr: key
>                       type: int
>                       expr: value
>                       type: string
>                 mode: hash
>                 outputColumnNames: _col0, _col1
>                 Reduce Output Operator
>                   key expressions:
>                         expr: _col0
>                         type: int
>                         expr: _col1
>                         type: string
>                   sort order: ++
>                   Map-reduce partition columns:
>                         expr: _col0
>                         type: int
>                   tag: -1
> {noformat}
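
The two plans above differ only in that the map side no longer emits an aggregation column (_col2). As a rough illustration of why the result is unchanged, here is a toy Python model of the "after" plan. It is not Hive code: the function names and the sample rows are hypothetical, and the real operators work on serialized rows rather than Python tuples. The idea it sketches is that the mappers only deduplicate (key, value) pairs, and the reducer counts distinct values by scanning pairs sorted on (key, value).

```python
from itertools import groupby


def map_side_group_by(rows):
    """Toy hash-mode GroupBy keyed on (key, value).

    Duplicate pairs collapse into one entry. No aggregation buffer is kept,
    which mirrors the "after" plan where _col2 is absent.
    """
    return set(rows)


def reduce_count_distinct(mapper_outputs):
    """Toy reducer: count distinct values per key from sorted (key, value) pairs."""
    # Shuffle: merge every mapper's output and sort on (key, value),
    # loosely modelling "sort order: ++" in the ReduceSink.
    shuffled = sorted(pair for output in mapper_outputs for pair in output)
    counts = {}
    for key, pairs in groupby(shuffled, key=lambda p: p[0]):
        # Pairs for one key arrive sorted by value, so equal values are adjacent
        # and a change of value marks a new distinct value.
        distinct = 0
        previous = object()  # sentinel that never equals a real value
        for _, value in pairs:
            if value != previous:
                distinct += 1
                previous = value
        counts[key] = distinct
    return counts


def count_distinct_by_key(splits):
    """Run the toy map and reduce phases over a list of input splits."""
    return reduce_count_distinct(map_side_group_by(s) for s in splits)


# Hypothetical input: two splits of (key, value) rows.
splits = [
    [(1, "a"), (1, "a"), (1, "b"), (2, "c")],
    [(1, "a"), (2, "c"), (2, "d")],
]
print(count_distinct_by_key(splits))  # {1: 2, 2: 2}
```

In this model the same pair can still reach the reducer from different mappers, so the reducer has to deduplicate anyway. That is why a count computed on the map side would carry no usable information here.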



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
