[ https://issues.apache.org/jira/browse/HIVE-6120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858362#comment-13858362 ]
Hive QA commented on HIVE-6120:
-------------------------------

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12620770/HIVE-6120.1.patch

{color:red}ERROR:{color} -1 due to 3 failed/errored test(s), 4818 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_sort_8
org.apache.hadoop.hive.ql.parse.TestParse.testParse_groupby2
org.apache.hadoop.hive.ql.parse.TestParse.testParse_groupby3
{noformat}

Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/764/testReport
Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/764/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 3 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12620770

> Add GroupBy optimization to eliminate un-needed partial distinct aggregations
> -----------------------------------------------------------------------------
>
>                 Key: HIVE-6120
>                 URL: https://issues.apache.org/jira/browse/HIVE-6120
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Sun Rui
>            Assignee: Sun Rui
>         Attachments: HIVE-6120.1.patch
>
>
> In most cases, partial distinct aggregation is not needed in the map-side
> GroupBy. The exception is sorted, bucketized tables, where the mappers can
> perform partial distinct aggregation in some scenarios, as GroupByOptimizer
> does.
> Currently, partial distinct aggregation is performed in the map-side GroupBy,
> and the partial result is then shuffled by the following ReduceSink operator,
> even in cases where neither is needed. This wastes CPU cycles, memory
> and network bandwidth.
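
The point of the paragraph above can be illustrated with a toy Python model. This is not Hive code: the function names are hypothetical, and the per-pair row count standing in for the "partial result" is an assumption made only for illustration. The sketch shows a reducer that derives count(DISTINCT value) from the shuffled (key, value) pairs alone, so a partial aggregate shipped with each pair is never read.

```python
from collections import defaultdict


def map_side_with_partial(rows):
    """Toy map-side hash GroupBy keyed on (key, value).

    Also keeps a partial aggregate per pair (here a row count, an assumed
    stand-in) and ships it with every shuffled record.
    """
    partial = defaultdict(int)
    for key, value in rows:
        partial[(key, value)] += 1
    return [(key, value, count) for (key, value), count in partial.items()]


def map_side_keys_only(rows):
    """Toy map-side hash GroupBy that only deduplicates (key, value) pairs."""
    return list(dict.fromkeys(rows))


def reduce_count_distinct(shuffled):
    """Toy reducer: counts distinct values per key from the shuffle keys.

    Only the first two fields of each record are read; any partial
    aggregate that follows them is ignored.
    """
    distinct = defaultdict(set)
    for record in shuffled:
        key, value = record[0], record[1]
        distinct[key].add(value)
    return {key: len(values) for key, values in distinct.items()}


if __name__ == "__main__":
    sample = [(1, "a"), (1, "a"), (1, "b"), (2, "c")]
    with_partial = reduce_count_distinct(map_side_with_partial(sample))
    keys_only = reduce_count_distinct(map_side_keys_only(sample))
    # Both plans agree; the keys-only plan shuffles one field fewer per record.
    assert with_partial == keys_only
```

In this model the two map-side variants produce the same reducer output, while the keys-only variant does less work per row and shuffles smaller records.
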
> This optimization eliminates un-needed partial distinct aggregations, which
> improves performance and reduces memory usage.
> For example,
> EXPLAIN SELECT key, count(DISTINCT value) FROM src GROUP BY key;
> Before optimization:
> {noformat}
> Group By Operator
>   aggregations:
>         expr: count(DISTINCT value)
>   bucketGroup: false
>   keys:
>         expr: key
>         type: int
>         expr: value
>         type: string
>   mode: hash
>   outputColumnNames: _col0, _col1, _col2
>   Reduce Output Operator
>     key expressions:
>           expr: _col0
>           type: int
>           expr: _col1
>           type: string
>     sort order: ++
>     Map-reduce partition columns:
>           expr: _col0
>           type: int
>     tag: -1
>     value expressions:
>           expr: _col2
>           type: bigint
> {noformat}
> After optimization:
> {noformat}
> Group By Operator
>   bucketGroup: false
>   keys:
>         expr: key
>         type: int
>         expr: value
>         type: string
>   mode: hash
>   outputColumnNames: _col0, _col1
>   Reduce Output Operator
>     key expressions:
>           expr: _col0
>           type: int
>           expr: _col1
>           type: string
>     sort order: ++
>     Map-reduce partition columns:
>           expr: _col0
>           type: int
>     tag: -1
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)