[
https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174730#comment-13174730
]
Phabricator commented on HIVE-2621:
-----------------------------------
kevinwilfong has commented on the revision "HIVE-2621 [jira] Allow multiple
group bys with the same input data and spray keys to be run on the same
reducer.".
INLINE COMMENTS
ql/src/test/queries/clientpositive/groupby7_noskew_multi_single_reducer.q:12
I actually had it in one MR job originally, but the addition of the limits
caused the addition of an extra job for each insert. The code to do that is at
line 6379 of the SemanticAnalyzer. I assume it does this to ensure all the
rows go to a single reducer, so the limit can be enforced.
As for making hive.multigroupby.singlereducer true by default, currently map
aggregation is set to true by default, and setting this new variable causes the
map aggregation variable to be ignored for any group by that can go to the same
reducer as at least one other.
I'll turn it on by default, I just wanted to make that clear.
Also, in that case this test isn't needed as I just copied another test and
turned on hive.multigroupby.singlereducer
ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:6211 This
code looks at all the group bys in the query block and allows for only some of
the distinct expressions and group by keys to be the same.
The new code looks for sets of group by subqueries where all of the distinct
expressions and group by keys are the same.
I'll go back and see if I reused as much of the definitions of the two
functions getCommonDistinctExprs and getCommonGroupByKeys as I could have, but
I don't think this code here could be simplified using my new code.
ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:6273 The
above if block guarantees the job can be performed using only 2 MR jobs
provided that there are no filter operators, and the distinct keys are the same
across all subqueries.
If my new code were used, there is no hard limit on the number of MR jobs
because it depends on how many variations there are on the group by keys.
I think it is possible that we could first look at the set of subqueries
without filters, see how they split on both distinct and group by keys, and for
any of these groups where two or more have the same distinct keys run this
other method on them. However, this could have been done much more easily
before (without the comparison, just group queries with the same distinct keys
together), and it wasn't, so I suspect the gains are not that great, so I'd
much rather do this as part of a separate patch once we have queries that would
benefit from it.
I'm not sure whether or not the code is broken, but I would like to leave it
for now, to be fixed, or modified as I described above as part of a separate
patch.
REVISION DETAIL
https://reviews.facebook.net/D567
> Allow multiple group bys with the same input data and spray keys to be run on
> the same reducer.
> -----------------------------------------------------------------------------------------------
>
> Key: HIVE-2621
> URL: https://issues.apache.org/jira/browse/HIVE-2621
> Project: Hive
> Issue Type: New Feature
> Reporter: Kevin Wilfong
> Assignee: Kevin Wilfong
> Attachments: HIVE-2621.1.patch.txt, HIVE-2621.D567.1.patch,
> HIVE-2621.D567.2.patch, HIVE-2621.D567.3.patch
>
>
> Currently, when a user runs a query, such as a multi-insert, where each
> insertion subclause consists of a simple query followed by a group by, the
> group bys for each clause are run on a separate reducer. This requires
> writing the data for each group by clause to an intermediate file, and then
> reading it back. This uses a significant amount of the total CPU consumed by
> the query for an otherwise simple query.
> If the subclauses are grouped by their distinct expressions and group by
> keys, with all of the group by expressions for a group of subclauses run on a
> single reducer, this would reduce the amount of reading/writing to
> intermediate files for some queries.
> To do this, for each group of subclauses, in the mapper we would execute a
> the filters for each subclause 'or'd together (provided each subclause has a
> filter) followed by a reduce sink. In the reducer, the child operators would
> be each subclauses filter followed by the group by and any subsequent
> operations.
> Note that this would require turning off map aggregation, so we would need to
> make using this type of plan configurable.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira