[ 
https://issues.apache.org/jira/browse/HIVE-6222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13925001#comment-13925001
 ] 

Remus Rusanu commented on HIVE-6222:
------------------------------------

The .1 patch refactors VectorGroupByOperator to delegate the aggregation 
algorithm to a nested processingMode object. Three processing modes are provided:

 - global aggregate. This is the trivial mode, used when there are no keys. All 
values are aggregated into a single row of aggregation buffers, and the values 
are emitted at operator closeOp().
 - hash aggregate. This is all the previous VGBy operator logic, with the hash 
table and including memory pressure flushes.
 - streaming aggregate. This mode aggregates intermediate values as keys change 
in the input and flushes at each key value change. It relies on the MR shuffle 
and the row-mode GBy reduce phase to merge the intermediate values. Due to the 
way aggregators operate on batches, the logic of flushing is not strictly 'on 
new key' but 'for all new keys in a batch, except the last'. Identical keys in 
a batch are not aggregated unless they form a contiguous run.
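
To illustrate the streaming mode's flush rule, here is a minimal sketch. It is 
not the code in the patch: the class StreamingAggregateSketch, its Partial row 
type, and the use of String keys with a running long sum are all hypothetical 
simplifications. It shows only that a key's partial result is emitted when the 
next key arrives, that the last key of a batch is held back because its run may 
continue in the next batch, and that non-contiguous repeats of a key produce 
separate partials for a downstream merge.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of streaming aggregation; not Hive's actual operator. */
public class StreamingAggregateSketch {

    /** One emitted intermediate row: a key and its partial sum. */
    public static final class Partial {
        public final String key;
        public final long sum;

        Partial(String key, long sum) {
            this.key = key;
            this.sum = sum;
        }

        @Override
        public String toString() {
            return key + "=" + sum;
        }
    }

    private final List<Partial> emitted = new ArrayList<>();
    private String currentKey;   // key of the run still being accumulated
    private long currentSum;     // partial aggregate for currentKey
    private boolean hasCurrent;  // false until the first row is seen

    /** Processes one batch given as parallel key and value arrays. */
    public void processBatch(String[] keys, long[] values) {
        for (int i = 0; i < keys.length; i++) {
            // A key change ends the current run, so its partial is flushed.
            if (hasCurrent && !currentKey.equals(keys[i])) {
                flushCurrent();
            }
            if (!hasCurrent) {
                currentKey = keys[i];
                currentSum = 0;
                hasCurrent = true;
            }
            currentSum += values[i];
        }
        // The last key of the batch is deliberately NOT flushed here:
        // its run may continue at the start of the next batch.
    }

    /** Emits whatever is still pending, analogous to flushing on close. */
    public void close() {
        if (hasCurrent) {
            flushCurrent();
        }
    }

    private void flushCurrent() {
        emitted.add(new Partial(currentKey, currentSum));
        hasCurrent = false;
    }

    /** Partials emitted so far, in emission order. */
    public List<Partial> emitted() {
        return emitted;
    }
}
```

In this sketch a key sequence such as a, a, b, a yields one partial for the 
first run of a, one for b, and a separate partial for the later a. The sketch 
assumes a later merge step combines those partials, which is the role the 
comment assigns to the shuffle and the row-mode reduce phase.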

This patch will conflict with HIVE-6518 because the relevant code is moved into 
the new nested ProcessingModeHashAggregate class. Porting the fix is trivial. I 
will rebase either this or HIVE-6518, depending on which gets committed first.

> Make Vector Group By operator abandon grouping if too many distinct keys
> ------------------------------------------------------------------------
>
>                 Key: HIVE-6222
>                 URL: https://issues.apache.org/jira/browse/HIVE-6222
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Remus Rusanu
>            Assignee: Remus Rusanu
>            Priority: Minor
>         Attachments: HIVE-6222.1.patch
>
>
> Row mode GBY is becoming a pass-through if not enough aggregation occurs on 
> the map side, relying on the shuffle+reduce side to do the work. Have VGBY do 
> the same.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
