[ https://issues.apache.org/jira/browse/HIVE-28428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhihua Deng updated HIVE-28428:
-------------------------------
    Labels: hive-4.0.1-merged hive-4.0.1-must performance pull-request-available  (was: hive-4.0.1-must performance pull-request-available)

> Map hash aggregation performance degradation
> ---------------------------------------------
>
>                 Key: HIVE-28428
>                 URL: https://issues.apache.org/jira/browse/HIVE-28428
>             Project: Hive
>          Issue Type: Improvement
>          Components: Hive, Operators, Query Processor
>            Reporter: Ryu Kobayashi
>            Assignee: Ryu Kobayashi
>            Priority: Major
>              Labels: hive-4.0.1-merged, hive-4.0.1-must, performance, pull-request-available
>             Fix For: 4.1.0
>
>         Attachments: 2024-08-02 14.35.46.png, image-2024-08-02-14-37-01-824.png, image-2024-08-02-14-38-45-459.png
>
>
> The following ticket was fixed to enable map hash aggregation, but performance is now worse with it enabled than with it disabled:
> https://issues.apache.org/jira/browse/HIVE-23356
> I found a few reasons for this. When there are a large number of keys, the following log messages are emitted in large volume, which hurts performance and can also cause an OOM.
> {code:java}
> 2024-08-02 05:21:53,675 [INFO] [TezChild] |exec.GroupByOperator|: Hash Tbl flush: #hash table = 171000
> 2024-08-02 05:21:53,713 [INFO] [TezChild] |exec.GroupByOperator|: Hash Table flushed: new size = 153900
> {code}
> Fixing this improves performance as follows.
> Before:
> !image-2024-08-02-14-37-01-824.png!
> After:
> !2024-08-02 14.35.46.png!
> Also, the flush size is currently fixed, but performance can be improved by adjusting it to the data:
> !image-2024-08-02-14-38-45-459.png!
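
As a rough illustration of why the flush size affects log volume: in the two log lines quoted in the description, the table shrinks from 171000 to 153900 entries, so that flush evicted 10% of the table. The sketch below is not Hive's GroupByOperator code. The class FlushSketch, the methods entriesToFlush and countFlushes, and the flushPercent parameter are hypothetical names used only for this example. It assumes every incoming key is distinct and that the table is capped at a fixed number of entries, then counts how many flushes (and therefore how many pairs of log lines) a stream of keys would trigger.

```java
public class FlushSketch {

    /**
     * Entries evicted by one flush when a fixed percentage of the table is
     * removed. Integer arithmetic avoids floating-point rounding.
     */
    public static int entriesToFlush(int tableSize, int flushPercent) {
        return (int) ((long) tableSize * flushPercent / 100);
    }

    /**
     * Simulates inserting distinctKeys all-distinct keys into a table capped
     * at maxEntries, flushing flushPercent of the table whenever it is full.
     * Returns the number of flushes. This is a simplified model (assumption),
     * not the policy Hive actually implements.
     */
    public static long countFlushes(long distinctKeys, int maxEntries, int flushPercent) {
        long flushes = 0;
        int size = 0;
        for (long k = 0; k < distinctKeys; k++) {
            if (size >= maxEntries) {
                size -= entriesToFlush(size, flushPercent);
                flushes++;
            }
            size++; // the new key always creates a new entry in this model
        }
        return flushes;
    }

    public static void main(String[] args) {
        int maxEntries = 171000;      // table size taken from the quoted log
        long keys = 1_000_000L;       // hypothetical number of distinct keys
        for (int percent : new int[] {10, 50}) {
            long flushes = countFlushes(keys, maxEntries, percent);
            // Each flush in the quoted log produced two INFO lines.
            System.out.println("flushPercent=" + percent
                    + " flushes=" + flushes
                    + " logLines=" + (2 * flushes));
        }
    }
}
```

In this model the number of flushes, and with it the number of log lines, grows roughly in inverse proportion to the fraction evicted per flush. That is one way to see why a flush size chosen to suit the data could reduce the overhead described in the ticket.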