Ryu Kobayashi created HIVE-28428:
------------------------------------

             Summary: Map hash aggregation performance degradation
                 Key: HIVE-28428
                 URL: https://issues.apache.org/jira/browse/HIVE-28428
             Project: Hive
          Issue Type: Improvement
            Reporter: Ryu Kobayashi
         Attachments: 2024-08-02 14.35.46.png, image-2024-08-02-14-37-01-824.png, image-2024-08-02-14-38-45-459.png

The following ticket was fixed to enable map hash aggregation, but performance is now worse than when it is disabled.
https://issues.apache.org/jira/browse/HIVE-23356

I found a few reasons for this. When there are a large number of keys, the following log messages are output in large volume, which affects performance. This can also cause an OOM.
{code:java}
2024-08-02 05:21:53,675 [INFO] [TezChild] |exec.GroupByOperator|: Hash Tbl flush: #hash table = 171000
2024-08-02 05:21:53,713 [INFO] [TezChild] |exec.GroupByOperator|: Hash Table flushed: new size = 153900
{code}
By fixing this, we can improve performance as follows.

Before:
!image-2024-08-02-14-37-01-824.png!

After:
!2024-08-02 14.35.46.png!

Also, the flush size is currently fixed, but performance can be improved by changing it depending on the data:
!image-2024-08-02-14-38-45-459.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
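
As a rough illustration of the flush arithmetic described in this ticket (not the actual GroupByOperator code): the sample log goes from 171000 to 153900 entries, which is consistent with each flush removing 10% of the hash table. The sketch below assumes that behaviour and counts how many flushes, and therefore how many pairs of INFO lines, a given number of distinct keys would trigger. The class name, the method names, the fixed table limit, and the 10,000,000-key input are all hypothetical and chosen only for illustration.

```java
// Hypothetical sketch only; this is not the Hive GroupByOperator implementation.
class HashFlushSketch {

    // Size of the table after removing flushPercent% of its entries (assumed flush rule).
    static int sizeAfterFlush(int oldSize, int flushPercent) {
        int removed = (int) ((long) oldSize * flushPercent / 100);
        return oldSize - removed;
    }

    // Counts flushes while inserting distinctKeys new keys into a table that is
    // flushed whenever it already holds maxEntries entries. Every key is assumed
    // to be new, which is the worst case for map-side hash aggregation.
    static long countFlushes(long distinctKeys, int maxEntries, int flushPercent) {
        long flushes = 0;
        int size = 0;
        for (long k = 0; k < distinctKeys; k++) {
            if (size >= maxEntries) {
                size = sizeAfterFlush(size, flushPercent);
                flushes++; // each flush would emit the two INFO lines shown in the log
            }
            size++;
        }
        return flushes;
    }

    public static void main(String[] args) {
        // Matches the log sample: 171000 entries before, 153900 after.
        System.out.println(sizeAfterFlush(171000, 10));

        long keys = 10_000_000L; // hypothetical number of distinct keys in one task
        System.out.println("flushes at 10%: " + countFlushes(keys, 171000, 10));
        System.out.println("flushes at 50%: " + countFlushes(keys, 171000, 50));
    }
}
```

Under these assumptions, a larger flush percentage leads to fewer flushes and fewer log lines, but each flush also evicts more partial aggregates. The sketch only counts flush events and measures nothing, so which value is best presumably depends on the key distribution, as the ticket suggests.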