Ryu Kobayashi created HIVE-28428:
------------------------------------

             Summary:  Map hash aggregation performance degradation
                 Key: HIVE-28428
                 URL: https://issues.apache.org/jira/browse/HIVE-28428
             Project: Hive
          Issue Type: Improvement
            Reporter: Ryu Kobayashi
         Attachments: 2024-08-02 14.35.46.png, 
image-2024-08-02-14-37-01-824.png, image-2024-08-02-14-38-45-459.png

The following ticket was fixed to enable map hash aggregation, but 
performance is now worse than when it is disabled.
https://issues.apache.org/jira/browse/HIVE-23356

I found a few reasons for this. When there are a large number of keys, the 
following log lines are written in large volume, which hurts performance and 
can also cause an OOM.
{code:java}
2024-08-02 05:21:53,675 [INFO] [TezChild] |exec.GroupByOperator|: Hash Tbl flush: #hash table = 171000
2024-08-02 05:21:53,713 [INFO] [TezChild] |exec.GroupByOperator|: Hash Table flushed: new size = 153900
{code}
By fixing this, we can improve performance as follows.
Before:

!image-2024-08-02-14-37-01-824.png!

After:

!2024-08-02 14.35.46.png!

Also, the flush size is currently fixed, but performance can be improved by 
adjusting it to the data:

!image-2024-08-02-14-38-45-459.png!
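
To illustrate why the flush size matters, here is a minimal sketch of a map-side hash aggregation with a configurable flush fraction. It is a simplified model, not the actual GroupByOperator code: the class FlushSketch, the parameters maxEntries and flushFraction, and the entry-count trigger are all assumptions made for illustration. The log above shows the table going from 171000 to 153900 entries, i.e. 10% evicted in one flush, so 0.1 is used as the baseline fraction. With mostly distinct keys, a small fraction means the table refills almost immediately and the flush (and its two INFO lines) repeats very often.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical model of map-side hash aggregation; not Hive's GroupByOperator. */
class FlushSketch {
    /** Counters gathered during one simulated run. */
    static final class Stats {
        long flushes;      // flush events; each one would emit the two log lines above
        long rowsEmitted;  // partial aggregates forwarded downstream
    }

    /**
     * Counts occurrences per key. Whenever the table holds maxEntries keys,
     * the oldest flushFraction of its entries is evicted and emitted.
     */
    static Stats aggregate(long[] keys, int maxEntries, double flushFraction) {
        Map<Long, Long> table = new LinkedHashMap<>();
        Stats stats = new Stats();
        for (long key : keys) {
            table.merge(key, 1L, Long::sum);
            if (table.size() >= maxEntries) {
                int toEvict = Math.max(1, (int) (table.size() * flushFraction));
                Iterator<Map.Entry<Long, Long>> it = table.entrySet().iterator();
                for (int i = 0; i < toEvict && it.hasNext(); i++) {
                    it.next();
                    it.remove();
                    stats.rowsEmitted++;
                }
                stats.flushes++;
            }
        }
        stats.rowsEmitted += table.size(); // remaining entries are emitted at close
        return stats;
    }

    public static void main(String[] args) {
        long[] keys = new long[1_000_000];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = i; // all-distinct keys: the worst case for hash aggregation
        }
        for (double fraction : new double[] {0.1, 0.5, 1.0}) {
            Stats s = aggregate(keys, 171_000, fraction);
            System.out.println("fraction=" + fraction + " flushes=" + s.flushes);
        }
    }
}
```

In this model the number of flushes for distinct keys drops roughly in proportion to the flush fraction, which is the effect a data-dependent flush size would exploit. The real operator decides when to flush based on memory usage rather than a plain entry count, so the absolute numbers here are not representative.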



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
