[ https://issues.apache.org/jira/browse/HIVE-20153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541929#comment-16541929 ]
Szehon Ho commented on HIVE-20153:
----------------------------------

[~aihuaxu] do you think there is some way to improve this? (I have not yet looked at this code closely enough to understand it deeply.) It seems to consume memory whether or not it is used in a window function.

The query is something like (generalizing the table):

select count(distinct), count(), count(), count(), min(), min(), max(), max(), min(), max() from table group by field;

I have also attached the heap dump of a mapper that was killed with an OOM, for reference. There are 3 million GenericUDAFCountEvaluator instances, each with a hashmap; I don't know whether that is unusual or not.

!Screen Shot 2018-07-12 at 6.41.28 PM.png!

> Count and Sum UDF consume more memory in Hive 2+
> ------------------------------------------------
>
>                 Key: HIVE-20153
>                 URL: https://issues.apache.org/jira/browse/HIVE-20153
>             Project: Hive
>          Issue Type: Bug
>          Components: UDF
>    Affects Versions: 2.3.2
>            Reporter: Szehon Ho
>            Priority: Major
>         Attachments: Screen Shot 2018-07-12 at 6.41.28 PM.png
>
>
> While playing with Hive2, we noticed that queries with a lot of count() and
> sum() aggregations run out of memory on the Hadoop side much faster than in
> Hive1. In many queries, we have to double the memory.
>
> Taking a heap dump, we see that one of the main culprits is the field
> 'uniqueObjects' in GenericUDAFSum and GenericUDAFCount, which was added to
> support window functions.
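
To make the suspected cost concrete, here is a minimal sketch of the pattern described above. It is not Hive's code: the class names EagerCountBuffer and LazyCountBuffer are hypothetical, and it only assumes that each per-group aggregation buffer carries a 'uniqueObjects' set. The eager variant allocates the set for every buffer, even for a plain count(); the lazy variant allocates it only when distinct tracking is actually requested.

```java
import java.util.HashSet;
import java.util.Set;

public class CountBufferSketch {

    // Hypothetical buffer that allocates the set up front for every group,
    // whether or not the aggregation is distinct.
    static class EagerCountBuffer {
        long count;
        final Set<Object> uniqueObjects = new HashSet<>();

        void add(Object value) {
            if (value != null) {
                count++;
            }
        }
    }

    // Hypothetical alternative: the set is created only for distinct counts.
    static class LazyCountBuffer {
        final boolean distinct;
        long count;
        Set<Object> uniqueObjects; // stays null for a plain count()

        LazyCountBuffer(boolean distinct) {
            this.distinct = distinct;
        }

        void add(Object value) {
            if (value == null) {
                return; // count(col) ignores nulls
            }
            if (distinct) {
                if (uniqueObjects == null) {
                    uniqueObjects = new HashSet<>();
                }
                if (!uniqueObjects.add(value)) {
                    return; // value already seen, do not count it again
                }
            }
            count++;
        }
    }

    public static void main(String[] args) {
        LazyCountBuffer plain = new LazyCountBuffer(false);
        plain.add("a");
        plain.add("a");
        System.out.println("plain count = " + plain.count
                + ", set allocated = " + (plain.uniqueObjects != null));

        LazyCountBuffer distinct = new LazyCountBuffer(true);
        distinct.add("a");
        distinct.add("a");
        distinct.add("b");
        System.out.println("distinct count = " + distinct.count);
    }
}
```

With millions of groups, as in the heap dump above, an empty set per buffer adds up, which is why deferring the allocation might help. Whether Hive can defer it safely depends on how the window-function path uses the field, which this sketch does not model.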