[ https://issues.apache.org/jira/browse/HIVE-10600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sergey Shelukhin updated HIVE-10600:
------------------------------------
    Description:
Quoting [~gopalv]:
{noformat}
So, something like a sum() GROUP BY will create a few hundred thousand
AbstractAggregationBuffer objects all of which will suddenly go out of
scope when the map.aggr flushes it down to the sort buffer.
That particular GC collection takes forever because the tiny buffers take
a lot of time to walk over and then they leave the memory space
fragmented, which requires a compaction pass (which btw, writes to a
page-interleaved NUMA zone).
And to make things worse, the pre-allocated sort buffers with absolutely
zero data in them take up most of the tenured regions causing these chunks
of memory to be visited more and more often as they are part of the Eden
space.
{noformat}
We need flat data structures to be GC friendly.

  was:
Quote [~gopalv]:
{noformat}
So, something like a sum() GROUP BY will create a few hundred thousand
AbstractAggregationBuffer objects all of which will suddenly go out of
scope when the map.aggr flushes it down to the sort buffer.
That particular GC collection takes forever because the tiny buffers take
a lot of time to walk over and then they leave the memory space
fragmented, which requires a compaction pass (which btw, writes to a
page-interleaved NUMA zone).
And to make things worse, the pre-allocated sort buffers with absolutely
zero data in them take up most of the tenured regions causing these chunks
of memory to be visited more and more often as they are part of the Eden
space.
{noformat}
We need flat data structures to be GC friendly.
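To make "flat data structures" concrete, here is a minimal Java sketch of a sum() GROUP BY hash table that keeps all state in a few primitive arrays instead of one buffer object per group. It is an illustration only, not Hive's implementation: the names FlatSumAggregator, GroupSink, add, sumFor and flush are hypothetical, and it assumes long group keys and long sums. Because a flush only clears a flag array, the collector never has to walk or reclaim hundreds of thousands of small objects.

```java
import java.util.Arrays;

/** Hypothetical sketch: sum() GROUP BY over long keys with no per-group objects. */
final class FlatSumAggregator {
    /** Hypothetical callback that receives each (key, sum) pair on flush. */
    interface GroupSink { void accept(long key, long sum); }

    private long[] keys;     // group key stored in each slot
    private long[] sums;     // running sum for the key in the same slot
    private boolean[] used;  // true if the slot holds a group
    private int size;        // number of groups currently stored

    FlatSumAggregator(int expectedGroups) {
        int cap = 16;
        while (cap < expectedGroups * 2) cap <<= 1; // power of two, load <= 50%
        allocate(cap);
    }

    private void allocate(int cap) {
        keys = new long[cap];
        sums = new long[cap];
        used = new boolean[cap];
    }

    /** Adds value to the running sum of the group identified by key. */
    void add(long key, long value) {
        if ((size + 1) * 2 > keys.length) grow();
        int i = slot(key);
        if (!used[i]) {
            used[i] = true;
            keys[i] = key;
            sums[i] = 0;
            size++;
        }
        sums[i] += value;
    }

    /** Returns the current sum for key, or 0 if the group is absent. */
    long sumFor(long key) {
        int i = slot(key);
        return used[i] ? sums[i] : 0;
    }

    int size() { return size; }

    /** Emits every group to the sink, then resets the table in place. */
    void flush(GroupSink sink) {
        for (int i = 0; i < keys.length; i++) {
            if (used[i]) sink.accept(keys[i], sums[i]);
        }
        Arrays.fill(used, false); // arrays are reused; nothing becomes garbage
        size = 0;
    }

    // Linear probing: returns the slot holding key, or the first free slot.
    private int slot(long key) {
        int mask = keys.length - 1;
        int i = (int) mix(key) & mask;
        while (used[i] && keys[i] != key) i = (i + 1) & mask;
        return i;
    }

    // Doubles the capacity and reinserts the existing groups.
    private void grow() {
        long[] oldKeys = keys;
        long[] oldSums = sums;
        boolean[] oldUsed = used;
        allocate(oldKeys.length * 2);
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldUsed[i]) {
                int j = slot(oldKeys[i]);
                used[j] = true;
                keys[j] = oldKeys[i];
                sums[j] = oldSums[i];
            }
        }
    }

    // Simple bit mixer so that sequential keys spread across slots.
    private static long mix(long k) {
        k ^= k >>> 33;
        k *= 0xff51afd7ed558ccdL;
        k ^= k >>> 33;
        return k;
    }
}
```

The trade-off in this sketch is that the arrays stay allocated between flushes, so memory is held rather than released, in exchange for avoiding the allocation and collection of a short-lived object per group.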
> optimize group by for GC
> ------------------------
>
>                 Key: HIVE-10600
>                 URL: https://issues.apache.org/jira/browse/HIVE-10600
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>
> Quoting [~gopalv]:
> {noformat}
> So, something like a sum() GROUP BY will create a few hundred thousand
> AbstractAggregationBuffer objects all of which will suddenly go out of
> scope when the map.aggr flushes it down to the sort buffer.
> That particular GC collection takes forever because the tiny buffers take
> a lot of time to walk over and then they leave the memory space
> fragmented, which requires a compaction pass (which btw, writes to a
> page-interleaved NUMA zone).
> And to make things worse, the pre-allocated sort buffers with absolutely
> zero data in them take up most of the tenured regions causing these chunks
> of memory to be visited more and more often as they are part of the Eden
> space.
> {noformat}
> We need flat data structures to be GC friendly.