[ https://issues.apache.org/jira/browse/HIVE-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626037#comment-13626037 ]
Phabricator commented on HIVE-4248:
-----------------------------------

kevinwilfong has commented on the revision "HIVE-4248 [jira] Implement a memory manager for ORC".

This allows for cases where the memory used could exceed the amount of memory allocated by a significant margin. E.g. say totalMemoryPool = 256 Mb = stripe size, and say we have a writer that has written 255 Mb to a stripe; then a second writer is created (e.g. a new dynamic partition value is encountered) and all new rows get written to this second writer. Nothing will get written out until the second writer accumulates 128 Mb of data in its stripe (its share once the allocation scale drops to 0.5), using a total of 383 Mb against the 256 Mb allocated. In theory, with some terrible luck, these could be chained together to use significantly more memory (the first writer writes 255 Mb, the second 127 Mb, the third 85 Mb, etc.).

Could you loop through the writers whenever a new writer is added (which shouldn't happen too frequently) and check whether the estimated stripe size of any of them exceeds stripeSize * memoryManager.getAllocationScale()? This should be doable by making a couple of methods public and storing a reference to the WriterImpl along with, or instead of, the Path.

Also (this could be done in a follow-up), could there be an additional check on the total HeapMemoryUsage? E.g. in the shouldBeFlushed method of GroupByOperator, every 1000 rows it checks that no more than 90% of the total heap has been used, and if it has, it flushes the hash map. Something similar could be done for WriterImpl, and given the MemoryManager it could even flush the largest stripe rather than just the one that pushed it over the edge. This would be particularly useful because, in the case of a map join followed by a map-side aggregation, the map join is allowed to use 55% of the memory and the group by another 30%; if there is also a FileSinkOperator, allowing the ORC WriterImpl to use 50% could be too much.

INLINE COMMENTS
  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java:490 Could you add this to conf/hive-default.xml.template as well?

REVISION DETAIL
  https://reviews.facebook.net/D9993

To: JIRA, omalley
Cc: kevinwilfong


> Implement a memory manager for ORC
> ----------------------------------
>
>                 Key: HIVE-4248
>                 URL: https://issues.apache.org/jira/browse/HIVE-4248
>             Project: Hive
>          Issue Type: New Feature
>          Components: Serializers/Deserializers
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>         Attachments: HIVE-4248.D9993.1.patch, HIVE-4248.D9993.2.patch
>
>
> With the large default stripe size (256MB) and dynamic partitions, it is quite easy for users to run out of memory when writing ORC files. We probably need a solution that keeps track of the total number of concurrent ORC writers and divides the available heap space between them.
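To illustrate the two suggestions in the review comment above, here is a rough sketch of a check that runs when a writer registers, plus the GroupByOperator-style heap safety valve. The WriterHooks interface and its method names (getStripeSize, estimateMemory, flushStripe) are invented for illustration; the real change would reuse whatever methods WriterImpl already has or makes public.

{code:java}
// Sketch only: WriterHooks and its methods are hypothetical stand-ins for
// whatever WriterImpl would need to expose; they are not the existing API.
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.Path;

public class MemoryManagerSketch {

  /** Hypothetical hooks a registered writer would expose to the manager. */
  public interface WriterHooks {
    long getStripeSize();                   // configured stripe size for this writer
    long estimateMemory();                  // bytes currently buffered in the open stripe
    void flushStripe() throws IOException;  // force the open stripe out
  }

  private final long totalMemoryPool;  // e.g. a fixed fraction of the JVM heap
  private final Map<Path, WriterHooks> writers = new HashMap<Path, WriterHooks>();
  private long totalRequested = 0;

  public MemoryManagerSketch(long totalMemoryPool) {
    this.totalMemoryPool = totalMemoryPool;
  }

  public double getAllocationScale() {
    return totalRequested <= totalMemoryPool
        ? 1.0 : (double) totalMemoryPool / totalRequested;
  }

  /** Called when a new writer is created, e.g. for a new dynamic partition. */
  public synchronized void addWriter(Path path, WriterHooks writer)
      throws IOException {
    writers.put(path, writer);
    totalRequested += writer.getStripeSize();
    double scale = getAllocationScale();
    // Suggestion 1: re-check every registered writer, since a writer created
    // before the scale shrank may already hold more than its new share.
    for (WriterHooks w : writers.values()) {
      if (w.estimateMemory() > w.getStripeSize() * scale) {
        w.flushStripe();
      }
    }
  }

  /** Suggestion 2 (follow-up): a GroupByOperator-style check, run every N rows,
   *  that flushes the largest open stripe when the heap is nearly full. */
  public synchronized void checkHeap(double maxHeapFraction) throws IOException {
    MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
    if (heap.getMax() > 0 && heap.getUsed() > heap.getMax() * maxHeapFraction) {
      WriterHooks largest = null;
      for (WriterHooks w : writers.values()) {
        if (largest == null || w.estimateMemory() > largest.estimateMemory()) {
          largest = w;
        }
      }
      if (largest != null) {
        largest.flushStripe();
      }
    }
  }
}
{code}

The idea is that the loop in addWriter only runs when a writer is created, so its cost is proportional to the small number of open writers, while the heap check amortizes its cost by running only every N rows, as GroupByOperator's shouldBeFlushed does.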