[ https://issues.apache.org/jira/browse/HIVE-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626037#comment-13626037 ]

Phabricator commented on HIVE-4248:
-----------------------------------

kevinwilfong has commented on the revision "HIVE-4248 [jira] Implement a memory 
manager for ORC".

  This allows cases where the memory used significantly exceeds the amount of 
memory allocated.

  E.g. say totalMemoryPool = 256 MB = stripe size, and say we have a writer 
that has buffered 255 MB toward a stripe. Then a second writer is created 
(e.g. a new dynamic partition value is encountered) and all new rows get 
written to this second writer. Nothing will get written out until the second 
writer accumulates 128 MB of data in its stripe, at which point a total of 
383 MB is in use against the allocated 256 MB.  In theory, with some terrible 
luck, these could be chained together to use significantly more memory (first 
writer buffers 255 MB, second 127 MB, third 85 MB, etc.)
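  The chained worst case works out to roughly a harmonic series on the pool 
size, since the i-th writer added shrinks the allocation scale to 1/i. A quick 
sketch (the worstCaseMb helper is illustrative, not part of the patch):

```java
// Worst case for n concurrent writers sharing a pool: when the i-th
// writer is added the allocation scale drops to 1/i, so each writer in
// turn may buffer just under pool/i MB before anything is flushed.
// The total approaches pool * (1 + 1/2 + 1/3 + ... + 1/n).
class WorstCaseMemory {
    static double worstCaseMb(double poolMb, int writers) {
        double total = 0;
        for (int i = 1; i <= writers; i++) {
            total += poolMb / i;  // i-th writer buffers up to poolMb / i
        }
        return total;
    }

    public static void main(String[] args) {
        // Two writers: 256 + 128 = 384 MB (the 255 + 127 = 383 scenario
        // above, rounded up to the flush thresholds).
        System.out.println(worstCaseMb(256, 2));  // 384.0
        // Three writers: 256 + 128 + 85.33... MB
        System.out.println(worstCaseMb(256, 3));
    }
}
```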

  Could you loop through the stripes whenever a writer is added (this 
shouldn't happen too frequently) and check whether the estimated stripe size 
of any of these writers exceeds stripeSize * memoryManager.getAllocationScale()? 
This should be doable by making a couple of methods public and storing a 
reference to the WriterImpl along with, or instead of, the Path.
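  A toy model of that check (FakeWriter, flushStripe, and the field names are 
hypothetical stand-ins for WriterImpl and MemoryManager, whose real APIs 
differ; sizes are in MB for readability):

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the suggested re-check; not the real MemoryManager API. */
class StripeCheckSketch {
    /** Minimal stand-in for WriterImpl; flushed records a forced flush. */
    static class FakeWriter {
        final long stripeSize;   // configured stripe size (MB)
        long buffered;           // estimated bytes currently in the stripe (MB)
        boolean flushed = false;

        FakeWriter(long stripeSize) { this.stripeSize = stripeSize; }
        void flushStripe() { flushed = true; buffered = 0; }
    }

    final long totalMemoryPool;  // MB available to all writers
    final List<FakeWriter> writers = new ArrayList<>();

    StripeCheckSketch(long totalMemoryPool) {
        this.totalMemoryPool = totalMemoryPool;
    }

    double getAllocationScale() {
        long totalStripeSize = 0;
        for (FakeWriter w : writers) {
            totalStripeSize += w.stripeSize;
        }
        return Math.min(1.0, (double) totalMemoryPool / totalStripeSize);
    }

    /** The proposed extra step: on every addWriter, re-check all writers. */
    void addWriter(FakeWriter w) {
        writers.add(w);
        double scale = getAllocationScale();
        for (FakeWriter existing : writers) {
            if (existing.buffered > existing.stripeSize * scale) {
                existing.flushStripe();
            }
        }
    }

    public static void main(String[] args) {
        StripeCheckSketch mm = new StripeCheckSketch(256);
        FakeWriter w1 = new FakeWriter(256);
        mm.addWriter(w1);
        w1.buffered = 255;        // first writer has 255 MB buffered
        FakeWriter w2 = new FakeWriter(256);
        mm.addWriter(w2);         // scale halves to 0.5; 255 > 128, so w1 flushes
        System.out.println(w1.flushed);  // true
    }
}
```

  With this re-check, the 255 MB writer in the scenario above is flushed as 
soon as the second writer registers, instead of lingering until the second 
writer itself crosses its threshold.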

  Also (this could be done in a follow-up): could there be an additional check 
on total heap usage?  E.g. in the shouldBeFlushed method of GroupByOperator, 
every 1000 rows it checks that no more than 90% of the total heap has been 
used, and if so it flushes the hash map.  Something similar could be done for 
WriterImpl, and given the MemoryManager, it could even flush the largest 
stripe rather than just the one that pushed it over the edge.  This would be 
particularly useful because, e.g., in the case of a map join followed by a map 
aggregation, the map join is allowed to use 55% of the memory and the group by 
another 30%; if there were also a FileSinkOperator, allowing the ORC 
WriterImpl to use 50% could be too much.
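  A minimal sketch of such a periodic heap check, reusing the 90% threshold 
and 1000-row interval from the GroupByOperator example (the HeapCheck class 
and its method names are illustrative, not existing Hive code):

```java
// Sketch of a periodic heap check modeled on GroupByOperator.shouldBeFlushed:
// every CHECK_INTERVAL rows, compare used heap against a fraction of max heap.
class HeapCheck {
    private static final int CHECK_INTERVAL = 1000;       // rows between checks
    private static final double MAX_HEAP_FRACTION = 0.9;  // 90% of heap
    private long rowCount = 0;

    /** Returns true when a flush should be triggered for this row. */
    boolean shouldFlush() {
        if (++rowCount % CHECK_INTERVAL != 0) {
            return false;  // only inspect the heap every CHECK_INTERVAL rows
        }
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return used > MAX_HEAP_FRACTION * rt.maxMemory();
    }

    public static void main(String[] args) {
        HeapCheck hc = new HeapCheck();
        System.out.println(hc.shouldFlush());  // false: below the check interval
    }
}
```

  On a trigger, the MemoryManager could pick the writer with the largest 
estimated stripe to flush, as suggested above, rather than the writer that 
happened to be processing the current row.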

INLINE COMMENTS
  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java:490 Could you add 
this to conf/hive-default.xml.template as well?

REVISION DETAIL
  https://reviews.facebook.net/D9993

To: JIRA, omalley
Cc: kevinwilfong

                
> Implement a memory manager for ORC
> ----------------------------------
>
>                 Key: HIVE-4248
>                 URL: https://issues.apache.org/jira/browse/HIVE-4248
>             Project: Hive
>          Issue Type: New Feature
>          Components: Serializers/Deserializers
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>         Attachments: HIVE-4248.D9993.1.patch, HIVE-4248.D9993.2.patch
>
>
> With the large default stripe size (256MB) and dynamic partitions, it is 
> quite easy for users to run out of memory when writing ORC files. We probably 
> need a solution that keeps track of the total number of concurrent ORC 
> writers and divides the available heap space between them. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
