Hello,

I’ve found a condition where the MemoryManager will wait too long before 
notifying writers to check their memory and flush.


This issue affects anyone writing many columns, very large columns, or, worst 
of all, both. I have tested and confirmed the issue on Hive 0.12, 0.13, and 
trunk.

Some searching suggests other folks have been running into this as well. The 
issue manifests as long GC pauses that eventually end in the exception below 
while writing data. Tuning hive.exec.orc.memory.pool, or any of the other ORC 
parameters, has no apparent effect once you hit this issue.

java.lang.OutOfMemoryError: Java heap space
        java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
        java.nio.ByteBuffer.allocate(ByteBuffer.java:331)
        org.apache.hadoop.hive.ql.io.orc.OutStream.getNewInputBuffer(OutStream.java:107)
        org.apache.hadoop.hive.ql.io.orc.OutStream.spill(OutStream.java:223)
        org.apache.hadoop.hive.ql.io.orc.OutStream.flush(OutStream.java:239)
...

I ran into this issue while generating ORC files, but I believe it affects all 
storage types. The only workaround at present is to give tasks lots of extra 
memory; an example is shown below.

https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/MemoryManager.java#L50

The issue is the hard-coded constant on line 50: ROWS_BETWEEN_CHECKS = 5000;

With large or many columns, it’s easy to hit GC trouble or an OOM before 5000 
rows are written.

I believe ROWS_BETWEEN_CHECKS should be made a configuration parameter that 
can be passed in via the JobConf.
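
Roughly something like the sketch below; the property name is a placeholder I 
made up for illustration, not an existing setting:

    import org.apache.hadoop.conf.Configuration;

    // Hypothetical sketch: read the check interval from the job
    // configuration instead of hard-coding it.
    public class MemoryManager {
      // Placeholder key, not a real Hive property today.
      static final String ROWS_BETWEEN_CHECKS_KEY =
          "hive.exec.orc.rows.between.memory.checks";

      private final int rowsBetweenChecks;

      MemoryManager(Configuration conf) {
        // Default to the current hard-coded value so existing jobs
        // behave exactly as they do today.
        this.rowsBetweenChecks = conf.getInt(ROWS_BETWEEN_CHECKS_KEY, 5000);
      }
    }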

Does this suggestion make sense? If so, I can open a JIRA ticket and put some 
code together.

Thank you,

Sean
