Selina Zhang created HIVE-10036:
-----------------------------------

             Summary: Writing ORC format big table causes OOM - too many fixed 
sized stream buffers
                 Key: HIVE-10036
                 URL: https://issues.apache.org/jira/browse/HIVE-10036
             Project: Hive
          Issue Type: Improvement
            Reporter: Selina Zhang
            Assignee: Selina Zhang


The ORC writer keeps multiple output streams for each column. Each output 
stream is allocated a fixed-size ByteBuffer (configurable, 256K by default). 
For a big table, the memory cost is prohibitive, especially when HCatalog 
dynamic partitioning is involved: several hundred files may be open and being 
written at the same time (FileSinkOperator has the same problem). 
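
To give a feel for the scale described above, here is a back-of-envelope 
estimate. Only the 256K default comes from this issue; the writer count, column 
count, and streams-per-column figures are hypothetical inputs chosen for 
illustration, and the class and method names are made up for this sketch.

```java
// Rough estimate of memory reserved by fixed-size stream buffers.
// All inputs except the 256K default are assumptions, not measurements.
public class OrcBufferEstimate {

    /** Bytes reserved up front when every stream gets a fixed-size buffer. */
    public static long fixedBufferBytes(int openWriters, int columns,
                                        int streamsPerColumn, int bufferSize) {
        // Widen to long first so the product cannot overflow an int.
        return (long) openWriters * columns * streamsPerColumn * bufferSize;
    }

    public static void main(String[] args) {
        int bufferSize = 256 * 1024; // default buffer size named in this issue
        int openWriters = 300;       // assumed: "several hundred" open files
        int columns = 200;           // assumed: a wide table
        int streamsPerColumn = 3;    // assumed: a few streams per column

        long bytes = fixedBufferBytes(openWriters, columns,
                                      streamsPerColumn, bufferSize);
        System.out.printf("%.1f GiB reserved before any row is written%n",
                          bytes / (1024.0 * 1024.0 * 1024.0));
    }
}
```

The point of the sketch is only that the cost is a product of four factors, so 
it grows quickly once many partitions are written concurrently.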

The global ORC memory manager controls the buffer size, but it only kicks in 
every 5000 rows. That check could be enhanced, but the underlying problem is 
that reducing the buffer size leads to worse compression and more I/O on the 
read path. Sacrificing read performance is never a good choice. 

I changed the fixed-size ByteBuffer to a dynamically growing buffer whose upper 
bound is the existing configurable buffer size. Most streams do not need a 
large buffer, so performance improved significantly. Compared to 
Facebook's hive-dwrf, I measured a 2x performance gain with this fix. 
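
The idea above can be sketched roughly as follows. This is not the patch 
itself: the class name, the doubling growth policy, and the method names are 
assumptions made for illustration. It only shows a buffer that starts small, 
grows on demand, and never exceeds the configured maximum.

```java
import java.nio.ByteBuffer;

// Illustrative sketch only; not the code attached to HIVE-10036.
// Starts small and doubles on demand, never exceeding maxCapacity.
public class CappedGrowableBuffer {
    private final int maxCapacity;
    private ByteBuffer buf;

    public CappedGrowableBuffer(int initialCapacity, int maxCapacity) {
        if (initialCapacity <= 0 || initialCapacity > maxCapacity) {
            throw new IllegalArgumentException("need 0 < initial <= max");
        }
        this.maxCapacity = maxCapacity;
        this.buf = ByteBuffer.allocate(initialCapacity);
    }

    /**
     * Copies as many bytes as fit under the cap and returns that count.
     * A return value smaller than len means the buffer is full and the
     * caller should flush (for example compress and spill) before retrying.
     */
    public int write(byte[] src, int off, int len) {
        long needed = (long) buf.position() + len;
        ensureCapacity((int) Math.min(needed, maxCapacity));
        int n = Math.min(len, buf.remaining());
        buf.put(src, off, n);
        return n;
    }

    private void ensureCapacity(int required) {
        if (required <= buf.capacity()) {
            return;
        }
        int newCapacity = buf.capacity();
        while (newCapacity < required) {
            // Double, but clamp to the configured upper bound.
            newCapacity = (int) Math.min((long) newCapacity * 2, maxCapacity);
        }
        ByteBuffer bigger = ByteBuffer.allocate(newCapacity);
        buf.flip();      // switch the old buffer to read mode
        bigger.put(buf); // copy the bytes written so far
        buf = bigger;
    }

    public int capacity() { return buf.capacity(); }

    public int size() { return buf.position(); }

    /** Discards the contents but keeps the grown capacity for reuse. */
    public void reset() { buf.clear(); }
}
```

With this shape, a stream that only ever holds a few bytes costs its small 
initial allocation, while a busy stream still tops out at the same configured 
limit as before, so the compression chunk size on the read path is unchanged.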

Solving OOM for ORC completely may take a lot of effort, but this is 
definitely low-hanging fruit. 

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)