[ https://issues.apache.org/jira/browse/HIVE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378176#comment-14378176 ]

Selina Zhang commented on HIVE-10036:
-------------------------------------

The review request:
https://reviews.apache.org/r/32445/

Thank you!

> Writing ORC format big table causes OOM - too many fixed sized stream buffers
> -----------------------------------------------------------------------------
>
>                 Key: HIVE-10036
>                 URL: https://issues.apache.org/jira/browse/HIVE-10036
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Selina Zhang
>            Assignee: Selina Zhang
>         Attachments: HIVE-10036.1.patch, HIVE-10036.2.patch, 
> HIVE-10036.3.patch
>
>
> The ORC writer keeps multiple output streams for each column. Each output 
> stream is allocated a fixed-size ByteBuffer (configurable, default 256K). 
> For a big table, the memory cost is unbearable, especially when HCatalog 
> dynamic partitioning is involved and several hundred files may be open for 
> writing at the same time (the same problem applies to FileSinkOperator). 
> The global ORC memory manager controls the buffer size, but it only kicks 
> in at 5000-row intervals. It could be enhanced, but the problem is that 
> reducing the buffer size leads to worse compression and more I/O on the 
> read path. Sacrificing read performance is never a good choice. 
> I changed the fixed-size ByteBuffer to a dynamically growing buffer bounded 
> above by the existing configurable buffer size. Most streams do not need a 
> large buffer, so performance improved significantly. Compared to Facebook's 
> hive-dwrf, I observed a 2x performance gain with this fix. 
> Completely solving OOM for ORC may take a lot of effort, but this is 
> definitely low-hanging fruit. 
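To make the memory cost concrete: with, say, 200 open files, 50 columns, and 
roughly 3 streams per column at 256K each, the fixed allocation alone is about 
200 * 50 * 3 * 256K ≈ 7.3 GB (illustrative numbers, not from the issue). Below 
is a minimal sketch of the growth strategy the description suggests: a buffer 
that starts small and doubles on demand, capped at the existing configurable 
stream buffer size. The class and field names are hypothetical, not the actual 
HIVE-10036 patch code.

{code:java}
import java.nio.ByteBuffer;

// Hypothetical sketch, not the HIVE-10036 patch: an output buffer that
// starts small and grows geometrically, bounded above by the configured
// ORC stream buffer size (256K by default).
public class GrowableStreamBuffer {
  private static final int INITIAL_CAPACITY = 4 * 1024; // assumed starting size
  private final int maxCapacity;                        // e.g. 256 * 1024
  private ByteBuffer buffer;

  public GrowableStreamBuffer(int maxCapacity) {
    this.maxCapacity = maxCapacity;
    this.buffer = ByteBuffer.allocate(Math.min(INITIAL_CAPACITY, maxCapacity));
  }

  // Make room for n more bytes, doubling capacity up to the cap.
  private void ensureCapacity(int n) {
    if (buffer.remaining() >= n) {
      return;
    }
    int needed = buffer.position() + n;
    if (needed > maxCapacity) {
      // At the cap the caller must flush, just as a full fixed-size
      // buffer would force a flush today.
      throw new IllegalStateException("buffer full; caller must flush");
    }
    int newCapacity =
        Math.min(Math.max(buffer.capacity() * 2, needed), maxCapacity);
    ByteBuffer bigger = ByteBuffer.allocate(newCapacity);
    buffer.flip();      // switch to reading the bytes written so far
    bigger.put(buffer); // copy them into the larger buffer
    buffer = bigger;
  }

  public void put(byte b) {
    ensureCapacity(1);
    buffer.put(b);
  }
}
{code}

Streams that stay small (e.g. PRESENT or LENGTH streams for sparse columns) 
never pay for the full 256K up front, which is where the memory savings come 
from while the upper bound keeps compression and read-path behavior unchanged.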



