[ https://issues.apache.org/jira/browse/HIVE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Selina Zhang updated HIVE-10036:
--------------------------------
    Attachment:     (was: HIVE-10036.8.patch)

> Writing ORC format big table causes OOM - too many fixed sized stream buffers
> -----------------------------------------------------------------------------
>
>                 Key: HIVE-10036
>                 URL: https://issues.apache.org/jira/browse/HIVE-10036
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Selina Zhang
>            Assignee: Selina Zhang
>              Labels: orcfile
>         Attachments: HIVE-10036.1.patch, HIVE-10036.2.patch, HIVE-10036.3.patch, HIVE-10036.5.patch, HIVE-10036.6.patch, HIVE-10036.7.patch
>
> The ORC writer keeps multiple output streams for each column, and each output stream is allocated a fixed-size ByteBuffer (configurable, default 256K). For a big table the memory cost is unbearable, especially when HCatalog dynamic partitioning is involved and several hundred files may be open for writing at the same time (the same problem applies to FileSinkOperator).
> The global ORC memory manager controls the buffer size, but it only kicks in at 5000-row intervals. An enhancement could be made there, but the problem is that reducing the buffer size leads to worse compression and more IO in the read path, and sacrificing read performance is never a good choice.
> I changed the fixed-size ByteBuffer to a dynamically growing buffer bounded by the existing configurable buffer size. Most streams do not need a large buffer, so performance improved significantly: compared to Facebook's hive-dwrf, I measured a 2x performance gain with this fix.
> Solving OOM for ORC completely may take a lot of effort, but this is definitely low-hanging fruit.
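> For illustration, here is a minimal sketch of the idea (class and method names are hypothetical, not the actual patch): a buffer that starts small and doubles on demand, capped at the configured stream buffer size.
>
> {code:java}
> import java.nio.ByteBuffer;
>
> // Sketch only: a write buffer that grows geometrically up to a fixed cap,
> // so short streams never pay for the full 256K allocation.
> public class GrowableStreamBuffer {
>   private static final int INITIAL_SIZE = 4 * 1024; // start small
>   private final int maxSize;                        // configured cap, e.g. 256K
>   private ByteBuffer buffer;
>
>   public GrowableStreamBuffer(int maxSize) {
>     this.maxSize = maxSize;
>     this.buffer = ByteBuffer.allocate(Math.min(INITIAL_SIZE, maxSize));
>   }
>
>   // Grow (doubling, capped at maxSize) until 'needed' more bytes fit.
>   private void ensureCapacity(int needed) {
>     if (buffer.remaining() >= needed) {
>       return;
>     }
>     int newSize = buffer.capacity();
>     while (newSize - buffer.position() < needed && newSize < maxSize) {
>       newSize = Math.min(newSize * 2, maxSize);
>     }
>     if (newSize - buffer.position() < needed) {
>       // At the cap: the real writer would flush/spill here instead.
>       throw new IllegalStateException("stream buffer full at configured cap");
>     }
>     ByteBuffer bigger = ByteBuffer.allocate(newSize);
>     buffer.flip();       // switch to read mode to copy existing bytes
>     bigger.put(buffer);  // copies old contents; position carries over
>     buffer = bigger;
>   }
>
>   public void put(byte[] bytes, int offset, int length) {
>     ensureCapacity(length);
>     buffer.put(bytes, offset, length);
>   }
> }
> {code}
>
> Because growth is capped at the existing configurable size, a stream that genuinely needs the full buffer behaves exactly as before, so the compression and read-path IO characteristics are unchanged.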