[ https://issues.apache.org/jira/browse/HIVE-19206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Prasanth Jayachandran updated HIVE-19206:
-----------------------------------------
    Attachment: HIVE-19206.1.patch

> Automatic memory management for open streaming writers
> ------------------------------------------------------
>
>                 Key: HIVE-19206
>                 URL: https://issues.apache.org/jira/browse/HIVE-19206
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Streaming
>    Affects Versions: 3.0.0, 3.1.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Prasanth Jayachandran
>            Priority: Major
>         Attachments: HIVE-19206.1.patch
>
>
> Problem:
> When hundreds of record updaters are open, the amount of memory required
> by ORC writers keeps growing because of ORC's internal buffers. This can
> lead to high GC pressure or OOM during streaming ingest.
> Solution:
> The high-level idea is for the streaming connection to remember all the open
> record updaters and flush them periodically (at some interval).
> The number of records written to each record updater can be used as a metric
> to determine the candidate record updaters for flushing.
> If the stripe size of the ORC file is 64MB, the default memory management
> check happens only after every 5000 rows, which may be too late when there
> are too many concurrent writers in a process. An example case would be 100
> writers open, each of them holding an almost full stripe of 64MB of buffered
> data; this would take 100 * 64MB ~= 6.4GB of memory. When all of the record
> writers flush, the memory usage drops to 100 * ~2MB, which is just ~200MB.
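
The idea described under "Solution" could be sketched roughly as follows. This is only an illustration, not the attached patch: the class and method names (WriterFlushSketch, TrackedWriter, Connection, flushLargest) are hypothetical stand-ins and do not correspond to Hive or ORC APIs. The sketch assumes the connection keeps a list of its open writers, counts the records buffered in each one, and on every periodic check flushes the writers holding the most buffered records.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class WriterFlushSketch {

    /** Hypothetical stand-in for a record updater; it only counts buffered records. */
    static final class TrackedWriter {
        final String name;
        long bufferedRecords;
        int flushCount;

        TrackedWriter(String name) {
            this.name = name;
        }

        void write() {
            bufferedRecords++;
        }

        /** A real writer would push its buffers to disk here; the sketch resets the counter. */
        void flush() {
            bufferedRecords = 0;
            flushCount++;
        }
    }

    /** Hypothetical stand-in for the streaming connection that remembers open writers. */
    static final class Connection {
        private final List<TrackedWriter> open = new ArrayList<>();

        TrackedWriter openWriter(String name) {
            TrackedWriter w = new TrackedWriter(name);
            open.add(w);
            return w;
        }

        /**
         * Flushes at most maxToFlush writers, picking those with the most
         * buffered records first. Intended to be called at some interval.
         * Returns the writers that were flushed, largest first.
         */
        List<TrackedWriter> flushLargest(int maxToFlush) {
            List<TrackedWriter> candidates = new ArrayList<>(open);
            candidates.sort(Comparator
                .comparingLong((TrackedWriter w) -> w.bufferedRecords)
                .reversed());
            List<TrackedWriter> flushed = new ArrayList<>();
            for (TrackedWriter w : candidates) {
                if (flushed.size() >= maxToFlush || w.bufferedRecords == 0) {
                    break; // limit reached, or nothing left worth flushing
                }
                w.flush();
                flushed.add(w);
            }
            return flushed;
        }
    }

    public static void main(String[] args) {
        Connection conn = new Connection();
        TrackedWriter busy = conn.openWriter("partition-a");
        TrackedWriter quiet = conn.openWriter("partition-b");
        for (int i = 0; i < 5000; i++) {
            busy.write();
        }
        for (int i = 0; i < 10; i++) {
            quiet.write();
        }
        for (TrackedWriter w : conn.flushLargest(1)) {
            System.out.println("flushed " + w.name);
        }
    }
}
```

In a real connection the periodic call would presumably come from a timer or scheduled task, and the candidate metric could be buffered bytes instead of record counts; both choices are left open in the description above.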