[ https://issues.apache.org/jira/browse/ARROW-4713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17661735#comment-17661735 ]
Rok Mihevc commented on ARROW-4713:
-----------------------------------

This issue has been migrated to [issue #21238|https://github.com/apache/arrow/issues/21238] on GitHub. Please see the [migration documentation|https://github.com/apache/arrow/issues/14542] for further details.

> [C++] Improve C++ Orc Adapter performance and memory footprint
> --------------------------------------------------------------
>
> Key: ARROW-4713
> URL: https://issues.apache.org/jira/browse/ARROW-4713
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Yurui Zhou
> Priority: Major
> Labels: orc, pull-request-available
> Time Spent: 7h 20m
> Remaining Estimate: 0h
>
> Currently, Arrow C++ provides a naive adapter implementation that allows
> users to read ORC files into Arrow RecordBatches. However, this
> implementation has several drawbacks:
> * Inefficient conversion that incurs huge memcpy overhead
> ** Currently, the ORC adapter performs a byte-by-byte memcpy to move data
> from an ORC VectorBatch to an Arrow RecordBatch, even though the ORC
> VectorBatch shares the same memory layout as Arrow for most data types.
> * Huge memory footprint because of the lack of a TableReader implementation
> ** The ORC adapter currently only allows users to read data in units of
> stripes. However, because ORC is a columnar format with a high compression
> ratio, the data read from a single ORC stripe can potentially take up
> gigabytes of memory, which makes the ORC adapter hard to use in production
> environments.
> Here we propose a new ORC adapter implementation to fix the issues mentioned
> above:
> * To reduce conversion overhead, instead of performing a naive data copy, the
> new adapter would take full advantage of the memory layout similarity
> between ORC VectorBatch and Arrow RecordBatch. Namely, the new adapter will
> perform pointer manipulation to transfer memory ownership from the
> VectorBatch to the Arrow RecordBatch whenever possible.
> * The new ORC adapter would give users row-level granularity when reading
> data from an ORC file. The user should be able to specify how many rows to
> expect in each output RecordBatch, and the ORC adapter should make sure that
> no more than the requested number of rows is returned.
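
The two proposals in the description can be illustrated with a small standalone sketch. It deliberately uses only the C++ standard library: `DecodedColumn`, `SharedColumn`, `ConvertByCopy`, `ConvertByMove` and `BoundedBatchReader` are hypothetical names invented for this illustration and are not part of the Arrow or ORC APIs. The sketch assumes a single int64 column and contrasts a copying conversion with an ownership transfer, then shows a reader that never returns more than a requested number of rows per call.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a decoded ORC column (not orc::ColumnVectorBatch).
struct DecodedColumn {
  std::vector<int64_t> values;
};

// Hypothetical stand-in for an Arrow-style buffer: shared, read-only storage.
struct SharedColumn {
  std::shared_ptr<const std::vector<int64_t>> storage;
  const int64_t* data() const { return storage->data(); }
  std::size_t length() const { return storage->size(); }
};

// Copying conversion: allocates a second buffer and copies every value.
SharedColumn ConvertByCopy(const DecodedColumn& src) {
  return {std::make_shared<std::vector<int64_t>>(src.values)};
}

// Ownership transfer: the vector's heap storage is moved into shared
// ownership, so no element is copied and the data pointer is unchanged.
SharedColumn ConvertByMove(DecodedColumn&& src) {
  return {std::make_shared<std::vector<int64_t>>(std::move(src.values))};
}

// Hypothetical reader that returns at most `max_rows` rows per call.
// For brevity the rows are held in memory here; a real reader would decode
// them lazily from the file so that only one batch is resident at a time.
class BoundedBatchReader {
 public:
  BoundedBatchReader(std::vector<int64_t> rows, std::size_t max_rows)
      : rows_(std::move(rows)), max_rows_(max_rows) {
    assert(max_rows_ > 0);
  }

  // Returns the next batch; an empty batch means the input is exhausted.
  std::vector<int64_t> Next() {
    const std::size_t n = std::min(max_rows_, rows_.size() - pos_);
    const auto first = rows_.begin() + static_cast<std::ptrdiff_t>(pos_);
    std::vector<int64_t> batch(first, first + static_cast<std::ptrdiff_t>(n));
    pos_ += n;
    return batch;
  }

 private:
  std::vector<int64_t> rows_;
  std::size_t max_rows_;
  std::size_t pos_ = 0;
};
```

In this sketch, `ConvertByMove` leaves the value buffer at its original address, which is the property the proposal relies on when the ORC and Arrow layouts match. `BoundedBatchReader` caps each batch at `max_rows`, so peak memory per call depends on the requested row count rather than on the stripe size.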