[jira] [Created] (ARROW-4713) Improve C++ Orc Adapter performance and memory footprint

Yurui Zhou (JIRA) Thu, 28 Feb 2019 01:03:43 -0800

Yurui Zhou created ARROW-4713:
---------------------------------

             Summary: Improve C++ Orc Adapter performance and memory footprint
                 Key: ARROW-4713
                 URL: https://issues.apache.org/jira/browse/ARROW-4713
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Yurui Zhou



Currently, Arrow C++ provides a naive adapter implementation that allows users 
to read ORC files into Arrow RecordBatches. However, this implementation has 
several drawbacks:
 * Inefficient conversion that incurs huge memcpy overhead
 ** The ORC adapter currently performs a byte-by-byte memcpy to move data 
from the ORC VectorBatch to the Arrow RecordBatch, even though the ORC 
VectorBatch shares the same memory layout as Arrow for most data types.
 * Huge memory footprint because of the lack of a TableReader implementation
 ** The ORC adapter currently only allows users to read data one stripe at a 
time. However, because ORC is a columnar format with a high compression ratio, 
the data read from a single ORC stripe can potentially take up gigabytes of 
memory, which makes the ORC adapter hard to use in production environments.
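
To make the two drawbacks above concrete, here is a minimal sketch in plain 
C++ of the alternative they point toward: decode a bounded number of rows at 
a time, and hand the result out as a non-owning view instead of copying it. 
All names here (FakeStripe, ColumnView, BatchReader, CopyOut) are hypothetical 
stand-ins chosen for illustration; they are not part of the Arrow or ORC APIs, 
and a real adapter would have to deal with nulls, nested types and buffer 
ownership, which this sketch ignores.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a compressed stripe: rows are "decoded" on demand.
class FakeStripe {
 public:
  explicit FakeStripe(std::size_t num_rows) : num_rows_(num_rows) {}
  std::size_t num_rows() const { return num_rows_; }

  // Decode rows [offset, offset + count) into `out`. The row index is used as
  // the value so that results are easy to check.
  void Decode(std::size_t offset, std::size_t count, int64_t* out) const {
    assert(offset + count <= num_rows_);
    for (std::size_t i = 0; i < count; ++i) {
      out[i] = static_cast<int64_t>(offset + i);
    }
  }

 private:
  std::size_t num_rows_;
};

// Hypothetical non-owning view over decoded values (no allocation, no copy).
struct ColumnView {
  const int64_t* data;
  std::size_t length;
};

// The copying path: a second allocation plus a memcpy of every value.
std::vector<int64_t> CopyOut(const ColumnView& view) {
  std::vector<int64_t> out(view.length);
  if (view.length > 0) {
    std::memcpy(out.data(), view.data, view.length * sizeof(int64_t));
  }
  return out;
}

// Reads a stripe in batches of at most `batch_rows` rows, reusing one buffer,
// so peak memory depends on the batch size rather than on the stripe size.
class BatchReader {
 public:
  BatchReader(const FakeStripe& stripe, std::size_t batch_rows)
      : stripe_(stripe), batch_rows_(batch_rows), buffer_(batch_rows) {
    assert(batch_rows > 0);
  }

  // Returns a view over the internal buffer; length 0 signals end of stripe.
  // The view is invalidated by the next call to Next().
  ColumnView Next() {
    const std::size_t remaining = stripe_.num_rows() - position_;
    const std::size_t count = std::min(batch_rows_, remaining);
    stripe_.Decode(position_, count, buffer_.data());
    position_ += count;
    return ColumnView{buffer_.data(), count};
  }

 private:
  const FakeStripe& stripe_;
  std::size_t batch_rows_;
  std::size_t position_ = 0;
  std::vector<int64_t> buffer_;
};
```

A caller would loop on Next() until the returned length is 0. The view path 
is only safe while the reader's buffer is alive and unchanged; CopyOut shows 
the extra allocation and memcpy that the copying path pays for every batch.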

We propose a new ORC adapter implementation to address these issues.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
