[jira] [Commented] (ARROW-4713) [C++] Improve C++ Orc Adapter performance and memory footprint

Rok Mihevc (Jira) Tue, 10 Jan 2023 23:45:36 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-4713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17661735#comment-17661735
 ]


Rok Mihevc commented on ARROW-4713:
-----------------------------------

This issue has been migrated to [issue 
#21238|https://github.com/apache/arrow/issues/21238] on GitHub. Please see the 
[migration documentation|https://github.com/apache/arrow/issues/14542] for 
further details.

> [C++] Improve C++ Orc Adapter performance and memory footprint
> --------------------------------------------------------------
>
>                 Key: ARROW-4713
>                 URL: https://issues.apache.org/jira/browse/ARROW-4713
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Yurui Zhou
>            Priority: Major
>              Labels: orc, pull-request-available
>          Time Spent: 7h 20m
>  Remaining Estimate: 0h
>
> Currently, Arrow C++ provides a naive adapter implementation that allows 
> users to read ORC files into Arrow RecordBatches. However, this implementation 
> has several drawbacks:
>  * Inefficient conversion that incurs significant memcpy overhead
>  ** The ORC adapter currently performs a byte-for-byte memcpy to move data 
> from an ORC VectorBatch to an Arrow RecordBatch, even though the ORC 
> VectorBatch shares the same memory layout as Arrow for most data types.
>  * Large memory footprint due to the lack of a TableReader implementation
>  ** The ORC adapter currently only allows users to read data one stripe at a 
> time. However, because ORC is a columnar format with a high compression ratio, 
> data read from a single ORC stripe can potentially take up gigabytes of 
> memory, which makes the ORC adapter hard to use in production environments.
> Here we propose a new ORC adapter implementation to fix the issues mentioned 
> above:
>  * To reduce conversion overhead, instead of performing a naive data copy, the 
> new adapter would take full advantage of the memory layout similarity 
> between ORC VectorBatch and Arrow RecordBatch. Namely, the new adapter would 
> use pointer manipulation to transfer memory ownership from the VectorBatch 
> to the Arrow RecordBatch whenever possible.
>  * The new ORC adapter would give users row-level granularity when reading 
> data from an ORC file. The user should be able to specify how many rows to 
> expect in each output RecordBatch, and the ORC adapter should ensure that no 
> more than the requested number of rows is returned.
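
The two proposals quoted above can be illustrated with a minimal, self-contained C++ sketch. It uses neither the ORC nor the Arrow API: DecodedColumn, OutputColumn, ConvertByCopy, ConvertByMove and BoundedReader are hypothetical stand-ins invented for this illustration, and a std::vector plays the role of the underlying buffer. The sketch assumes the source and destination share the same memory layout, which the issue says holds for most data types.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for a decoded ORC column (NOT the ORC API).
struct DecodedColumn {
  std::vector<std::int64_t> values;
};

// Hypothetical stand-in for an Arrow-side column (NOT the Arrow API).
struct OutputColumn {
  std::vector<std::int64_t> values;
};

// Copy-based conversion: allocates a second buffer and copies every value.
OutputColumn ConvertByCopy(const DecodedColumn& src) {
  OutputColumn out;
  out.values.assign(src.values.begin(), src.values.end());
  return out;
}

// Ownership transfer: the output takes over the source buffer, so no
// element is copied. The source is left in a valid but unspecified state.
OutputColumn ConvertByMove(DecodedColumn&& src) {
  OutputColumn out;
  out.values = std::move(src.values);
  return out;
}

// Hypothetical reader that returns at most max_rows rows per call,
// instead of materializing a whole stripe at once.
class BoundedReader {
 public:
  BoundedReader(std::vector<std::int64_t> stripe, std::size_t max_rows)
      : stripe_(std::move(stripe)), max_rows_(max_rows) {
    assert(max_rows_ > 0);
  }

  // Fills *out with the next batch; returns false once all rows are consumed.
  bool Next(OutputColumn* out) {
    if (offset_ >= stripe_.size()) return false;
    const std::size_t n = std::min(max_rows_, stripe_.size() - offset_);
    const auto first = stripe_.begin() + static_cast<std::ptrdiff_t>(offset_);
    // Stands in for decoding only n rows of the stripe.
    DecodedColumn decoded;
    decoded.values.assign(first, first + static_cast<std::ptrdiff_t>(n));
    offset_ += n;
    // Hand the decoded buffer to the output without a second copy.
    *out = ConvertByMove(std::move(decoded));
    return true;
  }

 private:
  std::vector<std::int64_t> stripe_;
  std::size_t max_rows_;
  std::size_t offset_ = 0;
};
```

In this sketch, peak memory per call is bounded by max_rows rather than by the stripe size, and the only copy is the one that stands in for decoding. Whether a real ORC buffer can be handed to Arrow this way depends on the data type and on how each library manages buffer ownership, which this example does not model.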



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
