Hi

 

Currently, Arrow C++ provides a naive adapter implementation that allows users 
to read ORC files into Arrow RecordBatches. However, this implementation has 
several drawbacks:

 
1. Inefficient conversion that incurs huge memcpy overhead: the ORC adapter 
currently performs a byte-by-byte memcpy to move data from the ORC VectorBatch 
to the Arrow RecordBatch, despite the fact that the ORC VectorBatch shares the 
same memory layout as Arrow for most data types.

2. Huge memory footprint due to the lack of a TableReader implementation: the 
ORC adapter currently only allows users to read data at the granularity of a 
stripe. However, since ORC is a columnar format with a high compression ratio, 
the data read from a single ORC stripe can take gigabytes of memory, which 
makes the ORC adapter hardly usable in production environments.
Here we propose a new ORC adapter implementation to fix the issues mentioned 
above:
To reduce conversion overhead, instead of performing a naive data copy, the new 
adapter will take full advantage of the memory layout similarity between the 
ORC VectorBatch and the Arrow RecordBatch. Namely, the new adapter will 
manipulate pointers to transfer memory ownership from the VectorBatch to the 
Arrow RecordBatch whenever possible.
The new ORC adapter will provide users with row-level granularity when reading 
data from an ORC file. The user should be able to specify how many rows are 
expected in the output RecordBatch, and the ORC adapter should ensure that no 
more than the requested number of rows is returned.
I opened a Jira here to track the issue. Any advice would be appreciated.

BTW, I tried to assign the Jira to myself, but it looks like I am unable to do 
that. Any idea how I could obtain the permission to perform this operation?

Best regards

Yurui
