Hi Ying, the code in adapter_util.cc doesn't look right to me unless
the data in liborc::ColumnVectorBatch is spaced (has placeholder bytes
where there is a null). We have quite a bit of code in Parquet that
deals specifically with this issue -- I'm not sure if we have a
ready-made function that will efficiently append the "compressed"
values to a builder, but we certainly have all the tools
you need to do so (e.g. the BitRunReader is helpful here).
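
To make "spaced" versus "compressed" concrete, here is a minimal sketch of
expanding a dense value buffer into a spaced one. It uses only the standard
library: SpaceValues is a hypothetical helper, not an Arrow or liborc
function, and it assumes the null mask holds one byte per row with nonzero
meaning "present". It walks the mask row by row rather than in runs, so it
shows the layout and is not the efficient approach.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper for illustration only (not part of Arrow or liborc).
// dense:       the non-null values only, in row order ("compressed" layout).
// not_null:    one byte per row, nonzero when the row has a value.
// placeholder: value written into the slots of null rows.
// Returns a "spaced" buffer with exactly one slot per row.
std::vector<int64_t> SpaceValues(const std::vector<int64_t>& dense,
                                 const std::vector<char>& not_null,
                                 int64_t placeholder = 0) {
  std::vector<int64_t> spaced(not_null.size(), placeholder);
  std::size_t next = 0;  // index of the next unread value in `dense`
  for (std::size_t row = 0; row < not_null.size(); ++row) {
    // Consume a dense value only for rows marked present; the bounds
    // check guards against a mask that claims more values than exist.
    if (not_null[row] != 0 && next < dense.size()) {
      spaced[row] = dense[next++];
    }
  }
  return spaced;
}
```

After this step the value buffer and the mask have the same number of
entries, which is the shape a call like
builder->AppendValues(source, length, valid_bytes) relies on.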

On Sun, Oct 18, 2020 at 12:24 PM Ying Zhou <yzhou7...@gmail.com> wrote:
>
> Hi,
>
> Unlike in Arrow, when an entry is null in ORC it is recorded only in the
> PRESENT stream (equivalent to the validity bitmap in Arrow), not in any DATA
> stream, for any type including numeric types. Hence the notNull (aka PRESENT)
> and data buffers from ORC generally don’t have the same size.
>
> However, according to cpp/src/arrow/adapters/orc/adapter_util.cc
> line 126, it is possible to call
> builder->AppendValues(source, length, valid_bytes) directly, with
> builder being an Int64Builder and source and valid_bytes having different
> sizes, which doesn’t seem reasonable. May I ask whether this is actually
> valid usage of AppendValues? Thanks!
>
>
> Best,
> Ying Zhou
