Hi Ying, the code in adapter_util.cc doesn't look right to me unless the data in liborc::ColumnVectorBatch is spaced (has placeholder bytes where there is a null). We have quite a bit of code in Parquet that deals specifically with this issue -- I'm not sure whether we have a ready-made function that will efficiently append the "compressed" values to a builder, but we certainly have all the tools you need to do so (e.g. the BitRunReader is helpful here).
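
To make the spaced vs. "compressed" distinction concrete, here is a minimal sketch in plain C++ with no Arrow or ORC dependency. ExpandToSpaced is a hypothetical helper, not an existing function in either library, and the sketch assumes the validity information is one byte per slot (non-zero meaning valid) while the values are stored densely, one entry per non-null slot. Whether ORC actually hands back dense data is the open question here, so treat this only as an illustration of the layout mismatch. It also goes slot by slot for clarity; a run-based reader such as BitRunReader would presumably handle whole runs of valid values at once.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of Arrow or ORC).
// `dense` holds one value per NON-NULL slot; `valid_bytes` holds one byte per
// slot (non-zero = valid). Returns a "spaced" vector with one value per slot,
// using 0 as the placeholder wherever the slot is null.
std::vector<int64_t> ExpandToSpaced(const std::vector<int64_t>& dense,
                                    const std::vector<uint8_t>& valid_bytes) {
  std::vector<int64_t> spaced(valid_bytes.size(), 0);
  std::size_t next = 0;  // index of the next unread dense value
  for (std::size_t i = 0; i < valid_bytes.size(); ++i) {
    if (valid_bytes[i] != 0 && next < dense.size()) {
      spaced[i] = dense[next++];
    }
  }
  // Every dense value should have been placed exactly once.
  assert(next == dense.size());
  return spaced;
}
```

With the values expanded this way, the value buffer and valid_bytes have the same length, which is the layout a call like builder->AppendValues(source, length, valid_bytes) expects.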

On Sun, Oct 18, 2020 at 12:24 PM Ying Zhou <yzhou7...@gmail.com> wrote:
>
> Hi,
>
> Unlike Arrow in ORC when an entry is null it is only recorded in the PRESENT
> stream (equivalent to the validity bitmap in Arrow) but not in any DATA
> stream for any type including numeric types. Hence the notNull (aka PRESENT)
> and data buffers from ORC generally don’t have the same size.
>
> However according to cpp/src/arrow/adaptes/orc/adapter_util.cc
> <http://adapter_util.cc/> line 126 it is possible to directly use
> AppendValues to call builder->AppendValues(source, length, valid_bytes) with
> builder being an Int64Builder with source and valid_bytes having different
> sizes which doesn’t seem to be reasonable. May I ask whether this is actually
> valid usage of AppendValues? Thanks!
>
>
> Best,
> Ying Zhou