Hi Antoine and all!

Sorry for the delay, I wanted to understand things a bit better before
getting back to you. As discussed, I focused on the Parquet case and looked
into parquet/encoding.cc to see what could be done to improve memory
reservation for ByteArrays.

On my journey, I noticed a few things:
- The Parquet dictionary is first deserialized into an array of views, then
converted to an Arrow-style variable/fixed-size binary array (here
<https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1628-L1652>
and here
<https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1657-L1670>).
This latter conversion is done even when it is not used afterwards; in fact
it seems to be used only here
<https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1840-L1845>
(never for fixed-size binary, and not for variable-size binary either when
the target Arrow array is not dictionary-encoded). So if you confirm my
"discoveries", we could spare ourselves some dictionary conversions by
doing them lazily when *InsertDictionary* is called (see the sketch after
this list).
- It seems that DictByteArrayDecoderImpl::DecodeArrowNonNull
<https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L2030>
and DictByteArrayDecoderImpl::DecodeArrow
<https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1974>
have been short-circuited inside
ByteArrayDictionaryRecordReader::ReadValuesDense
<https://github.com/apache/arrow/blob/6c721c579f7d279aa006bfff9b701f8a2a6fe50d/cpp/src/parquet/column_reader.cc#L1532>
and ByteArrayDictionaryRecordReader::ReadValuesSpaced
<https://github.com/apache/arrow/blob/6c721c579f7d279aa006bfff9b701f8a2a6fe50d/cpp/src/parquet/column_reader.cc#L1548>
(dictionary-encoded data pages are always RLE in Parquet). Does that mean
they are dead code now? The DecodeArrow methods are more or less public,
but it is confusing that Arrow itself no longer calls them ;-)
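
To make the first point concrete, here is a minimal sketch of what I mean by
lazy conversion. This is NOT the real parquet/encoding.cc code: the class
name, the view member and the SetDict/InsertDictionary signatures below are
simplified placeholders of my own; only BinaryDictionary32Builder and its
InsertMemoValues call correspond to what the linked InsertDictionary site
actually uses, as far as I can tell.

    #include <memory>
    #include <string_view>
    #include <vector>

    #include "arrow/api.h"

    // Hypothetical decoder: keeps only cheap views of the dictionary page and
    // converts them to an Arrow binary array the first time it is needed.
    class LazyByteArrayDictDecoder {
     public:
      // Called when the dictionary page is decoded: store the views only,
      // skip the Arrow conversion entirely.
      void SetDict(std::vector<std::string_view> views) {
        views_ = std::move(views);
        arrow_dictionary_.reset();  // drop any stale converted dictionary
      }

      // Called only when the target Arrow array is dictionary-encoded:
      // build the Arrow BinaryArray on first use, then memoize it.
      ::arrow::Status InsertDictionary(::arrow::BinaryDictionary32Builder* builder) {
        if (arrow_dictionary_ == nullptr) {
          ::arrow::BinaryBuilder dict_builder;
          for (const auto& v : views_) {
            ARROW_RETURN_NOT_OK(dict_builder.Append(
                reinterpret_cast<const uint8_t*>(v.data()),
                static_cast<int32_t>(v.size())));
          }
          ARROW_RETURN_NOT_OK(dict_builder.Finish(&arrow_dictionary_));
        }
        return builder->InsertMemoValues(*arrow_dictionary_);
      }

     private:
      std::vector<std::string_view> views_;               // borrowed dictionary values
      std::shared_ptr<::arrow::Array> arrow_dictionary_;  // built lazily, once
    };

With something like this, plain (non-dictionary) reads would never pay for
the conversion at all.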

Finally, regarding your idea (@antoine) of using a simple heuristic to
pre-allocate the array, I'm not totally comfortable (maybe wrongly) with
doing the over-allocation at that level, because I'm not sure what the exact
contract of the memory pool and its underlying allocator is. Does that
interface guarantee that the allocated memory will not be zeroed out, or
touched in some similar fashion that would commit physical memory? With my
suggestion of using mmap for huge allocations, I am working at a level where
I know for sure that no physical memory is mapped, because we never touch
the reserved range. But as you noticed, at that level I have less knowledge
about the context, so it is very hard to decide when to trigger the "runway"
pre-allocation and when not to.

Hope this all makes sense. Took me a while to understand how the decoding
works ;-)

Remi



On Fri, Jun 5, 2020 at 17:20, Antoine Pitrou <anto...@python.org> wrote:

>
> On 05/06/2020 at 17:09, Rémi Dettai wrote:
> > I looked into the details of why the decoder could not estimate the
> target
> > Arrow array size for my Parquet column. It's because I am decoding from
> > Parquet-Dictionary to Arrow-Plain (which is the default when loading
> > Parquet). In this case the size prediction is impossible :-(
>
> But we can probably make up a heuristic.  For example
>    avg(dictionary value size) * number of non-null values
>
> It would avoid a number of resizes, even though there may still be a
> couple of them at the end.  It may oversize in some cases, but much less
> than your proposed strategy of reserving a huge chunk of virtual memory :-)
>
> Regards
>
> Antoine.
>
