Hi Antoine and all! Sorry for the delay; I wanted to understand things a bit better before getting back to you. As discussed, I focused on the Parquet case. I've looked into parquet/encoding.cc to see what could be done to get better memory reservation for ByteArrays.
On my journey, I noticed a few things:

- The Parquet dictionary is first deserialized into an array of views, then
  converted to an Arrow-style variable/fixed size binary array (here
  <https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1628-L1652>
  / here
  <https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1657-L1670>).
  This latter conversion is done even when it is not used afterwards; in fact
  it seems to be used only here
  <https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1840-L1845>
  (never for fixed size binary, and not for variable size binary either if the
  target Arrow array is not dict encoded). So if you confirm my "discoveries",
  we could spare ourselves some dictionary conversions by doing them lazily
  when *InsertDictionary* is called.

- It seems that DictByteArrayDecoderImpl::DecodeArrowNonNull
  <https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L2030>
  and DictByteArrayDecoderImpl::DecodeArrow
  <https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1974>
  have been short-circuited inside ByteArrayDictionaryRecordReader::ReadValuesDense
  <https://github.com/apache/arrow/blob/6c721c579f7d279aa006bfff9b701f8a2a6fe50d/cpp/src/parquet/column_reader.cc#L1532>
  and ByteArrayDictionaryRecordReader::ReadValuesSpaced
  <https://github.com/apache/arrow/blob/6c721c579f7d279aa006bfff9b701f8a2a6fe50d/cpp/src/parquet/column_reader.cc#L1548>
  (dictionary-encoded data pages are always RLE in Parquet). That would mean
  they are dead code now, wouldn't it? Well, the DecodeArrow methods are more
  or less public, but it is quite confusing that they are not used by Arrow
  itself ;-)

Finally, regarding your idea (@Antoine) of using a simple heuristic to
pre-allocate the array: I'm not totally comfortable (maybe wrongly) with doing
the over-allocation at that level, because I'm not sure what the true contract
of the memory pool and its associated allocator is. Does this interface
guarantee that the allocated memory will not be zeroed out or touched in some
similar fashion that would trigger physical memory mapping? I've put a rough
sketch of how I understand your heuristic below the quoted message.

With my suggestion of using mmap to do huge allocations (also sketched below),
I am positioning myself at a level where I know for sure that the underlying
physical memory is not mapped, because we never touch it. But as you noticed,
I have less knowledge about the context at that level, so it's very hard to
decide when to trigger the "runway" pre-allocation and when not to.

Hope this all makes sense. It took me a while to understand how the decoding
works ;-)

Remi

On Fri, Jun 5, 2020 at 17:20, Antoine Pitrou <anto...@python.org> wrote:

>
> On 05/06/2020 at 17:09, Rémi Dettai wrote:
> > I looked into the details of why the decoder could not estimate the
> > target Arrow array size for my Parquet column. It's because I am
> > decoding from Parquet-Dictionary to Arrow-Plain (which is the default
> > when loading Parquet). In this case the size prediction is impossible :-(
>
> But we can probably make up a heuristic. For example
> avg(dictionary value size) * number of non-null values
>
> It would avoid a number of resizes, even though there may still be a
> couple of them at the end. It may oversize in some cases, but much less
> than your proposed strategy of reserving a huge chunk of virtual memory :-)
>
> Regards
>
> Antoine.
>
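PS: to make sure we mean the same thing, here is roughly how I picture your heuristic, as a C++ sketch against the Arrow builder API. The function and parameter names are made up for illustration (none of this exists in parquet/encoding.cc); in the real decoder the numbers would come from the decoded dictionary page and the definition levels.

    #include <cstdint>

    #include "arrow/builder.h"
    #include "arrow/status.h"

    // Hypothetical helper, not actual Arrow/Parquet code: reserve space in a
    // BinaryBuilder using avg(dictionary value size) * number of non-null values.
    arrow::Status ReserveWithDictionaryHeuristic(int64_t dict_byte_size,
                                                 int64_t dict_length,
                                                 int64_t num_non_null,
                                                 arrow::BinaryBuilder* builder) {
      if (dict_length <= 0) {
        return arrow::Status::OK();
      }
      const double avg_value_size =
          static_cast<double>(dict_byte_size) / static_cast<double>(dict_length);
      const auto estimated_bytes =
          static_cast<int64_t>(avg_value_size * static_cast<double>(num_non_null));
      // Reserve the offsets/validity slots for the expected number of values...
      ARROW_RETURN_NOT_OK(builder->Reserve(num_non_null));
      // ...and the estimated size of the value data buffer. The builder can still
      // grow past this if the estimate turns out to be too small.
      return builder->ReserveData(estimated_bytes);
    }

If that matches what you had in mind, my worry is only about how those reserved bytes behave in the memory pool (zeroing, poisoning, etc.), not about the estimate itself.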
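And here is the bare-bones version of the mmap "runway" idea I was referring to, with Linux-style flags. It is deliberately independent of Arrow's MemoryPool API, and the 64 GiB figure is just an arbitrary example:

    #include <sys/mman.h>

    #include <cstddef>
    #include <cstdio>

    int main() {
      // Reserve 64 GiB of *address space*. As long as nothing writes into the
      // range, no physical page is committed.
      const std::size_t kRunway = 64ULL * 1024 * 1024 * 1024;
      void* addr = mmap(nullptr, kRunway, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
      if (addr == MAP_FAILED) {
        std::perror("mmap");
        return 1;
      }
      // ... decode into `addr`: only the pages actually touched get backed by
      // physical memory. Once the final size is known, the unused tail can be
      // released with munmap (page-aligned).
      munmap(addr, kRunway);
      return 0;
    }

The catch, as you pointed out, is that this sits below the level where we know whether a column is big enough to deserve such a reservation.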