Hi all,

I was wondering if anyone could elaborate on why the default maximum row group 
length is set to 
67108864<https://github.com/apache/arrow/blob/5c936560c1da003baf714d67dc92f25670730c84/cpp/src/parquet/properties.h#L97>.
From Apache Parquet's documentation, the recommended row group size is between
512 MB and 1 GB<https://parquet.apache.org/documentation/latest/>. For a
Float64Array whose length is 67108864, I believe its in-memory size would be
approximately 545 MB (8 bytes per value plus a validity bitmap), which is on
the low end of that interval.
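
For reference, the size estimate can be reproduced with a rough sketch like the one below. It assumes the in-memory Arrow layout for a Float64 column (8 bytes per value plus a 1-bit-per-value validity bitmap) and ignores Parquet encoding and compression, so the on-disk row group could be considerably smaller. The helper name float64_column_bytes is only illustrative and is not part of Arrow.

```python
# Rough size estimate for one Float64 column holding a full default row group.
# Assumption: in-memory Arrow layout (8-byte values plus a 1-bit-per-value
# validity bitmap); Parquet encoding and compression are ignored.

MAX_ROW_GROUP_LENGTH = 64 * 1024 * 1024  # 67108864 rows


def float64_column_bytes(num_rows: int, with_validity: bool = True) -> int:
    """Return the approximate buffer size in bytes (illustrative helper)."""
    data_bytes = num_rows * 8  # one 8-byte double per row
    validity_bytes = (num_rows + 7) // 8 if with_validity else 0  # 1 bit per row
    return data_bytes + validity_bytes


total = float64_column_bytes(MAX_ROW_GROUP_LENGTH)
print(f"{total} bytes = {total / 1e6:.1f} MB = {total / 2**20:.0f} MiB")
```

Under these assumptions the data buffer alone is 536870912 bytes (512 MiB), and adding the validity bitmap gives 545259520 bytes.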

Is there a particular reason why 67108864 was chosen as the maximum row group
length? I experimented with setting the default maximum row group length to
larger values and noticed that pyarrow cannot read Parquet files containing
row groups whose lengths exceed 2147483647 rows (int32 max). However, I was
able to read these files using the Arrow C++ API.


Best,
Sarah

