[ https://issues.apache.org/jira/browse/ARROW-4542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17661564#comment-17661564 ]
Rok Mihevc commented on ARROW-4542:
-----------------------------------

This issue has been migrated to [issue #21090|https://github.com/apache/arrow/issues/21090] on GitHub. Please see the [migration documentation|https://github.com/apache/arrow/issues/14542] for further details.

> [C++][Parquet] Denominate row group size in bytes (not in no of rows)
> ---------------------------------------------------------------------
>
>                 Key: ARROW-4542
>                 URL: https://issues.apache.org/jira/browse/ARROW-4542
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Remek Zajac
>            Priority: Major
>
> Both the C++ [implementation of the Parquet writer for Arrow|https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.cc#L1174] and the [Python code bound to it|https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pyx#L911] appear to denominate the row group size in the *number of rows*, without making this very explicit. Whereas:
>
> (1) [The Apache Parquet documentation|https://parquet.apache.org/documentation/latest/] states: "_Row group size: Larger row groups allow for larger column chunks which makes it possible to do larger sequential IO. Larger groups also require more buffering in the write path (or a two pass write). *We recommend large row groups (512MB - 1GB)*. Since an entire row group might need to be read, we want it to completely fit on one HDFS block. Therefore, HDFS block sizes should also be set to be larger. An optimized read setup would be: 1GB row groups, 1GB HDFS block size, 1 HDFS block per HDFS file._"
>
> (2) The reference Apache [parquet-mr implementation|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordWriter.java#L146] for Java accepts the row group size expressed in bytes.
>
> (3) The [low-level parquet read-write example|https://github.com/apache/arrow/blob/master/cpp/examples/parquet/low-level-api/reader-writer2.cc#L88] also treats the row group size as denominated in bytes.
>
> These observations lead me to conclude that:
> * Per the Parquet design, and to take advantage of HDFS block-level operations, it only makes sense to express row group sizes in bytes, as that is the only quantity of consequence the caller can actually want to control.
> * Arrow's ParquetWriter implementation would benefit from re-denominating its `row_group_size` in bytes. I will also note that it is impossible to use pyarrow to shape equally byte-sized row groups, because the size a row group ends up occupying is post-compression, while the caller only knows how much uncompressed data they have put in.
>
> Now, my conclusions may be wrong and I may be blind to some line of reasoning, so this ticket is more of a question than a bug report: a question of whether the audience here agrees with my reasoning and, if not, a request to explain what detail I have missed.
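For illustration, below is a minimal pyarrow sketch (the example table, file names, and target size are made up) of what the current rows-denominated `row_group_size` means in practice, and of the rough workaround a caller is left with when they want a byte-denominated target: estimate an average uncompressed row width from `Table.nbytes` and convert the byte target into a row count. Because the estimate is pre-compression, the on-disk row groups will generally land well below the target, which is exactly the limitation described in the ticket.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A small example table (column names and sizes are arbitrary).
n = 1_000_000
table = pa.table({
    "id": pa.array(range(n), type=pa.int64()),
    "value": pa.array([float(i) for i in range(n)], type=pa.float64()),
})

# In pyarrow, row_group_size counts ROWS per row group, not bytes.
# Passing 512 * 1024 * 1024 here would ask for ~512 million rows per group.
pq.write_table(table, "rows_based.parquet", row_group_size=100_000)

# Workaround sketch: approximate a byte-denominated target by estimating the
# average in-memory row width and converting the target into a row count.
# Table.nbytes measures uncompressed Arrow buffers, so once Parquet encoding
# and compression are applied the on-disk row groups will usually be smaller.
target_row_group_bytes = 512 * 1024 * 1024  # e.g. the 512MB-1GB recommendation
avg_row_bytes = table.nbytes / table.num_rows
rows_per_group = max(1, int(target_row_group_bytes / avg_row_bytes))
pq.write_table(table, "approx_bytes_based.parquet", row_group_size=rows_per_group)
```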