> What are the tradeoffs between a low and a large row group size?

I can give some perspective from the C++ work I've been doing; I believe this
inspired some of the recommendations Jon is referring to. At the moment we
have a number of limitations that aren't limitations of the format itself,
but rather of the Arrow C++ library and how we consume the data.
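For concreteness, the knob being discussed looks roughly like this from
Python. This is a hedged sketch of my own (the file name, column, and sizes
are made up), not something taken from the thread:

    # Write a Parquet file with an explicit cap on rows per row group.
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"x": list(range(1_000_000))})

    # Smaller row groups give readers more chances to skip data via
    # statistics, at the cost of more metadata to parse; larger row groups
    # compress/dict-encode better but force readers to buffer bigger units.
    pq.write_table(table, "example.parquet", row_group_size=100_000)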
The most significant issue is memory pressure, especially in the C++ dataset
scanner. You can see an example here [1]. This isn't entirely inevitable,
though, even if row groups are large. The main problem there is that the C++
dataset scanner implements readahead as "read ahead X row groups" rather than
"read ahead X bytes". I have a ticket open [2] that should improve the
dataset scanner's ability to work with files that have large row groups, but
I'm not actively working on it. That should generally allow for larger row
groups without blowing up RAM when running queries through the C++ compute
layer.

In addition, we may investigate better mechanisms for "sliced reads" of both
Parquet and IPC. For example, the Arrow C++ Parquet reader can only read an
entire row group at a time. When my input is S3 I am often scanning multiple
batches at once. If a row group is 1 GB and I am scanning ten files in
parallel, then I end up with 10 GB of "cached data" that I then slowly scan
through. This is wasteful from a RAM perspective if I'm trying to do
streaming processing. Parquet could support reads at data-page resolution
(the underlying parquet-cpp library may actually have this; I haven't looked
too deeply yet), which would solve this problem. The IPC format could support
a similar concept at any desired resolution. Some work would be needed in
both cases to make sure we only read and decode the metadata once.

Another way to solve the above problem is to rely more on parallel reads
within a single record batch and deprioritize reading multiple batches or
multiple files at once. But that can get tricky to manage if you have a large
number of small files.

> Is it that a low value allows for quicker random access (as we can seek
> row groups based on the number of rows they have), while a larger value
> allows for higher dict-encoding and compression ratios?

Related to this is the idea of push-down predicates. If the row groups are
smaller, then there is a greater chance that you can skip entire row groups
based on their statistics. This only applies to Parquet, since we don't have
any group statistics in the IPC format. Another way to tackle this problem
would be to rely more on data-page statistics; however, the C++ Arrow Parquet
reader doesn't support using page-level statistics for push-down predicates.

Finally, one other issue that comes into play is the width of your data.
Really wide datasets (e.g. tens of thousands of columns) suffer from having
rather large metadata blocks. If your row groups start to get small, then you
end up spending a lot of time parsing metadata and much less time actually
reading data.

[1] https://issues.apache.org/jira/browse/ARROW-14736
[2] https://issues.apache.org/jira/browse/ARROW-14648
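To make the row-group-granularity and statistics points above concrete, here
is a rough Python illustration of my own (it assumes a file like the one in
the earlier sketch; the predicate "x > 500_000" is made up), not code from
any of the tickets:

    # Iterate row groups, skipping those whose statistics rule them out.
    import pyarrow.parquet as pq

    pf = pq.ParquetFile("example.parquet")

    for i in range(pf.metadata.num_row_groups):
        stats = pf.metadata.row_group(i).column(0).statistics
        # Skip a whole row group when its max rules out the predicate; this
        # only pays off when row groups are small enough for min/max to be
        # selective.
        if stats is not None and stats.has_min_max and stats.max <= 500_000:
            continue
        # Otherwise the smallest unit we can materialize is the entire
        # row group.
        chunk = pf.read_row_group(i, columns=["x"])
        print(i, chunk.num_rows)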
On Wed, Nov 17, 2021 at 10:22 AM Jorge Cardoso Leitão
<jorgecarlei...@gmail.com> wrote:
>
> What are the tradeoffs between a low and a large row group size?
>
> Is it that a low value allows for quicker random access (as we can seek row
> groups based on the number of rows they have), while a larger value allows
> for higher dict-encoding and compression ratios?
>
> Best,
> Jorge
>
> On Wed, Nov 17, 2021 at 9:11 PM Jonathan Keane <jke...@gmail.com> wrote:
> >
> > This doesn't address the large-number-of-row-groups ticket that was
> > raised, but for some visibility: there is some work to change the row
> > group sizing based on the size of the data instead of a static number of
> > rows [1], as well as exposing a few more knobs to tune [2].
> >
> > There is a bit of prior art in the R implementation for attempting to
> > pick a reasonable row group size based on the shape of the data
> > (basically, it aims for row groups that hold 250 million cells) [3].
> >
> > [1] https://issues.apache.org/jira/browse/ARROW-4542
> > [2] https://issues.apache.org/jira/browse/ARROW-14426 and
> >     https://issues.apache.org/jira/browse/ARROW-14427
> > [3] https://github.com/apache/arrow/blob/641554b0bcce587549bfcfd0cde3cb4bc23054aa/r/R/parquet.R#L204-L222
> >
> > -Jon
> >
> > On Wed, Nov 17, 2021 at 4:35 AM Joris Van den Bossche
> > <jorisvandenboss...@gmail.com> wrote:
> > >
> > > In addition, would it be useful to be able to change this
> > > max_row_group_length from Python?
> > >
> > > Currently that writer property can't be changed from Python; you can
> > > only specify the row_group_size (chunk_size in C++) when writing a
> > > table, but that's currently only useful to set it to something smaller
> > > than the max_row_group_length.
> > >
> > > Joris
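For what it's worth, the 250-million-cell heuristic Jon mentions could be
sketched in Python roughly as below. This is my own illustration, not the
actual R implementation (which lives in the parquet.R file linked above):

    # Pick a row-group length so each group holds roughly 250 million cells:
    # wide tables get shorter row groups, narrow tables get longer ones.
    import pyarrow as pa
    import pyarrow.parquet as pq

    TARGET_CELLS_PER_ROW_GROUP = 250_000_000

    def suggested_row_group_size(table: pa.Table) -> int:
        return max(1, TARGET_CELLS_PER_ROW_GROUP // max(1, table.num_columns))

    table = pa.table({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
    pq.write_table(table, "sized.parquet",
                   row_group_size=suggested_row_group_size(table))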