> What are the tradeoffs between a low and a large row group size?

I can give some perspective from the C++ work I've been doing.  I
believe this inspired some of the recommendations Jon is referring to.
At the moment we have a number of limitations that aren't limitations
of the format itself but rather limitations of the Arrow C++ library
and how we consume the data.

The most significant issue is memory pressure, especially in the C++
dataset scanner.  You can see an example here[1].  This isn't
entirely inevitable, though, even if row groups are large.  The main
problem there is that the C++ dataset scanner implements readahead on
a "readahead X # of row groups" basis and not a "readahead X # of
bytes" basis.  I have a ticket in place[2] which will hopefully
improve the dataset scanner's ability to work with files that have
large row groups, but I'm not actively working on it.  That should
generally allow for larger row groups without blowing up RAM when
running queries through the C++ compute layer.
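
To make the readahead issue concrete, here is a rough sketch using
the Python dataset bindings (the bucket path and sizes are
hypothetical; the point is that peak memory tracks the number of row
groups read ahead, not a byte budget):

    import pyarrow.dataset as ds

    dataset = ds.dataset("s3://my-bucket/big-table/", format="parquet")

    # The scanner reads ahead a fixed *number* of row groups, not a
    # fixed number of bytes.  With 1GB row groups and a readahead of
    # N groups, peak memory is roughly N * 1GB, regardless of the
    # batch_size used to slice the output.
    scanner = dataset.scanner(batch_size=64 * 1024)

    for batch in scanner.to_batches():
        pass  # streaming consumer; RAM is dominated by the readahead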

In addition, we may investigate better mechanisms for "sliced reads"
of both Parquet and IPC.  For example, the Arrow C++ Parquet reader
can only read an entire row group at a time.  When my input is S3 I
am often scanning multiple batches at once.  If a row group is 1GB
and I am scanning ten files in parallel then I end up with 10GB of
"cached data" that I then slowly scan through.  This is wasteful from
a RAM perspective if I'm trying to do streaming processing.  Parquet
could support reads at data-page resolution (the underlying
parquet-cpp library may actually have this; I haven't looked too
deeply yet), which would solve this problem.  The IPC format could
also support a similar concept at any desired resolution.  In both
cases, some work could be done to make sure we only read / decode the
metadata once.
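
As a point of reference, this is roughly what the Python bindings
expose today (file and column names are hypothetical): the smallest
unit you can ask the Parquet reader for is a full row group.

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("large_rowgroups.parquet")

    # Decodes the *entire* row group into memory, even if the
    # consumer only wants to stream through it in small slices.
    table = pf.read_row_group(0, columns=["a", "b"])

    # iter_batches yields smaller record batches, but per the above
    # the C++ reader still works a row group at a time, so this
    # doesn't get peak memory down to data-page granularity.
    for batch in pf.iter_batches(batch_size=10_000, columns=["a", "b"]):
        pass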

Another way to solve the above problem is to rely on more parallel
reads within a single record batch and to deprioritize reading
multiple batches or multiple files at once.  But that can get tricky
to manage if you have a large number of small files.

> Is it that a low value allows for quicker random access (as we can seek row
> groups based on the number of rows they have), while a larger value allows
> for higher dict-encoding and compression ratios?

Related to this is the idea of predicate push-down.  If the row
groups are smaller, then there is a greater chance that you can skip
entire row groups based on their statistics.  This only applies to
Parquet, since we don't have any row group statistics in the IPC
format.  Another way to tackle this problem would be to rely more on
data page statistics.  However, the C++ Arrow Parquet reader doesn't
currently support using page-level statistics for push-down
predicates.
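
For illustration, this is roughly what row-group-level push-down
amounts to, done by hand from Python (file, column, and predicate are
hypothetical, and a flat schema is assumed); with smaller row groups
the min/max ranges are tighter and more groups can be skipped:

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("events.parquet")
    col_idx = pf.schema_arrow.get_field_index("event_id")
    target = 123456  # hypothetical predicate: event_id == target

    keep = []
    for rg in range(pf.metadata.num_row_groups):
        stats = pf.metadata.row_group(rg).column(col_idx).statistics
        # Keep a row group only if its min/max range could contain
        # the value; without statistics we must read it to be safe.
        if stats is None or not stats.has_min_max or \
                stats.min <= target <= stats.max:
            keep.append(rg)

    table = pf.read_row_groups(keep, columns=["event_id", "value"])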

Finally, one other issue that comes into play is the width of your
data.  Really wide datasets (e.g. tens of thousands of columns)
suffer from having rather large metadata blocks.  If your row groups
start to get small then you end up spending a lot of time parsing
metadata and much less time actually reading data.
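
A quick way to see this overhead from Python (hypothetical file name)
is to look at the footer size, which grows with the number of columns
times the number of row groups:

    import pyarrow.parquet as pq

    md = pq.ParquetFile("very_wide.parquet").metadata
    # Thrift footer that must be read and parsed before any data.
    print(md.num_columns, md.num_row_groups, md.serialized_size)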

[1] https://issues.apache.org/jira/browse/ARROW-14736
[2] https://issues.apache.org/jira/browse/ARROW-14648

On Wed, Nov 17, 2021 at 10:22 AM Jorge Cardoso Leitão
<jorgecarlei...@gmail.com> wrote:
>
> What are the tradeoffs between a low and a large row group size?
>
> Is it that a low value allows for quicker random access (as we can seek row
> groups based on the number of rows they have), while a larger value allows
> for higher dict-encoding and compression ratios?
>
> Best,
> Jorge
>
>
>
>
> On Wed, Nov 17, 2021 at 9:11 PM Jonathan Keane <jke...@gmail.com> wrote:
>
> > This doesn't address the large number of row groups ticket that was
> > raised, but for some visibility: there is some work to change the row
> > group sizing based on the size of data instead of a static number of
> > rows [1] as well as exposing a few more knobs to tune [2]
> >
> > There is a bit of prior art in the R implementation for attempting to
> > get a reasonable row group size based on the shape of the data
> > (basically, aims to have row groups that have 250 Million cells in
> > them). [3]
> >
> > [1] https://issues.apache.org/jira/browse/ARROW-4542
> > [2] https://issues.apache.org/jira/browse/ARROW-14426 and
> > https://issues.apache.org/jira/browse/ARROW-14427
> > [3]
> > https://github.com/apache/arrow/blob/641554b0bcce587549bfcfd0cde3cb4bc23054aa/r/R/parquet.R#L204-L222
> >
> > -Jon
> >
> > On Wed, Nov 17, 2021 at 4:35 AM Joris Van den Bossche
> > <jorisvandenboss...@gmail.com> wrote:
> > >
> > > In addition, would it be useful to be able to change this
> > max_row_group_length
> > > from Python?
> > > Currently that writer property can't be changed from Python, you can only
> > > specify the row_group_size (chunk_size in C++)
> > > when writing a table, but that's currently only useful to set it to
> > > something that is smaller than the max_row_group_length.
> > >
> > > Joris
> >
