> I found that the RecordBatchReader reads fewer rows at a time than each
> row_group contains, meaning that a row_group needs to be read twice by
> RecordBatchReader. So what is the default batch size for RecordBatchReader?
There are a few different places a row group could get fragmented.

If you are reading from a file directly with parquet::arrow::FileReader, then the relevant property is ArrowReaderProperties::batch_size, which defaults to 64Ki rows.

If you are reading with a scanner, then the scanner has its own batch size (arrow::dataset::ScanOptions::batch_size), and the scanner will set ArrowReaderProperties::batch_size to match it. In 7.0.0 the default is 1Mi rows; in 8.0.0 the default was changed to 128Ki rows.

If you are then using that scanner as a source for an execution plan, there is eventually going to be a batch size for the execution plan as well. It's unclear exactly what the default is (ideally we'd like these batches to fit into L2 cache) and it may not be configurable (for example, it may change dynamically depending on how many columns you have). The sink nodes could probably be extended to support an output batch size, but we don't have that yet.

> meaning that a row_group needs to be read
> twice by RecordBatchReader

None of these methods will read the row group twice. Any time a batch size is smaller than the physical row group size, we read the row group once and slice it (slicing is a zero-copy, fairly fast operation).

> Also, any good advice if I have to follow the row_group?

Set the batch size to a really large value and don't use the execution engine (see the sketches below). Do you have a use case for trying to maintain the row group size? Large row group sizes sometimes make sense for storage. For in-memory processing, however, a large batch size usually leads to inefficient use of cache and RAM.

> Finally, is there a performance difference
> between ToTable() and ReadNext()?

As Aldrin pointed out, ToTable is going to buffer the entire file in memory before it returns one table with all the contents of the file. When working with large data this is usually not a good approach, but for smaller datasets it can be a nice convenience.

ReadNext should always be as fast as, or faster than, ToTable, assuming you are processing the data quickly enough. When you create a RecordBatchReader from a scanner, background threads start working to fill a queue with data. Each call to ReadNext grabs an item from that queue and makes room for more data to be read. If you don't call ReadNext often enough, the queue will fill up and the reader will pause.
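To make the first case concrete, here is a minimal sketch against the Arrow 7.x C++ API (exact signatures differ slightly across versions; the file path and batch size are placeholders) of reading a local Parquet file with an explicit ArrowReaderProperties::batch_size:

```cpp
#include <memory>
#include <numeric>
#include <string>
#include <vector>

#include <arrow/io/file.h>
#include <arrow/record_batch.h>
#include <arrow/result.h>
#include <arrow/status.h>
#include <parquet/arrow/reader.h>

arrow::Status ReadWithBatchSize(const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto infile, arrow::io::ReadableFile::Open(path));

  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(
      parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));

  // Overrides ArrowReaderProperties::batch_size (default 64Ki rows).
  reader->set_batch_size(64 * 1024);

  // Ask for a reader over every row group in the file.
  std::vector<int> row_groups(reader->num_row_groups());
  std::iota(row_groups.begin(), row_groups.end(), 0);
  std::unique_ptr<arrow::RecordBatchReader> batch_reader;
  ARROW_RETURN_NOT_OK(reader->GetRecordBatchReader(row_groups, &batch_reader));

  // Batches come out at most batch_size rows long; a row group larger than
  // batch_size is read once and sliced, not read twice.
  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(batch_reader->ReadNext(&batch));
    if (batch == nullptr) break;  // end of stream
    // ... process `batch` ...
  }
  return arrow::Status::OK();
}
```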
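And a sketch of the scanner case, assuming `dataset` is an already-constructed arrow::dataset::Dataset (e.g. a FileSystemDataset over your S3 files). It also shows the streaming ReadNext pattern next to the buffered ToTable alternative:

```cpp
#include <memory>

#include <arrow/dataset/dataset.h>
#include <arrow/dataset/scanner.h>
#include <arrow/record_batch.h>
#include <arrow/table.h>

arrow::Status ScanDataset(
    const std::shared_ptr<arrow::dataset::Dataset>& dataset) {
  ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
  // Sets ScanOptions::batch_size; the scanner passes this value down to
  // ArrowReaderProperties::batch_size (128Ki rows is the 8.0.0 default).
  ARROW_RETURN_NOT_OK(builder->BatchSize(128 * 1024));
  ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());

  // Streaming consumption: each ReadNext pulls one batch from the
  // scanner's readahead queue, making room for the background threads
  // to keep reading.
  ARROW_ASSIGN_OR_RAISE(auto batch_reader, scanner->ToRecordBatchReader());
  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(batch_reader->ReadNext(&batch));
    if (batch == nullptr) break;  // end of stream
    // ... process `batch` promptly so the queue keeps draining ...
  }

  // Or, for small inputs, buffer everything into one table:
  // ARROW_ASSIGN_OR_RAISE(auto table, scanner->ToTable());
  return arrow::Status::OK();
}
```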
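If you really do need batches that line up exactly with row groups, a simpler route than splitting fragments with SplitByRowGroups is to read each row group directly through the FileReader. A rough sketch, reusing `reader` from the first example:

```cpp
// Reads each row group into its own Table, preserving row-group boundaries.
for (int i = 0; i < reader->num_row_groups(); ++i) {
  std::shared_ptr<arrow::Table> table;
  ARROW_RETURN_NOT_OK(reader->ReadRowGroup(i, &table));
  // `table` contains exactly the rows of row group `i`.
}
```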
On Mon, Apr 25, 2022 at 11:59 AM Aldrin <akmon...@ucsc.edu.invalid> wrote:

> I'm guessing that the default batch size is 65536 rows (64 * 1024) [1].
>
> I don't have any advice on this at the moment; I haven't looked through the
> dataset interface very much.
>
> If you're using Scanner::ToTable, then there's a note that ToTable "fully
> materializes the Scan result in memory" first [2]. If you're asking about
> calling ReadNext to materialize the data and then constructing the table as
> ToTable() does, then I am pretty sure that ToTable() essentially just calls
> ReadNext(). I haven't been able to find the spot in the code that verifies
> this, though.
>
> [1]: https://github.com/apache/arrow/blob/apache-arrow-7.0.0/cpp/src/parquet/properties.h#L556
> [2]: https://arrow.apache.org/docs/cpp/api/dataset.html#_CPPv4N5arrow7dataset7Scanner7ToTableEv
>
> Aldrin Montana
> Computer Science PhD Student
> UC Santa Cruz
>
> On Mon, Apr 25, 2022 at 3:05 AM 1057445597 <1057445...@qq.com.invalid> wrote:
>
> > I found that the RecordBatchReader reads fewer rows at a time than each
> > row_group contains, meaning that a row_group needs to be read twice by
> > RecordBatchReader. So what is the default batch size for
> > RecordBatchReader?
> >
> > Also, any good advice if I have to follow the row_group? I have a lot of
> > parquet files stored on S3, and if I convert the scanner to a
> > RecordBatchReader, I just loop ReadNext(). But if I want to read by
> > row_group, I find I have to call `auto fragments = dataset->GetFragments()`,
> > then iterate through the fragments and call SplitByRowGroups() to split
> > each fragment again. A scanner is then constructed for each resulting
> > fragment and the scanner's ToTable() is called to read the data.
> >
> > Finally, is there a performance difference between ToTable() and
> > ReadNext()?