I'm guessing that the default batch size is 65536 rows (64 * 1024) [1].
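
If you need the batches to line up differently, I believe you can
override that default with ScannerBuilder::BatchSize before finishing
the scanner. A rough sketch (untested; `dataset` is assumed to be an
already-built arrow::dataset::Dataset):

    #include <arrow/api.h>
    #include <arrow/dataset/api.h>

    // Sketch: override the scan batch size, then materialize the table.
    arrow::Result<std::shared_ptr<arrow::Table>> ScanWithBatchSize(
        const std::shared_ptr<arrow::dataset::Dataset>& dataset,
        int64_t batch_size) {
      ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
      // The default appears to be 64 * 1024 rows; override it here.
      ARROW_RETURN_NOT_OK(builder->BatchSize(batch_size));
      ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());
      return scanner->ToTable();
    }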

I don't have much advice on this at the moment; I haven't looked
through the dataset interface very much.
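
That said, a rough sketch (untested) of the per-row-group flow you
describe might look like the following, assuming every fragment in the
dataset is a ParquetFileFragment:

    #include <arrow/api.h>
    #include <arrow/dataset/api.h>
    #include <arrow/dataset/file_parquet.h>

    // Sketch: scan a parquet dataset one row group at a time.
    arrow::Status ScanPerRowGroup(
        const std::shared_ptr<arrow::dataset::Dataset>& dataset,
        const std::shared_ptr<arrow::dataset::ScanOptions>& options) {
      ARROW_ASSIGN_OR_RAISE(auto fragment_it, dataset->GetFragments());
      for (auto maybe_fragment : fragment_it) {
        ARROW_ASSIGN_OR_RAISE(auto fragment, maybe_fragment);
        // Assumes the dataset is backed by parquet files.
        auto parquet_fragment =
            std::static_pointer_cast<arrow::dataset::ParquetFileFragment>(
                fragment);
        // Split into one sub-fragment per row group (no filtering).
        ARROW_ASSIGN_OR_RAISE(
            auto row_groups,
            parquet_fragment->SplitByRowGroups(arrow::compute::literal(true)));
        for (const auto& row_group : row_groups) {
          // Build a scanner over just this row group and materialize it.
          arrow::dataset::ScannerBuilder builder(dataset->schema(),
                                                 row_group, options);
          ARROW_ASSIGN_OR_RAISE(auto scanner, builder.Finish());
          ARROW_ASSIGN_OR_RAISE(auto table, scanner->ToTable());
          // ... process `table` ...
        }
      }
      return arrow::Status::OK();
    }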

If you're using Scanner::ToTable, then there's a note that ToTable
"fully materializes the Scan result in memory" first [2]. If you mean
calling ReadNext yourself to materialize the data and then
constructing the table the way ToTable() does, I am pretty sure
ToTable() essentially just calls ReadNext() in a loop, so I wouldn't
expect a significant performance difference. I haven't been able to
find the spot in the code that verifies this, though.
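
In other words, the manual version would be something like this sketch
(untested; the reader could come from Scanner::ToRecordBatchReader()):

    #include <arrow/api.h>

    #include <vector>

    // Sketch: drain a RecordBatchReader and assemble a Table, which is
    // (as far as I can tell) roughly what ToTable() does internally.
    arrow::Result<std::shared_ptr<arrow::Table>> DrainToTable(
        const std::shared_ptr<arrow::RecordBatchReader>& reader) {
      std::vector<std::shared_ptr<arrow::RecordBatch>> batches;
      while (true) {
        std::shared_ptr<arrow::RecordBatch> batch;
        ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
        if (batch == nullptr) break;  // end of stream
        batches.push_back(std::move(batch));
      }
      return arrow::Table::FromRecordBatches(reader->schema(),
                                             std::move(batches));
    }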

[1]:
https://github.com/apache/arrow/blob/apache-arrow-7.0.0/cpp/src/parquet/properties.h#L556
[2]:
https://arrow.apache.org/docs/cpp/api/dataset.html#_CPPv4N5arrow7dataset7Scanner7ToTableEv

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz


On Mon, Apr 25, 2022 at 3:05 AM 1057445597 <1057445...@qq.com.invalid>
wrote:

> I found that the RecordBatchReader reads fewer rows at a time than each
> row_group contains, meaning that a row_group needs to be read twice by
> the RecordBatchReader. So what is the default batch size for
> RecordBatchReader?
>
>
> Also, any good advice if I have to follow the row_group boundaries? I
> have a lot of parquet files stored on S3. If I convert the scanner to a
> RecordBatchReader, I just loop over ReadNext(). But if I want to read by
> row_group, I find I have to call `auto fragments = dataset->GetFragments()`,
> then iterate through the fragments and call SplitByRowGroups() to split
> each fragment again. A scanner is then constructed for each resulting
> fragment, and that scanner's ToTable() is called to read the data.
>
>
> Finally, is there a performance difference between ToTable() and
> ReadNext()?
