I'm guessing that the default batch size is 65536 rows (64 * 1024) [1]. I don't have any advice on this at the moment; I haven't looked through the dataset interface very much.
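In case it helps, I believe you can override the batch size on the
ScannerBuilder before finishing it. Here's a rough, untested sketch of how
I'd expect that to look; `ScanWithBatchSize` is my own helper name, and it
assumes `dataset` is an already-constructed dataset:

#include <arrow/dataset/api.h>

// Sketch: override the scan batch size instead of relying on the default.
arrow::Result<std::shared_ptr<arrow::Table>> ScanWithBatchSize(
    const std::shared_ptr<arrow::dataset::Dataset>& dataset) {
  ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
  // BatchSize sets a maximum; the scanner may still yield smaller batches.
  ARROW_RETURN_NOT_OK(builder->BatchSize(128 * 1024));
  ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());
  return scanner->ToTable();
}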
If you're using Scanner::ToTable, then there's a note that ToTable "fully
materializes the Scan result in memory" [2]. If instead you mean calling
ReadNext yourself to materialize the data and then constructing the table
the way ToTable() does, then I'm fairly sure ToTable() essentially just
loops over ReadNext(). I haven't been able to find the spot in the code
that verifies this, though.

[1]: https://github.com/apache/arrow/blob/apache-arrow-7.0.0/cpp/src/parquet/properties.h#L556
[2]: https://arrow.apache.org/docs/cpp/api/dataset.html#_CPPv4N5arrow7dataset7Scanner7ToTableEv

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz

On Mon, Apr 25, 2022 at 3:05 AM 1057445597 <1057445...@qq.com.invalid> wrote:

> I found that the RecordBatchReader reads fewer rows at a time than each
> row group contains, meaning that a row group has to be read more than
> once by the RecordBatchReader. So what is the default batch size for
> RecordBatchReader?
>
> Also, do you have any advice if I need to read along row-group
> boundaries? I have a lot of parquet files stored on S3. If I convert the
> scanner to a RecordBatchReader, I can just loop on ReadNext(). But if I
> want to read row group by row group, I find I have to call
> `auto fragments = dataset->GetFragments()`, then iterate through the
> fragments and call SplitByRowGroups() to split each fragment again. A
> scanner is then constructed for each split fragment, and the scanner's
> ToTable() is called to read the data.
>
> Finally, is there a performance difference between ToTable() and
> ReadNext()?
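On your second question: the GetFragments() + SplitByRowGroups() flow you
describe matches my understanding of the API. For what it's worth, here's a
rough, untested sketch of that flow; `ReadOneRowGroupAtATime` is my own
name, and passing literal(true) as a keep-everything predicate to
SplitByRowGroups is an assumption on my part:

#include <arrow/dataset/api.h>
#include <arrow/dataset/file_parquet.h>

// Sketch: split each parquet fragment into per-row-group fragments, then
// scan each one independently.
arrow::Status ReadOneRowGroupAtATime(
    const std::shared_ptr<arrow::dataset::Dataset>& dataset) {
  ARROW_ASSIGN_OR_RAISE(auto fragment_it, dataset->GetFragments());
  for (auto maybe_fragment : fragment_it) {
    ARROW_ASSIGN_OR_RAISE(auto fragment, maybe_fragment);
    // Assumes every fragment is parquet-backed, so this cast is safe.
    auto parquet_fragment =
        std::static_pointer_cast<arrow::dataset::ParquetFileFragment>(fragment);
    ARROW_ASSIGN_OR_RAISE(
        auto row_group_fragments,
        parquet_fragment->SplitByRowGroups(arrow::compute::literal(true)));
    for (const auto& rg_fragment : row_group_fragments) {
      // Build a scanner over just this one row group's fragment.
      auto options = std::make_shared<arrow::dataset::ScanOptions>();
      arrow::dataset::ScannerBuilder builder(dataset->schema(), rg_fragment,
                                             options);
      ARROW_ASSIGN_OR_RAISE(auto scanner, builder.Finish());
      ARROW_ASSIGN_OR_RAISE(auto table, scanner->ToTable());
      // ... use `table`, which holds one row group's worth of data ...
    }
  }
  return arrow::Status::OK();
}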
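And on ToTable() vs ReadNext(): since I believe ToTable() is essentially a
loop over ReadNext() under the hood, I wouldn't expect a big throughput
difference; the practical difference is that ToTable() holds the whole
result in memory at once, while looping ReadNext() yourself lets you
process one batch at a time. An untested sketch of the manual version,
assuming you already have a Scanner:

// Sketch: materialize a table by hand via ReadNext(); I'd guess this is
// roughly what ToTable() does internally, but that's an assumption.
arrow::Result<std::shared_ptr<arrow::Table>> TableViaReadNext(
    const std::shared_ptr<arrow::dataset::Scanner>& scanner) {
  ARROW_ASSIGN_OR_RAISE(auto reader, scanner->ToRecordBatchReader());
  std::vector<std::shared_ptr<arrow::RecordBatch>> batches;
  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
    if (batch == nullptr) break;  // end of stream
    batches.push_back(batch);
  }
  // Pass the schema explicitly in case the scan produced zero batches.
  return arrow::Table::FromRecordBatches(reader->schema(), batches);
}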