> I found that the RecordBatchReader reads fewer rows at a time than each
> row_group contains, meaning that a row_group needs to be read twice by
> RecordBatchReader. So what is the default batch size for RecordBatchReader?
There are a few different places a row group could get fragmented.

If you are reading from a file directly with parquet::arrow::FileReader, then the relevant property is ArrowReaderProperties::batch_size, which defaults to 64Ki rows.

If you are reading with a scanner, then the scanner has its own batch size (arrow::dataset::ScanOptions::batch_size), and the scanner will set ArrowReaderProperties::batch_size to match it. In 7.0.0 the default is 1Mi rows; in 8.0.0 the default was changed to 128Ki rows.

If you are then using that scanner as a source for an execution plan, there is eventually going to be a batch size for the execution plan as well. It's unclear exactly what the default is (ideally we'd like these batches to fit into L2 cache) and it may not be configurable (for example, it may change dynamically depending on how many columns you have). The sink nodes could probably be extended to support an output batch size, but we don't have that yet.

> meaning that a row_group needs to be read
> twice by RecordBatchReader

None of these methods will read the row group twice. Any time a batch size is smaller than the physical row group size, we read the row group once and slice it (slicing is a zero-copy, fairly fast operation).

> Also, any good advice if I have to follow the row_group?

Set the batch size to a really large value and don't use the execution engine (see the sketches below). Do you have a use case for trying to maintain the row group size? Large row group sizes sometimes make sense for storage. For in-memory processing, however, a large batch size usually leads to inefficient use of cache and RAM.

> Finally, is there a performance difference
> between ToTable() and ReadNext()?

As Aldrin pointed out, ToTable is going to buffer the entire file in memory before it returns one table with all the contents of the file. When working with large data this is usually not a good approach, but for smaller datasets it can be a nice convenience.

ReadNext should always be as fast as, or faster than, ToTable, assuming you are processing the data quickly enough. When you create a RecordBatchReader from a scanner, background threads start working to fill a queue with data. Each call to ReadNext grabs an item from that queue and makes room for more data to be read. If you don't call ReadNext often enough, the queue will fill up and the reader will pause.
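To make the first case concrete, here is a minimal sketch against the Arrow 7.x C++ API (exact signatures differ slightly across versions; the file path and batch size are placeholders) of reading a local Parquet file with an explicit ArrowReaderProperties::batch_size:

```cpp
#include <memory>
#include <numeric>
#include <string>
#include <vector>

#include <arrow/io/file.h>
#include <arrow/record_batch.h>
#include <arrow/result.h>
#include <arrow/status.h>
#include <parquet/arrow/reader.h>

arrow::Status ReadWithBatchSize(const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto infile, arrow::io::ReadableFile::Open(path));

  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(
      parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));

  // Overrides ArrowReaderProperties::batch_size (default 64Ki rows).
  reader->set_batch_size(64 * 1024);

  // Ask for a reader over every row group in the file.
  std::vector<int> row_groups(reader->num_row_groups());
  std::iota(row_groups.begin(), row_groups.end(), 0);
  std::unique_ptr<arrow::RecordBatchReader> batch_reader;
  ARROW_RETURN_NOT_OK(reader->GetRecordBatchReader(row_groups, &batch_reader));

  // Batches come out at most batch_size rows long; a row group larger than
  // batch_size is read once and sliced, not read twice.
  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(batch_reader->ReadNext(&batch));
    if (batch == nullptr) break;  // end of stream
    // ... process `batch` ...
  }
  return arrow::Status::OK();
}
```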
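And a sketch of the scanner case, assuming `dataset` is an already-constructed arrow::dataset::Dataset (e.g. a FileSystemDataset over your S3 files). It also shows the streaming ReadNext pattern next to the buffered ToTable alternative:

```cpp
#include <memory>

#include <arrow/dataset/dataset.h>
#include <arrow/dataset/scanner.h>
#include <arrow/record_batch.h>
#include <arrow/table.h>

arrow::Status ScanDataset(
    const std::shared_ptr<arrow::dataset::Dataset>& dataset) {
  ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
  // Sets ScanOptions::batch_size; the scanner passes this value down to
  // ArrowReaderProperties::batch_size (128Ki rows is the 8.0.0 default).
  ARROW_RETURN_NOT_OK(builder->BatchSize(128 * 1024));
  ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());

  // Streaming consumption: each ReadNext pulls one batch from the
  // scanner's readahead queue, making room for the background threads
  // to keep reading.
  ARROW_ASSIGN_OR_RAISE(auto batch_reader, scanner->ToRecordBatchReader());
  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(batch_reader->ReadNext(&batch));
    if (batch == nullptr) break;  // end of stream
    // ... process `batch` promptly so the queue keeps draining ...
  }

  // Or, for small inputs, buffer everything into one table:
  // ARROW_ASSIGN_OR_RAISE(auto table, scanner->ToTable());
  return arrow::Status::OK();
}
```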
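If you really do need batches that line up exactly with row groups, a simpler route than splitting fragments with SplitByRowGroups is to read each row group directly through the FileReader. A rough sketch, reusing `reader` from the first example:

```cpp
// Reads each row group into its own Table, preserving row-group boundaries.
for (int i = 0; i < reader->num_row_groups(); ++i) {
  std::shared_ptr<arrow::Table> table;
  ARROW_RETURN_NOT_OK(reader->ReadRowGroup(i, &table));
  // `table` contains exactly the rows of row group `i`.
}
```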
On Mon, Apr 25, 2022 at 11:59 AM Aldrin <akmon...@ucsc.edu.invalid> wrote:

> I'm guessing that the default batch size is 65536 rows (64 * 1024) [1].
>
> I don't have any advice on this at the moment; I haven't looked through the
> dataset interface very much.
>
> If you're using Scanner::ToTable, then there's a note that ToTable "fully
> materializes the Scan result in memory" first [2]. If you're asking about
> calling ReadNext to materialize the data and then constructing the table as
> ToTable() does, then I am pretty sure that ToTable() essentially just calls
> ReadNext(). I haven't been able to find the spot in the code that verifies
> this, though.
>
> [1]: https://github.com/apache/arrow/blob/apache-arrow-7.0.0/cpp/src/parquet/properties.h#L556
> [2]: https://arrow.apache.org/docs/cpp/api/dataset.html#_CPPv4N5arrow7dataset7Scanner7ToTableEv
>
> Aldrin Montana
> Computer Science PhD Student
> UC Santa Cruz
>
> On Mon, Apr 25, 2022 at 3:05 AM 1057445597 <1057445...@qq.com.invalid> wrote:
>
> > I found that the RecordBatchReader reads fewer rows at a time than each
> > row_group contains, meaning that a row_group needs to be read twice by
> > RecordBatchReader. So what is the default batch size for
> > RecordBatchReader?
> >
> > Also, any good advice if I have to follow the row_group? I have a lot of
> > parquet files stored on S3, and if I convert the scanner to a
> > RecordBatchReader, I just loop ReadNext(). But if I want to read by
> > row_group, I find I have to call `auto fragments = dataset->GetFragments()`,
> > then iterate through the fragments and call SplitByRowGroups() to split
> > each fragment again. A scanner is then constructed for each resulting
> > fragment and the scanner's ToTable() is called to read the data.
> >
> > Finally, is there a performance difference between ToTable() and
> > ReadNext()?