1. At the Dataset layer, there is `fragment_readahead` and possibly other
readahead options.
2. In Parquet, if pre-buffer is enabled, it will pre-buffer the selected
column chunks up front (see `FileReaderImpl::GetRecordBatchReader`).
3. In Parquet, if non-buffered reads are enabled, reading a column pulls the
whole ColumnChunk into memory at once. Otherwise it does a "buffered" read in
pieces whose size is decided by the buffer size.
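
Roughly, those knobs are set like this when building a scan. This is an
untested sketch from my reading of the current C++ API; `ConfigureScan` and
the concrete values are just placeholders:

  #include <memory>

  #include "arrow/dataset/api.h"
  #include "parquet/properties.h"

  arrow::Status ConfigureScan(arrow::dataset::ScannerBuilder* builder) {
    auto parquet_options =
        std::make_shared<arrow::dataset::ParquetFragmentScanOptions>();

    // (2) Pre-buffer: fetch the byte ranges of the selected column chunks
    // up front (see FileReaderImpl::GetRecordBatchReader).
    parquet_options->arrow_reader_properties->set_pre_buffer(true);

    // (3) Buffered stream: read each ColumnChunk in buffer_size pieces
    // instead of reading the whole chunk at once.
    parquet_options->reader_properties->enable_buffered_stream();
    parquet_options->reader_properties->set_buffer_size(1 << 20);  // placeholder: 1 MiB

    ARROW_RETURN_NOT_OK(builder->FragmentScanOptions(parquet_options));

    // (1) Dataset-level readahead: how many fragments (files) are scanned
    // ahead concurrently, which multiplies the per-file memory.
    ARROW_RETURN_NOT_OK(builder->FragmentReadahead(4));  // placeholder value
    return arrow::Status::OK();
  }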
Maybe I forgot some places. You can try to check that.

Best,
Xuwei Fu

On Thu, Sep 7, 2023 at 00:16 Li Jin <ice.xell...@gmail.com> wrote:

> Thanks both for the quick response! I wonder if there is some code in
> parquet cpp that might be keeping some cached information (perhaps
> metadata) per file scanned?
>
> On Wed, Sep 6, 2023 at 12:10 PM wish maple <maplewish...@gmail.com> wrote:
>
> > I've met lots of Parquet Dataset issues. The main problem is that we
> > currently have two sets of APIs, and they have different scan options.
> > Sometimes different interfaces like `to_batches()` and others enable
> > different scan options.
> >
> > I think [2] is similar to your problem. [1]-[4] are some issues I met
> > before.
> >
> > As for the code, you may take a look at:
> > 1. ParquetFileFormat and the Dataset-related code.
> > 2. FileSystem and CacheRange. Parquet might use this to handle pre-buffer.
> > 3. How the Parquet RowReader handles IO.
> >
> > [1] https://github.com/apache/arrow/issues/36765
> > [2] https://github.com/apache/arrow/issues/37139
> > [3] https://github.com/apache/arrow/issues/36587
> > [4] https://github.com/apache/arrow/issues/37136
> >
> > On Wed, Sep 6, 2023 at 23:56 Li Jin <ice.xell...@gmail.com> wrote:
> >
> > > Hello,
> > >
> > > I have been testing "What is the max rss needed to scan through ~100G
> > > of data in a parquet stored in gcs using Arrow C++".
> > >
> > > The current answer is about ~6G of memory, which seems a bit high, so
> > > I looked into it. What I observed during the process led me to think
> > > that there are some potential cache/memory issues in the
> > > dataset/parquet cpp code.
> > >
> > > Main observations:
> > > (1) As I scan through the dataset, I printed out (a) memory allocated
> > > by the memory pool from ScanOptions and (b) process rss. I found that
> > > while (a) stays pretty stable throughout the scan (stays < 1G), (b)
> > > keeps increasing during the scan (it looks linear in the number of
> > > files scanned).
> > > (2) I tested the ScanNode in Arrow as well as an in-house library that
> > > implements its own "S3Dataset" similar to Arrow dataset, and both show
> > > similar rss usage. (Which led me to think the issue is more likely to
> > > be in the parquet cpp code instead of the dataset code.)
> > > (3) Scanning the same dataset twice in the same process doesn't
> > > increase the max rss.
> > >
> > > I plan to look into the parquet cpp/dataset code, but I wonder if
> > > someone has some clues about what the issue might be or where to look?
> > >
> > > Thanks,
> > > Li
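
By the way, to cross-check the pool-vs-rss numbers described above, something
roughly like the following could work. This is an untested sketch: it assumes
you already have a dataset::Scanner built, the /proc/self/statm parsing is
Linux-only, and the helper names (CurrentRssBytes, ScanAndReport) and the
reporting interval are made up for illustration:

  #include <unistd.h>

  #include <fstream>
  #include <iostream>
  #include <memory>

  #include "arrow/api.h"
  #include "arrow/dataset/api.h"

  // Current resident set size in bytes, read from /proc/self/statm (Linux only).
  int64_t CurrentRssBytes() {
    std::ifstream statm("/proc/self/statm");
    int64_t total_pages = 0, rss_pages = 0;
    statm >> total_pages >> rss_pages;
    return rss_pages * sysconf(_SC_PAGESIZE);
  }

  // Drain the scanner and periodically print pool allocation next to rss.
  arrow::Status ScanAndReport(std::shared_ptr<arrow::dataset::Scanner> scanner,
                              arrow::MemoryPool* pool) {
    ARROW_ASSIGN_OR_RAISE(auto reader, scanner->ToRecordBatchReader());
    int64_t batches = 0;
    while (true) {
      std::shared_ptr<arrow::RecordBatch> batch;
      ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
      if (batch == nullptr) break;  // end of scan
      if (++batches % 100 == 0) {
        std::cerr << "pool bytes_allocated=" << pool->bytes_allocated()
                  << " rss=" << CurrentRssBytes() << std::endl;
      }
    }
    return arrow::Status::OK();
  }

If (b) keeps growing while pool->bytes_allocated() stays flat, the growth is
coming from allocations outside the pool (or from memory the allocator has
not returned to the OS), which matches the observation in the thread.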