1. In Dataset, there are options like `fragment_readahead`, among others.
2. In Parquet, if pre-buffer is enabled, it will pre-buffer some columns (see
`FileReaderImpl::GetRecordBatchReader`).
3. In Parquet, if non-buffered reads are enabled, reading a column reads the
whole ColumnChunk.
    Otherwise, it does "buffered" reads whose size is decided by the buffer
size (a rough sketch of these options follows below).
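To make this concrete, here is a rough sketch of where these knobs live when
building scan options. This is my reading of the Arrow C++ Dataset / Parquet
APIs, not code from this thread; member names and defaults may vary between
versions.

// Sketch only: wires up fragment readahead, Parquet pre-buffer, and
// buffered-stream reads for a Dataset scan.
#include <memory>

#include <arrow/dataset/file_parquet.h>
#include <arrow/dataset/scanner.h>
#include <parquet/properties.h>

std::shared_ptr<arrow::dataset::ScanOptions> MakeScanOptions() {
  auto scan_options = std::make_shared<arrow::dataset::ScanOptions>();

  // (1) Dataset-level readahead: how many fragments / batches are kept
  // in flight while scanning.
  scan_options->fragment_readahead = 4;
  scan_options->batch_readahead = 8;

  // Parquet-specific options travel through ParquetFragmentScanOptions.
  auto parquet_options =
      std::make_shared<arrow::dataset::ParquetFragmentScanOptions>();

  // (2) Pre-buffer: coalesce and fetch the needed column ranges up front
  // (see FileReaderImpl::GetRecordBatchReader).
  parquet_options->arrow_reader_properties->set_pre_buffer(true);

  // (3) Buffered stream: read a ColumnChunk in buffer_size pieces instead of
  // pulling the whole chunk into memory at once.
  parquet_options->reader_properties->enable_buffered_stream();
  parquet_options->reader_properties->set_buffer_size(1 << 20);  // 1 MiB

  scan_options->fragment_scan_options = parquet_options;
  return scan_options;
}

With buffered streams, each column reader should hold roughly buffer_size at
a time instead of a whole ColumnChunk, at the cost of more IO calls;
pre-buffer instead coalesces the needed ranges and fetches them up front,
which usually helps on high-latency filesystems like GCS.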

Maybe I forgot some places; you can check for others.

Best
Xuwei Fu

Li Jin <ice.xell...@gmail.com> wrote on Thursday, September 7, 2023 at 00:16:

> Thanks both for the quick response! I wonder if there is some code in
> parquet cpp that might be keeping some cached information (perhaps
> metadata) per file scanned?
>
> On Wed, Sep 6, 2023 at 12:10 PM wish maple <maplewish...@gmail.com> wrote:
>
> > I've met lots of Parquet Dataset issues. The main problem is that
> > currently we have 2 sets of APIs and they have different scan options.
> > And sometimes different interfaces like `to_batches()` or others would
> > enable different scan options.
> >
> > I think [2] is similar to your problem. [1]-[4] are some issues I met before.
> >
> > As for the code, you may take a look at:
> > 1. ParquetFileFormat and the Dataset-related code.
> > 2. FileSystem and CacheRange. Parquet might use this to handle pre-buffering.
> > 3. How the Parquet RowReader handles IO.
> >
> > [1] https://github.com/apache/arrow/issues/36765
> > [2] https://github.com/apache/arrow/issues/37139
> > [3] https://github.com/apache/arrow/issues/36587
> > [4] https://github.com/apache/arrow/issues/37136
> >
> > Li Jin <ice.xell...@gmail.com> wrote on Wednesday, September 6, 2023 at 23:56:
> >
> > > Hello,
> > >
> > > I have been testing "What is the max RSS needed to scan through ~100G
> > > of data in Parquet stored in GCS using Arrow C++?".
> > >
> > > The current answer is about ~6G of memory, which seems a bit high, so I
> > > looked into it. What I observed during the process led me to think that
> > > there are some potential cache/memory issues in the dataset/parquet cpp
> > > code.
> > >
> > > Main observation:
> > > (1) As I am scanning through the dataset, I printed out (a) the memory
> > > allocated by the memory pool from ScanOptions and (b) the process RSS. I
> > > found that while (a) stays pretty stable throughout the scan (stays < 1G),
> > > (b) keeps increasing during the scan (roughly linear in the number of
> > > files scanned).
> > > (2) I tested ScanNode in Arrow as well as an in-house library that
> > > implements its own "S3Dataset" similar to Arrow Dataset; both show
> > > similar RSS usage (which led me to think the issue is more likely in the
> > > parquet cpp code than in the dataset code).
> > > (3) Scanning the same dataset twice in the same process doesn't increase
> > > the max RSS.
> > >
> > > I plan to look into the parquet cpp/dataset code, but I wonder if
> > > someone has any clues about what the issue might be or where to look?
> > >
> > > Thanks,
> > > Li
> > >
> >
>
