> Hi Jin,
> Do you have more information about the parquet file?

This is the metadata for one file (I scanned about 2000 files in total):

<pyarrow._parquet.FileMetaData object at 0x7fb885e92ef0>
  created_by: parquet-mr version 1.12.3 (build f8dced182c4c1fbdec6ccb3185537b5a01e6ed6b)
  num_columns: 840
  num_rows: 87382
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 247053
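
For reference, metadata like the above can be printed with pyarrow roughly as
follows (the file path is just a placeholder, not one of the actual files):

import pyarrow.parquet as pq

# Placeholder path; substitute one of the scanned Parquet files.
md = pq.ParquetFile("part-00000.parquet").metadata
print(md)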


On Wed, Sep 6, 2023 at 12:22 PM wish maple <maplewish...@gmail.com> wrote:

> 1. In Dataset, there is `fragment_readahead` and other readahead options.
> 2. In Parquet, if pre-buffer is enabled, it will prebuffer some columns (see
> `FileReaderImpl::GetRecordBatchReader`).
> 3. In Parquet, if non-buffered read is enabled, reading a column reads the
> whole ColumnChunk at once. Otherwise, reads are buffered, with the size
> decided by the buffer size.
>
> Maybe I've forgotten some places. You can try to check those.
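>
> For illustration, a rough pyarrow sketch of where these knobs can be set
> (the path, buffer size, and readahead value below are just placeholders,
> and option availability depends on the pyarrow version):
>
> import pyarrow.dataset as ds
>
> fmt = ds.ParquetFileFormat(
>     default_fragment_scan_options=ds.ParquetFragmentScanOptions(
>         pre_buffer=True,            # (2) prebuffer column chunks
>         use_buffered_stream=False,  # (3) False: read whole ColumnChunks
>         buffer_size=2 ** 20,        # used when use_buffered_stream=True
>     )
> )
> dataset = ds.dataset("gs://bucket/path", format=fmt)  # placeholder path
> batches = dataset.to_batches(fragment_readahead=4)    # (1) fragment readahead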
>
> Best
> Xuwei Fu
>
> Li Jin <ice.xell...@gmail.com> wrote on Thu, Sep 7, 2023 at 00:16:
>
> > Thanks both for the quick response! I wonder if there is some code in
> > parquet cpp that might be keeping some cached information (perhaps
> > metadata) per file scanned?
> >
> > On Wed, Sep 6, 2023 at 12:10 PM wish maple <maplewish...@gmail.com> wrote:
> >
> > > I've met lots of Parquet Dataset issues. The main problem is that we
> > > currently have two sets of APIs, and they have different scan options.
> > > And sometimes different interfaces like `to_batches()` or others would
> > > enable different scan options.
> > >
> > > I think [2] is similar to your problem. [1]-[4] are some issues I met
> > > before.
> > >
> > > As for the code, you may take a look at:
> > > 1. ParquetFileFormat and the related Dataset code.
> > > 2. FileSystem and CacheRange; Parquet might use this to handle
> > > pre-buffering.
> > > 3. How the Parquet RowReader handles IO.
> > >
> > > [1] https://github.com/apache/arrow/issues/36765
> > > [2] https://github.com/apache/arrow/issues/37139
> > > [3] https://github.com/apache/arrow/issues/36587
> > > [4] https://github.com/apache/arrow/issues/37136
> > >
> > > Li Jin <ice.xell...@gmail.com> wrote on Wed, Sep 6, 2023 at 23:56:
> > >
> > > > Hello,
> > > >
> > > > I have been testing "What is the max RSS needed to scan through ~100 GB
> > > > of data in a Parquet dataset stored in GCS using Arrow C++?".
> > > >
> > > > The current answer is about ~6 GB of memory, which seems a bit high, so
> > > > I looked into it. What I observed during the process led me to think
> > > > that there are some potential cache/memory issues in the
> > > > dataset/parquet cpp code.
> > > >
> > > > Main observations:
> > > > (1) While scanning through the dataset, I printed out (a) the memory
> > > > allocated by the memory pool from ScanOptions and (b) the process RSS
> > > > (see the sketch below). I found that while (a) stays pretty stable
> > > > throughout the scan (< 1 GB), (b) keeps increasing during the scan
> > > > (roughly linear in the number of files scanned).
> > > > (2) I tested the ScanNode in Arrow as well as an in-house library that
> > > > implements its own "S3Dataset" similar to the Arrow dataset; both show
> > > > similar RSS usage (which led me to think the issue is more likely in
> > > > the parquet cpp code than in the dataset code).
> > > > (3) Scanning the same dataset twice in the same process doesn't
> > > > increase the max RSS.
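> > > >
> > > > For reference, the measurement in (1) can be sketched roughly like this
> > > > with pyarrow (the path is a placeholder, the in-house scanner is not
> > > > shown, and getrusage units are Linux-specific):
> > > >
> > > > import resource
> > > > import pyarrow as pa
> > > > import pyarrow.dataset as ds
> > > >
> > > > pool = pa.default_memory_pool()
> > > > dataset = ds.dataset("gs://bucket/path", format="parquet")  # placeholder
> > > >
> > > > for i, batch in enumerate(dataset.to_batches(memory_pool=pool)):
> > > >     if i % 100 == 0:
> > > >         # Pool-allocated bytes vs. peak RSS so far (kB on Linux).
> > > >         rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
> > > >         print(pool.bytes_allocated(), rss_kb)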
> > > >
> > > > I plan to look into the parquet cpp/dataset code, but I wonder if
> > > > someone has some clues about what the issue might be or where to look?
> > > >
> > > > Thanks,
> > > > Li
> > > >
> > >
> >
>
