For the record: this has been opened on JIRA as ARROW-10052.
Here is the analysis I posted there (pasted):

I'm not sure there's anything surprising. Running this thing a bit (in debug mode!), I see that RSS usage grows by 500-1000 bytes for each column chunk (that is, each column in a row group).

This seems to be simply the Parquet file metadata accumulating before it can be written at the end (when the ParquetWriter is closed). format::FileMetadata has a vector of format::RowGroup (one per row group); format::RowGroup has a vector of format::Column (one per column); and each format::Column holds non-trivial information: file name, column metadata (itself potentially large).

So, basically, you should write only large row groups to Parquet files. Writing 100 rows at a time makes the Parquet format completely inadequate; use at least 10000 or 100000 rows per row group, IMHO.

And, regardless of the row group size, if you grow a Parquet file by keeping the ParquetWriter open, memory usage will also grow a bit because of this accumulating file metadata. I see no way around that, other than avoiding this appending pattern.

Regards

Antoine.


On Tue, 15 Sep 2020 17:46:14 +0200
Niklas B <niklas.biv...@enplore.com> wrote:

> First of all: thank you so much for all the hard work on Arrow, it's an awesome project.
>
> Hi,
>
> I'm trying to write a large Parquet file to disk (larger than memory) using PyArrow's ParquetWriter and write_table, but even though the file is written incrementally to disk, it still appears to keep the entire dataset in memory (eventually getting OOM killed). Basically, what I am trying to do is:
>
> with pq.ParquetWriter(
>         output_file,
>         arrow_schema,
>         compression='snappy',
>         allow_truncated_timestamps=True,
>         version='2.0',  # Highest available format version
>         data_page_version='2.0',  # Highest available data page version
> ) as writer:
>     for rows_dataframe in function_that_yields_data():
>         writer.write_table(
>             pa.Table.from_pydict(
>                 rows_dataframe,
>                 arrow_schema
>             )
>         )
>
> Here I have a function that yields data, and I write it out in chunks using write_table.
>
> Is it possible to force the ParquetWriter to not keep the entire dataset in memory, or is that simply not possible, for good reasons?
>
> I'm streaming data from a database and writing it to Parquet. The end consumer has plenty of RAM, but the machine that does the conversion doesn't.
>
> Regards,
> Niklas
>
> PS: I've also created a Stack Overflow question, which I will update with any answer I might get from the mailing list:
> https://stackoverflow.com/questions/63891231/pyarrow-incrementally-using-parquetwriter-without-keeping-entire-dataset-in-mem
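
To illustrate the advice above, here is a rough, untested sketch of buffering the small chunks and flushing them as larger row groups. It borrows output_file, arrow_schema and function_that_yields_data() from Niklas's snippet and assumes each yielded chunk is a dict of equal-length columns:

import pyarrow as pa
import pyarrow.parquet as pq

ROWS_PER_ROW_GROUP = 100_000  # write large row groups, per the advice above

with pq.ParquetWriter(output_file, arrow_schema, compression='snappy') as writer:
    buffered_tables = []
    buffered_rows = 0
    for rows_dict in function_that_yields_data():
        chunk = pa.Table.from_pydict(rows_dict, arrow_schema)
        buffered_tables.append(chunk)
        buffered_rows += chunk.num_rows
        if buffered_rows >= ROWS_PER_ROW_GROUP:
            # Concatenate the small chunks and write them as one large row group.
            writer.write_table(pa.concat_tables(buffered_tables))
            buffered_tables = []
            buffered_rows = 0
    if buffered_tables:
        # Flush whatever is left as a final (possibly smaller) row group.
        writer.write_table(pa.concat_tables(buffered_tables))

This doesn't remove the metadata accumulation described above (the footer still grows by one format::RowGroup per flush), but with ~100000 rows per flush the number of row groups, and therefore the accumulated metadata, stays small.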