Hi Niklas,
Two suggestions:
* Try adjusting row_group_size on write_table [1] to a value smaller than the
default.  If I read the code correctly, the default is currently 64 million
rows [2], which seems potentially too high (I'll open a JIRA about this).
* If this is on Linux/Mac, try setting the jemalloc decay, which can return
memory to the OS more quickly [3]. A rough sketch of both suggestions is below.
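
Untested sketch of what I mean, reusing your output_file, arrow_schema and
function_that_yields_data placeholders (100k rows per row group is just an
illustrative number, not a recommendation):

import pyarrow as pa
import pyarrow.parquet as pq

# Ask jemalloc to return unused memory to the OS as quickly as possible.
# Only has an effect where the jemalloc allocator is used (Linux/Mac).
pa.jemalloc_set_decay_ms(0)

with pq.ParquetWriter(output_file, arrow_schema,
                      compression='snappy') as writer:
    for rows_dataframe in function_that_yields_data():
        table = pa.Table.from_pydict(rows_dataframe, arrow_schema)
        # Split each incoming chunk into row groups of at most 100k rows,
        # so less data is buffered before each flush to disk.
        writer.write_table(table, row_group_size=100_000)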

Just to confirm this is a local disk (not a blob store?) that you are
writing to?

If, after trying these two items, you can still produce a minimal example
that holds onto all the memory, please open a JIRA, as there could be a bug
or some unexpected buffering happening.
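
For reference, a minimal repro might look roughly like the following (a
sketch I have not run; psutil is only used to print the process RSS, and
the sizes are arbitrary):

import os

import psutil
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([('x', pa.int64())])

with pq.ParquetWriter('/tmp/repro.parquet', schema) as writer:
    for i in range(1000):
        # 100k rows of synthetic data per chunk, written as one row group.
        chunk = pa.Table.from_pydict({'x': list(range(100_000))}, schema)
        writer.write_table(chunk, row_group_size=100_000)
        if i % 100 == 0:
            # If RSS keeps growing with the file size instead of staying
            # roughly flat, something is buffering the whole dataset.
            rss = psutil.Process(os.getpid()).memory_info().rss
            print('iteration', i, 'rss MiB', rss // 2**20)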

Thanks,
Micah

[1]
https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html#pyarrow.parquet.ParquetWriter.write_table
[2]
https://github.com/apache/arrow/blob/a4eb08d54ee0d4c0d0202fa0a2dfa8af7aad7a05/python/pyarrow/memory.pxi#L156
[3]
https://github.com/apache/arrow/blob/a4eb08d54ee0d4c0d0202fa0a2dfa8af7aad7a05/python/pyarrow/memory.pxi#L156

On Tue, Sep 15, 2020 at 8:46 AM Niklas B <niklas.biv...@enplore.com> wrote:

> First of all: Thank you so much for all the hard work on Arrow, it’s an
> awesome project.
>
> Hi,
>
> I'm trying to write a large parquet file to disk (larger than memory)
> using PyArrow's ParquetWriter and write_table, but even though the file is
> written incrementally to disk, it still appears to keep the entire dataset
> in memory (eventually getting OOM killed). Basically, what I am trying to
> do is:
>
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> with pq.ParquetWriter(
>         output_file,
>         arrow_schema,
>         compression='snappy',
>         allow_truncated_timestamps=True,
>         version='2.0',  # Highest available format version
>         data_page_version='2.0',  # Highest available data page version
> ) as writer:
>     for rows_dataframe in function_that_yields_data():
>         writer.write_table(
>             pa.Table.from_pydict(rows_dataframe, arrow_schema)
>         )
>
> Here I have a function that yields data, and I then write it in chunks
> using write_table.
>
> Is it possible to force the ParquetWriter to not keep the entire dataset
> in memory, or is it simply not possible for good reasons?
>
> I’m streaming data from a database and writing it to Parquet. The
> end-consumer has plenty of RAM, but the machine that does the conversion
> doesn’t.
>
> Regards,
> Niklas
>
> PS: I’ve also created a Stack Overflow question, which I will update with
> any answer I might get from the mailing list:
>
> https://stackoverflow.com/questions/63891231/pyarrow-incrementally-using-parquetwriter-without-keeping-entire-dataset-in-mem
