Hi Niklas,

Two suggestions:

* Try adjusting row_group_size on write_table [1] to a smaller-than-default value. If I read the code correctly, this currently defaults to 64 million rows [2], which seems potentially too high as a default (I'll open a JIRA about this). A rough sketch of both suggestions is below.
* If this is on Linux/Mac, try setting the jemalloc decay, which can return memory to the OS more quickly [3].
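For example, something along these lines (an untested sketch using the names from your snippet; the row_group_size of 100,000 is just a starting point to tune, and jemalloc_set_decay_ms only applies when pyarrow is using the jemalloc allocator, the default on Linux/Mac):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Ask jemalloc to release freed memory back to the OS immediately
    # instead of holding onto it (decay of 0 ms).
    pa.jemalloc_set_decay_ms(0)

    with pq.ParquetWriter(output_file, arrow_schema, compression='snappy') as writer:
        for rows_dataframe in function_that_yields_data():
            table = pa.Table.from_pydict(rows_dataframe, schema=arrow_schema)
            # Write explicit, smaller row groups rather than relying on the
            # 64-million-row default.
            writer.write_table(table, row_group_size=100_000)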
Just to confirm: this is a local disk (not a blob store?) that you are writing to?

If you can produce a minimal example that still seems to hold onto all the memory after trying these two items, please open a JIRA, as there could be a bug or some unexpected buffering happening.

Thanks,
Micah

[1] https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html#pyarrow.parquet.ParquetWriter.write_table
[2] https://github.com/apache/arrow/blob/a4eb08d54ee0d4c0d0202fa0a2dfa8af7aad7a05/python/pyarrow/memory.pxi#L156
[3] https://github.com/apache/arrow/blob/a4eb08d54ee0d4c0d0202fa0a2dfa8af7aad7a05/python/pyarrow/memory.pxi#L156

On Tue, Sep 15, 2020 at 8:46 AM Niklas B <niklas.biv...@enplore.com> wrote:

> First of all: thank you so much for all the hard work on Arrow, it's an
> awesome project.
>
> Hi,
>
> I'm trying to write a large Parquet file to disk (larger than memory)
> using PyArrow's ParquetWriter and write_table, but even though the file
> is written incrementally to disk, it still appears to keep the entire
> dataset in memory (eventually getting OOM killed). Basically, what I am
> trying to do is:
>
> with pq.ParquetWriter(
>     output_file,
>     arrow_schema,
>     compression='snappy',
>     allow_truncated_timestamps=True,
>     version='2.0',  # Highest available format version
>     data_page_version='2.0',  # Highest available data page version
> ) as writer:
>     for rows_dataframe in function_that_yields_data():
>         writer.write_table(
>             pa.Table.from_pydict(
>                 rows_dataframe,
>                 arrow_schema
>             )
>         )
>
> I have a function that yields the data, and I then write it in chunks
> using write_table.
>
> Is it possible to force the ParquetWriter to not keep the entire dataset
> in memory, or is that simply not possible for good reasons?
>
> I'm streaming data from a database and writing it to Parquet. The
> end-consumer has plenty of RAM, but the machine that does the conversion
> doesn't.
>
> Regards,
> Niklas
>
> PS: I've also created a Stack Overflow question, which I will update with
> any answer I might get from the mailing list:
> https://stackoverflow.com/questions/63891231/pyarrow-incrementally-using-parquetwriter-without-keeping-entire-dataset-in-mem