First of all: Thank you so much for all the hard work on Arrow, it's an awesome project.
Hi,

I'm trying to write a large Parquet file (larger than memory) to disk using PyArrow's ParquetWriter and write_table, but even though the file is written to disk incrementally, the writer still appears to keep the entire dataset in memory, and the process eventually gets OOM-killed. Basically, what I am trying to do is:

    with pq.ParquetWriter(
        output_file,
        arrow_schema,
        compression='snappy',
        allow_truncated_timestamps=True,
        version='2.0',            # Highest available format version
        data_page_version='2.0',  # Highest available data page version
    ) as writer:
        for rows_dataframe in function_that_yields_data():
            writer.write_table(
                pa.Table.from_pydict(
                    rows_dataframe,
                    arrow_schema
                )
            )

i.e. I have a function that yields data and I write it out in chunks using write_table. Is it possible to force the ParquetWriter to not keep the entire dataset in memory, or is that simply not possible, for good reasons?

I'm streaming data from a database and writing it to Parquet. The end consumer has plenty of RAM, but the machine that does the conversion doesn't.

Regards,
Niklas

PS: I've also created a Stack Overflow question, which I will update with any answer I get from the mailing list: https://stackoverflow.com/questions/63891231/pyarrow-incrementally-using-parquetwriter-without-keeping-entire-dataset-in-mem
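PPS: In case it helps anyone reproduce this, below is a minimal, self-contained sketch of the same loop. The schema and the generator are toy stand-ins for my real code, and the two extra things in it (capping row_group_size on write_table, and printing pa.total_allocated_bytes() after each chunk to see whether the growth happens inside Arrow's memory pool or somewhere else, e.g. the database cursor) are just what I would try next; I haven't confirmed they change the buffering behaviour.

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Toy stand-ins for the real schema / database generator, just so the loop runs.
    arrow_schema = pa.schema([('id', pa.int64()), ('value', pa.float64())])

    def function_that_yields_data():
        for i in range(10):
            start = i * 100_000
            yield {'id': list(range(start, start + 100_000)),
                   'value': [float(x) for x in range(100_000)]}

    with pq.ParquetWriter('output.parquet', arrow_schema, compression='snappy') as writer:
        for rows_dict in function_that_yields_data():
            table = pa.Table.from_pydict(rows_dict, schema=arrow_schema)
            # row_group_size caps how many rows go into a single row group;
            # unconfirmed whether this affects the writer's buffering.
            writer.write_table(table, row_group_size=10_000)
            del table
            # If the growth shows up here, the memory is being held by Arrow
            # itself rather than by the database cursor.
            print('arrow bytes allocated:', pa.total_allocated_bytes())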