For the record: this has been opened on JIRA as ARROW-10052.
Here is the analysis I posted there (pasted):

I'm not sure there's anything surprising. Running this thing a bit (in debug mode!), I see that RSS usage grows by 500-1000 bytes for each column chunk (that is, each column in a row group).

This seems to be simply the Parquet file metadata accumulating before it can be written at the end (when the ParquetWriter is closed). format::FileMetadata has a vector of format::RowGroup (one per row group); format::RowGroup has a vector of format::Column (one per column); and each format::Column holds non-trivial information: file name, column metadata (itself potentially large).

So, basically, you should write only large row groups to Parquet files. Writing 100 rows at a time makes the Parquet format completely inadequate; use at least 10000 or 100000 rows per row group, IMHO.

And, regardless of the row group size, if you grow a Parquet file by keeping the ParquetWriter open, memory usage will also grow a bit because of this accumulating file metadata. I see no way around that, other than avoiding this appending pattern.

Regards

Antoine.


On Tue, 15 Sep 2020 17:46:14 +0200
Niklas B <niklas.biv...@enplore.com> wrote:

> First of all: thank you so much for all the hard work on Arrow, it's an awesome project.
>
> Hi,
>
> I'm trying to write a large Parquet file to disk (larger than memory) using PyArrow's ParquetWriter and write_table, but even though the file is written incrementally to disk, it still appears to keep the entire dataset in memory (eventually getting OOM killed). Basically, what I am trying to do is:
>
> with pq.ParquetWriter(
>         output_file,
>         arrow_schema,
>         compression='snappy',
>         allow_truncated_timestamps=True,
>         version='2.0',  # Highest available format version
>         data_page_version='2.0',  # Highest available data page version
> ) as writer:
>     for rows_dataframe in function_that_yields_data():
>         writer.write_table(
>             pa.Table.from_pydict(
>                 rows_dataframe,
>                 arrow_schema
>             )
>         )
>
> Here I have a function that yields data, and I write it out in chunks using write_table.
>
> Is it possible to force the ParquetWriter to not keep the entire dataset in memory, or is that simply not possible, for good reasons?
>
> I'm streaming data from a database and writing it to Parquet. The end consumer has plenty of RAM, but the machine that does the conversion doesn't.
>
> Regards,
> Niklas
>
> PS: I've also created a Stack Overflow question, which I will update with any answer I might get from the mailing list:
> https://stackoverflow.com/questions/63891231/pyarrow-incrementally-using-parquetwriter-without-keeping-entire-dataset-in-mem
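
To illustrate the advice above, here is a rough, untested sketch of buffering the small chunks and flushing them as larger row groups. It borrows output_file, arrow_schema and function_that_yields_data() from Niklas's snippet and assumes each yielded chunk is a dict of equal-length columns:

import pyarrow as pa
import pyarrow.parquet as pq

ROWS_PER_ROW_GROUP = 100_000  # write large row groups, per the advice above

with pq.ParquetWriter(output_file, arrow_schema, compression='snappy') as writer:
    buffered_tables = []
    buffered_rows = 0
    for rows_dict in function_that_yields_data():
        chunk = pa.Table.from_pydict(rows_dict, arrow_schema)
        buffered_tables.append(chunk)
        buffered_rows += chunk.num_rows
        if buffered_rows >= ROWS_PER_ROW_GROUP:
            # Concatenate the small chunks and write them as one large row group.
            writer.write_table(pa.concat_tables(buffered_tables))
            buffered_tables = []
            buffered_rows = 0
    if buffered_tables:
        # Flush whatever is left as a final (possibly smaller) row group.
        writer.write_table(pa.concat_tables(buffered_tables))

This doesn't remove the metadata accumulation described above (the footer still grows by one format::RowGroup per flush), but with ~100000 rows per flush the number of row groups, and therefore the accumulated metadata, stays small.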