Hi everyone,

For some time now, parquet::ParquetFileWriter has been able to create
buffered row groups via AppendBufferedRowGroup(), which lets you write to
the columns in any order you like (in contrast to the previous approach of
writing one column after the other). This is great because it saves the
caller from having to build an in-memory columnar representation of its
data first.
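For reference, this is roughly what the buffered path looks like today. A
minimal sketch with a made-up two-column INT64 schema, simplified error
handling, and assuming the Result-returning FileOutputStream::Open():

#include <arrow/io/file.h>
#include <parquet/api/writer.h>
#include <parquet/exception.h>

// Minimal sketch: one buffered row group, columns written in reverse order,
// which the non-buffered API would not allow.
void WriteBufferedSketch(const std::string& path) {
  using parquet::schema::GroupNode;
  using parquet::schema::PrimitiveNode;

  parquet::schema::NodeVector fields;
  fields.push_back(PrimitiveNode::Make("a", parquet::Repetition::REQUIRED,
                                       parquet::Type::INT64));
  fields.push_back(PrimitiveNode::Make("b", parquet::Repetition::REQUIRED,
                                       parquet::Type::INT64));
  auto schema = std::static_pointer_cast<GroupNode>(
      GroupNode::Make("schema", parquet::Repetition::REQUIRED, fields));

  std::shared_ptr<arrow::io::FileOutputStream> sink;
  PARQUET_ASSIGN_OR_THROW(sink, arrow::io::FileOutputStream::Open(path));
  std::unique_ptr<parquet::ParquetFileWriter> writer =
      parquet::ParquetFileWriter::Open(sink, schema);

  parquet::RowGroupWriter* rg = writer->AppendBufferedRowGroup();
  int64_t value = 42;
  // The row group buffers each column chunk in memory, so any column index
  // can be written to at any time.
  static_cast<parquet::Int64Writer*>(rg->column(1))
      ->WriteBatch(1, nullptr, nullptr, &value);
  static_cast<parquet::Int64Writer*>(rg->column(0))
      ->WriteBatch(1, nullptr, nullptr, &value);
  writer->Close();
}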

However, when the data is huge compared to the available system memory
(due to a wide schema or a large row group size), this becomes problematic:
the buffers allocated internally can take up a large portion of the RAM of
the machine the conversion is running on.

One way to solve that problem would be to use memory-mapped files instead
of plain in-memory buffers. That way, the amount of required memory could
be bounded by roughly the number of columns times the OS page size,
independent of the row group size. Consequently, large row group sizes
would pose no problem with respect to RAM consumption.
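The I/O building block for this already exists, so the question is mainly
about wiring it into the row group writer. A small sketch of what I mean,
with a made-up path and size and assuming the Result-returning Create()
overload:

#include <arrow/io/file.h>
#include <arrow/result.h>
#include <arrow/status.h>

// arrow::io::MemoryMappedFile already provides a file-backed, writable
// buffer; only the pages currently being touched need to stay resident.
arrow::Status MemoryMappedScratchSketch(const std::string& scratch_path) {
  ARROW_ASSIGN_OR_RAISE(
      auto scratch,
      arrow::io::MemoryMappedFile::Create(scratch_path, /*size=*/64 * 1024));
  // A column writer could append its encoded pages here instead of into a
  // pool-allocated in-memory buffer.
  ARROW_RETURN_NOT_OK(scratch->Write("encoded column chunk bytes", 26));
  return scratch->Close();
}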

I wonder what you generally think about the idea of adding an
AppendFileBufferedRowGroup() (or similarly named) method that gives the
user the option to have the internal buffers backed by memory-mapped files.
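To make the idea concrete, a strawman of the call site. Both
AppendFileBufferedRowGroup() and its scratch_dir parameter are purely
hypothetical; this is just how I imagine it could be used:

// Strawman only: AppendFileBufferedRowGroup() and scratch_dir do not exist;
// this just illustrates how the proposed option might look to the caller.
void ConvertWideTable(parquet::ParquetFileWriter* writer,
                      const std::string& scratch_dir) {
  parquet::RowGroupWriter* rg =
      writer->AppendFileBufferedRowGroup(scratch_dir);  // hypothetical
  // ... write the columns in any order, exactly as with
  // AppendBufferedRowGroup(), while the per-column buffers live in
  // memory-mapped scratch files under scratch_dir ...
  rg->Close();
}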

After a quick look at how the buffers are managed inside Arrow (they are
allocated from a default memory pool), I have the impression that
implementing this idea could be a rather large change. I still wanted to
ask whether this is something you could see being integrated, or whether it
is out of scope for Arrow.
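For what it's worth, the pool is already configurable through
WriterProperties, so one (probably naive) seam would be a MemoryPool
implementation backed by a mapped file. The sketch below only shows the
existing hook, with LoggingMemoryPool standing in for such a pool:

#include <arrow/memory_pool.h>
#include <parquet/properties.h>

// The existing hook: the writer's allocations go through the MemoryPool set
// on WriterProperties. LoggingMemoryPool just makes them observable here; a
// file-backed pool could plug in at the same place.
std::shared_ptr<parquet::WriterProperties> PropertiesWithCustomPool() {
  // static so the pool outlives the WriterProperties holding a raw pointer
  static arrow::LoggingMemoryPool pool(arrow::default_memory_pool());
  return parquet::WriterProperties::Builder().memory_pool(&pool)->build();
}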

Thanks in advance and kind regards,
Roman
