Hey Iceberg Community,

I read this article
<https://cloud.google.com/blog/products/data-analytics/announcing-bigquery-tables-for-apache-iceberg>
the other day, and one part caught my attention (among others):
"For high-throughput streaming ingestion, ...  durably store recently
ingested tuples in a row-oriented format and periodically convert them to
Parquet."

So this made me wonder whether it would make sense for the Iceberg library to
support writers using one file format when ingesting and a different one when
compacting. Currently, we have "write.format.default" to tell the writers
which format to use when writing to the table.
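
For reference, today that single property can be set per table, e.g. via
Spark SQL (a sketch; the table name is made up):

```sql
-- Makes every writer on this table produce Avro data files.
ALTER TABLE db.events SET TBLPROPERTIES ('write.format.default' = 'avro');
```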
Similarly to the quote above, I played with the idea of choosing a format that
is fast to write for streaming ingests and then periodically compacting those
files into another format that is faster to read. Let's say: ingest using
Avro and compact into Parquet.

Do you think it would make sense to introduce another table property to
split the file format between those use cases? E.g.:
1) Introduce "write.compact.format.default" to tell the writers what format
to use for compactions and use existing "write.format.default" for
everything else.
Or
2) Introduce "write.stream-ingest.format.default" to tell the engines what
format to use for streaming ingest and use the existing
"write.format.default" for everything else.
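
To make the fallback behaviour concrete, here is a small sketch of how an
engine could resolve the format under option 1. Note that
"write.compact.format.default" is the property proposed above, not an
existing Iceberg property, and the resolution logic is just my assumption of
how it would behave:

```python
# Hypothetical resolution logic for option 1: a compaction writer prefers
# the proposed "write.compact.format.default" and falls back to the
# existing "write.format.default"; all other writers use the existing
# property. "parquet" is Iceberg's built-in default format.
DEFAULT_FORMAT = "parquet"

def resolve_write_format(table_properties, is_compaction=False):
    """Pick the data file format for a write operation."""
    if is_compaction:
        # Proposed property; fall back to the general default if unset.
        fmt = table_properties.get("write.compact.format.default")
        if fmt is not None:
            return fmt
    return table_properties.get("write.format.default", DEFAULT_FORMAT)

props = {
    "write.format.default": "avro",             # fast row-oriented ingest
    "write.compact.format.default": "parquet",  # columnar files after compaction
}

print(resolve_write_format(props))                      # avro
print(resolve_write_format(props, is_compaction=True))  # parquet
```

Option 2 would be the mirror image: streaming writers check the new property
and everything else (including compaction) keeps using
"write.format.default".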

What do you think?
Gabor
