Gabor,

The write format is a "default" because I intended it to be something that
engines could override. Where the default doesn't make sense because of
memory pressure (as you might see in ingestion processes), an engine could
override it and write a format that fits the use case better. Data services
could then compact those files into a better long-term format.
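
A rough sketch of that override behavior, in Python for brevity. The helper
resolve_write_format is hypothetical and is not an Iceberg API; the only
things taken from Iceberg are the "write.format.default" property name and
its documented default of parquet.

```python
# Sketch only: resolve_write_format is a hypothetical helper, not Iceberg API.
DEFAULT_FORMAT = "parquet"  # documented default for write.format.default


def resolve_write_format(table_properties, engine_override=None):
    """Return the file format a writer should use for one write.

    An explicit engine-level override wins; otherwise fall back to the
    table's write.format.default, and finally to parquet.
    """
    if engine_override is not None:
        return engine_override.lower()
    return table_properties.get("write.format.default", DEFAULT_FORMAT).lower()


table_props = {"write.format.default": "parquet"}

# A memory-constrained ingest job overrides the table default.
ingest_format = resolve_write_format(table_props, engine_override="avro")

# A compaction job passes no override and gets the table default.
compaction_format = resolve_write_format(table_props)
```

Here the ingest writer ends up with avro and the compaction job with
parquet, without any new table property.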

We could also do what you're suggesting and introduce a property so that
streaming or ingest writers are specifically allowed to write a
row-oriented format (Avro). I'm not sure how much value there is here,
though. Most processing engines now seem better able to co-locate records,
and the number of open columnar files is no longer a pressing concern.
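
For comparison, a sketch of what the property-based approach might look
like. "write.stream-ingest.format.default" is the name proposed in this
thread and is not an existing Iceberg property; format_for and the
operation labels are hypothetical as well.

```python
# Sketch only: the stream-ingest property is a proposal from this thread,
# and format_for is a hypothetical helper, not Iceberg API.
TABLE_DEFAULT = "write.format.default"
STREAM_INGEST = "write.stream-ingest.format.default"  # proposed, not real


def format_for(table_properties, operation):
    """Pick a file format based on the kind of write being performed."""
    default = table_properties.get(TABLE_DEFAULT, "parquet")
    if operation == "stream-ingest":
        # Use the ingest-specific format if set, else the table default.
        return table_properties.get(STREAM_INGEST, default)
    return default


props = {TABLE_DEFAULT: "parquet", STREAM_INGEST: "avro"}
streaming_format = format_for(props, "stream-ingest")
batch_format = format_for(props, "compaction")
```

With both properties set as above, streaming writes would use avro and
everything else would use parquet.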

Ryan

On Fri, Oct 25, 2024 at 8:26 AM Gabor Kaszab <gaborkas...@apache.org> wrote:

> Hey Iceberg Community,
>
> I read this article
> <https://cloud.google.com/blog/products/data-analytics/announcing-bigquery-tables-for-apache-iceberg>
> the other day and there is this part that caught my attention (amongst
> others):
> "For high-throughput streaming ingestion, ...  durably store recently
> ingested tuples in a row-oriented format and periodically convert them to
> Parquet."
>
> So this made me wonder if it makes sense to give some support from the
> Iceberg lib for the writers to write different file formats when ingesting
> and different when they are compacting. Currently, we have
> "write.format.default" to tell the writers what format to use when writing
> to the table.
> I played with the idea, similar to the quote above, of choosing a format
> that is faster to write for streaming ingests and then periodically
> compacting those files into another format that is faster to read. Let's
> say ingest using Avro and compact into Parquet.
>
> Do you think it would make sense to introduce another table property to
> split the file format between those use cases? E.g.:
> 1) Introduce "write.compact.format.default" to tell the writers what
> format to use for compactions and use existing "write.format.default" for
> everything else.
> Or
> 2) Introduce "write.stream-ingest.format.default" to tell the engines what
> format to use for streaming ingest and use the existing
> "write.format.default" for everything else?
>
> What do you think?
> Gabor
>
>
