I agree with Ryan. Engines usually provide an override capability that allows
users to choose a write format different from the table default if needed.
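To illustrate the precedence involved (a per-write engine option winning over the table default, as Spark's Iceberg writer does with its "write-format" option), here is a minimal Python sketch; the function is hypothetical and only models the lookup order:

```python
# Sketch of the usual precedence: a per-write engine option (e.g. Spark's
# "write-format" write option) overrides the table's "write.format.default".
def resolve_write_format(table_properties, write_options):
    if "write-format" in write_options:
        return write_options["write-format"]
    return table_properties.get("write.format.default", "parquet")

# Table defaults to Parquet, but a streaming writer can override to Avro:
props = {"write.format.default": "parquet"}
assert resolve_write_format(props, {}) == "parquet"
assert resolve_write_format(props, {"write-format": "avro"}) == "avro"
```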

There are many production use cases that write columnar formats (like
Parquet) during streaming ingestion. I don't necessarily agree that separate
file formats for streaming ingestion will become common. Ryan mentioned
co-location/clustering. There could also be Parquet tunings that reduce the
memory footprint.
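For concreteness, here is a small Python sketch of how an engine might resolve the format under Gabor's proposal (2) below; note that "write.stream-ingest.format.default" is the property name from the proposal, not an existing Iceberg property:

```python
# Hypothetical resolution under proposal (2): streaming ingest prefers a
# dedicated default and falls back to "write.format.default"; all other
# writes use "write.format.default" directly.
def resolve_format(props, is_streaming_ingest):
    default = props.get("write.format.default", "parquet")
    if is_streaming_ingest:
        return props.get("write.stream-ingest.format.default", default)
    return default

props = {"write.format.default": "parquet",
         "write.stream-ingest.format.default": "avro"}
assert resolve_format(props, is_streaming_ingest=True) == "avro"
assert resolve_format(props, is_streaming_ingest=False) == "parquet"
```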

On Fri, Oct 25, 2024 at 11:56 AM rdb...@gmail.com <rdb...@gmail.com> wrote:

> Gabor,
>
> The reason why the write format is a "default" is that I intended for it
> to be something that engines could override. For cases where it doesn't
> make sense to use the default because of memory pressure (as you might see
> in ingestion processes) you could choose to override and use a format that
> fits better with the use case. Then data services could go and compact into
> a better long-term format.
>
> We could also do what you're saying and introduce a property so that
> streaming or ingest writers are specifically allowed to write with a
> row-oriented format (Avro). I'm not sure how much value there is here,
> though. It seems that most processing engines are now better able to
> co-locate records, and the number of open columnar files is no longer a
> pressing concern.
>
> Ryan
>
> On Fri, Oct 25, 2024 at 8:26 AM Gabor Kaszab <gaborkas...@apache.org>
> wrote:
>
>> Hey Iceberg Community,
>>
>> I read this article
>> <https://cloud.google.com/blog/products/data-analytics/announcing-bigquery-tables-for-apache-iceberg>
>> the other day and there is this part that caught my attention (amongst
>> others):
>> "For high-throughput streaming ingestion, ...  durably store recently
>> ingested tuples in a row-oriented format and periodically convert them to
>> Parquet."
>>
>> So this made me wonder whether the Iceberg library should support writers
>> using one file format for ingestion and a different one for compaction.
>> Currently, we have "write.format.default" to tell the writers what format
>> to use when writing to the table.
>> I played with the idea, similar to the quote above, of choosing a format
>> that is faster to write for streaming ingest and then periodically
>> compacting those files into another format that is faster to read: say,
>> ingest using Avro and compact into Parquet.
>>
>> Do you think it would make sense to introduce another table property to
>> split the file format between those use cases? E.g.:
>> 1) Introduce "write.compact.format.default" to tell the writers what
>> format to use for compactions and use existing "write.format.default" for
>> everything else.
>> Or
>> 2) Introduce "write.stream-ingest.format.default" to tell the engines
>> what format to use for streaming ingest and use the existing
>> "write.format.default" for everything else?
>>
>> What do you think?
>> Gabor
>>
>>