I agree with Ryan. Engines usually provide an override capability that allows users to choose a write format different from the table default if needed.
There are many production use cases that write columnar formats (like Parquet) in streaming ingestion. I don't necessarily agree that it will be common to have separate file formats for streaming ingestion. Ryan mentioned co-location/clustering. There could also be Parquet tunings for memory footprint.

On Fri, Oct 25, 2024 at 11:56 AM rdb...@gmail.com <rdb...@gmail.com> wrote:

> Gabor,
>
> The reason why the write format is a "default" is that I intended for it
> to be something that engines could override. For cases where it doesn't
> make sense to use the default because of memory pressure (as you might see
> in ingestion processes), you could choose to override it and use a format
> that fits the use case better. Then data services could go and compact into
> a better long-term format.
>
> We could also do what you're saying and introduce a property so that
> streaming or ingest writers are specifically allowed to write with a
> row-oriented format (Avro). I'm not sure how much value there is here,
> though. It seems that most processing engines are now better able to
> co-locate records, and the number of open columnar files is no longer a
> pressing concern.
>
> Ryan
>
> On Fri, Oct 25, 2024 at 8:26 AM Gabor Kaszab <gaborkas...@apache.org>
> wrote:
>
>> Hey Iceberg Community,
>>
>> I read this article
>> <https://cloud.google.com/blog/products/data-analytics/announcing-bigquery-tables-for-apache-iceberg>
>> the other day, and there is one part that caught my attention (amongst
>> others):
>> "For high-throughput streaming ingestion, ... durably store recently
>> ingested tuples in a row-oriented format and periodically convert them to
>> Parquet."
>>
>> So this made me wonder if it makes sense to add support in the Iceberg
>> library for writers to use one file format when ingesting and a different
>> one when compacting. Currently, we have "write.format.default" to tell
>> writers what format to use when writing to the table.
>> I played with the idea, similarly to the quote above, of choosing a
>> format that is faster to write for streaming ingest and then periodically
>> compacting into another format that is faster to read. Let's say ingest
>> using Avro and compact into Parquet.
>>
>> Do you think it would make sense to introduce another table property to
>> split the file format between those use cases? E.g.:
>> 1) Introduce "write.compact.format.default" to tell the writers what
>> format to use for compactions, and use the existing "write.format.default"
>> for everything else.
>> Or
>> 2) Introduce "write.stream-ingest.format.default" to tell the engines
>> what format to use for streaming ingest, and use the existing
>> "write.format.default" for everything else.
>>
>> What do you think?
>> Gabor
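For concreteness, the precedence being discussed (an explicit engine override wins, then a use-case-specific property, then "write.format.default") could be sketched as below. This is a hypothetical illustration only, not actual Iceberg code: the property names are the ones proposed in the thread, and `resolve_write_format` is an invented helper, not an Iceberg API.

```python
# Hypothetical sketch of how an engine might pick the file format for a
# write, assuming the property names proposed in this thread. Not Iceberg API.

DEFAULT_FORMAT = "parquet"

# Proposed use-case-specific properties from the email (options 1 and 2).
USE_CASE_PROPERTIES = {
    "compact": "write.compact.format.default",
    "stream-ingest": "write.stream-ingest.format.default",
}

def resolve_write_format(table_properties, use_case="default",
                         engine_override=None):
    """Resolve the write format: engine override first, then the
    use-case-specific table property, then write.format.default."""
    if engine_override:
        return engine_override
    key = USE_CASE_PROPERTIES.get(use_case)
    if key and key in table_properties:
        return table_properties[key]
    return table_properties.get("write.format.default", DEFAULT_FORMAT)

# Example: option 2 from the email, ingest in Avro, compact into Parquet.
props = {
    "write.format.default": "parquet",
    "write.stream-ingest.format.default": "avro",
}
print(resolve_write_format(props, "stream-ingest"))  # avro
print(resolve_write_format(props, "compact"))        # parquet (falls back)
```

A data service would then rewrite the Avro ingest files into Parquet during compaction, which matches the "write row-oriented, periodically convert" pattern quoted from the article.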