I was playing around with Flink ingestion performance testing and found that
the compression codec is also an important factor: zstd gives much higher
write performance, while gzip gives a better compression ratio.
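For reference, the codec can be switched per table through the standard
"write.parquet.compression-codec" property. A minimal Java sketch (the Hadoop
catalog location and table name below are made up, just for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopCatalog;

public class CodecTuning {
  public static void main(String[] args) {
    // Hypothetical warehouse location and table name.
    HadoopCatalog catalog =
        new HadoopCatalog(new Configuration(), "hdfs://warehouse/path");
    Table table = catalog.loadTable(TableIdentifier.of("db", "events"));

    // zstd favors write throughput, gzip favors compression ratio.
    table.updateProperties()
        .set("write.parquet.compression-codec", "zstd")
        .commit();
  }
}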
So I would argue that there are more factors that could be optimized for
writing and later optimized by compaction for reading. I would leave this to
the engines, as they know best what is good for them. Maybe a Python writer
has no performant zstd compression codec and wants to use gzip immediately;
maybe a C++ implementation of a Parquet writer is more performant than an
Avro writer when dictionary usage is high... IMHO, the owner of the table
should decide the target/default configuration, but there might be as many
optimal writer configurations as there are engines/situations, so we
shouldn't introduce new configurations for these cases.

Thanks,
Peter

On Fri, Oct 25, 2024, 22:47 Steven Wu <stevenz...@gmail.com> wrote:

> I agree with Ryan. Engines usually provide an override capability that
> allows users to choose a different write format (than the table default)
> if needed.
>
> There are many production use cases that write columnar formats (like
> Parquet) in streaming ingestion. I don't necessarily agree that it will be
> common to have separate file formats for streaming ingestion. Ryan
> mentioned co-location/clustering. There could also be Parquet tunings for
> memory footprint.
>
> On Fri, Oct 25, 2024 at 11:56 AM rdb...@gmail.com <rdb...@gmail.com> wrote:
>
>> Gabor,
>>
>> The reason why the write format is a "default" is that I intended for it
>> to be something that engines could override. For cases where it doesn't
>> make sense to use the default because of memory pressure (as you might
>> see in ingestion processes), you could choose to override and use a
>> format that fits the use case better. Then data services could go and
>> compact into a better long-term format.
>>
>> We could also do what you're saying and introduce a property so that
>> streaming or ingest writers are specifically allowed to write with a
>> row-oriented format (Avro). I'm not sure how much value there is here,
>> though. It seems that most processing engines are now better able to
>> co-locate records, and the number of open columnar files is no longer a
>> pressing concern.
>>
>> Ryan
>>
>> On Fri, Oct 25, 2024 at 8:26 AM Gabor Kaszab <gaborkas...@apache.org>
>> wrote:
>>
>>> Hey Iceberg Community,
>>>
>>> I read this article
>>> <https://cloud.google.com/blog/products/data-analytics/announcing-bigquery-tables-for-apache-iceberg>
>>> the other day, and there is a part that caught my attention (amongst
>>> others):
>>> "For high-throughput streaming ingestion, ... durably store recently
>>> ingested tuples in a row-oriented format and periodically convert them
>>> to Parquet."
>>>
>>> This made me wonder whether it makes sense for the Iceberg lib to give
>>> writers some support for using one file format when ingesting and
>>> another when compacting. Currently, we have "write.format.default" to
>>> tell the writers what format to use when writing to the table.
>>> I played with the idea, similar to the quote above, of choosing a format
>>> that is faster to write for streaming ingests and then periodically
>>> compacting into another format that is faster to read. Let's say ingest
>>> using Avro and compact into Parquet.
>>>
>>> Do you think it would make sense to introduce another table property to
>>> split the file format between these use cases? E.g.:
>>> 1) Introduce "write.compact.format.default" to tell the writers what
>>> format to use for compactions and use the existing
>>> "write.format.default" for everything else.
>>> Or
>>> 2) Introduce "write.stream-ingest.format.default" to tell the engines
>>> what format to use for streaming ingest and use the existing
>>> "write.format.default" for everything else?
>>>
>>> What do you think?
>>> Gabor
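As a side note on the override path Ryan and Steven describe: if I remember
the Spark connector correctly, the per-write "write-format" option already
lets an ingest job deviate from the table default today. Roughly like this
(a sketch only; the table names are made up):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class AvroIngestSketch {
  public static void main(String[] args) throws Exception {
    SparkSession spark =
        SparkSession.builder().appName("ingest-sketch").getOrCreate();

    // Hypothetical source; in practice this would be the ingest input.
    Dataset<Row> batch = spark.table("staging.events_incoming");

    // The table default (write.format.default) stays Parquet; this single
    // write overrides it to Avro, and a later compaction can rewrite the
    // data files back into Parquet.
    batch.writeTo("db.events")
        .option("write-format", "avro")
        .append();
  }
}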