Hi Team,

We have a use case of ingesting a high frequency of small files into Iceberg. I know the documentation says Iceberg is designed for slow-moving data, but we really need the features Iceberg provides, such as reader/writer isolation, for our high-frequency small files (their ingestion is in our control, so we can delay it). *The limitation we hit: if we have a lot of incoming small files, each file is committed to Iceberg separately, and the commits keep failing because of optimistic concurrency.*
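For context, here is a minimal sketch of the per-file commit pattern we have today (Iceberg Java API; the loop and method name are only illustrative):

import java.util.List;

import org.apache.iceberg.DataFile;
import org.apache.iceberg.Table;

// Each incoming small file gets its own snapshot. With many concurrent
// writers, most commits lose the optimistic-concurrency race; Iceberg
// retries internally and, under sustained contention, eventually throws
// CommitFailedException.
void commitPerFile(Table table, List<DataFile> incoming) {
  for (DataFile file : incoming) {
    table.newAppend()
        .appendFile(file)
        .commit();
  }
}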
To work around this limitation, we convert this high-frequency data to low frequency using the following:

1. *Buffering based on dynamic rules:* We buffer input files and apply rules based on the criticality of the data. For instance, if the data is critical (determined by a field inside the data) and has to be available downstream quickly, it is committed to Iceberg right away (after every 15-minute window).

2. *Eager compaction:* As part of committing after buffering, we also do eager compaction to optimize for readers. This has worked well so far, but it has the pitfall of passing over the data twice, and hence more $$ (which is not a problem, as we need to compact the data anyway). However, as the frequency of data increases, the eager compaction starts to take 20-30 minutes, so the critical data doesn't make it to Iceberg for readers in time, delaying everything for us.

*We noticed that the work in [0] (thanks, Saisai) solves #2 for us, i.e. we can compact asynchronously without spending another 20-30 minutes on the critical data.*

*Clarification:* Is there a way to buffer data files for an Iceberg table without committing them to the table's state (the optimistic-concurrency limitation)? That way the metadata (i.e. manifests with stats) for the data files would already be available, and in a second run it is just a matter of accumulating those buffered data files into one commit. This would let us logically use the *Iceberg writer* for the target table, generate buffered output, and support the high-frequency-data use case; a rough sketch of what I have in mind is in the P.S. below. IIUC we can't buffer the data outside of Iceberg, because I need to generate stats for the data files the way Iceberg does, and committing them later is then just a metadata operation, compared to doing a second pass over the data to generate those stats again.

I am quite hopeful that this can be hacked together on top of Iceberg, but I wanted to hear the community's thoughts and any impediments that come to mind, and whether this might be a good addition to the community itself.

NOTE: WAP (write-audit-publish) does something similar, but it does commit to the table, so it doesn't help us with the optimistic-concurrency limitation.

[0]: https://github.com/apache/iceberg/pull/1083

Thanks,
Ashish
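P.S. To make the buffered-commit idea concrete, here is a rough sketch. It assumes the DataFile objects (with their stats) were produced earlier by Iceberg writers for this table and parked somewhere in the meantime; only the single batched commit is shown:

import java.util.List;

import org.apache.iceberg.AppendFiles;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.Table;

// DataFile objects already carry the column stats Iceberg collected at
// write time, so committing them later is a pure metadata operation.
void commitBuffered(Table table, List<DataFile> buffered) {
  AppendFiles append = table.newAppend();
  for (DataFile file : buffered) {
    append.appendFile(file);  // accumulate every buffered file
  }
  append.commit();            // one snapshot, one optimistic-concurrency round
}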