We've been building pipelines that write to Iceberg tables from Flink. Right now, we have applications deployed across 3 AWS regions, committing every 10 minutes. We also have an application that monitors the tables and moves files from remote regions into the region where we run our Hadoop clusters, and one that automatically merges small files in the background.
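For illustration, here is a minimal sketch (not Iceberg's actual API -- all names, thresholds, and sizes here are hypothetical) of the basic idea behind that background small-file merge job: scan the table's data file sizes, pick out the files below a "small" threshold, and bin-pack them into merge tasks that approach a target file size.

```python
# Toy model of planning a small-file compaction pass.
# Assumptions (not from the thread): a 128 MB target file size and a
# "small" threshold of half the target; real jobs would read file sizes
# from table metadata rather than take a plain list.

TARGET_SIZE = 128 * 1024 * 1024       # hypothetical target output size
SMALL_FILE_THRESHOLD = TARGET_SIZE // 2  # files below this are "small"


def plan_merge_tasks(file_sizes, target=TARGET_SIZE, threshold=SMALL_FILE_THRESHOLD):
    """Group small files into merge tasks whose combined size stays under target."""
    small = sorted(s for s in file_sizes if s < threshold)
    tasks, current, current_size = [], [], 0
    for size in small:
        # Close the current task when adding this file would overflow it.
        if current and current_size + size > target:
            tasks.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if len(current) > 1:  # merging a single file gains nothing
        tasks.append(current)
    return tasks


# Example: three small files and one already-large file (sizes in bytes).
# With target=50 and threshold=100, only [10, 20] fit in one task.
print(plan_merge_tasks([10, 20, 30, 200], target=50, threshold=100))  # [[10, 20]]
```

The bin-packing here is deliberately naive (sorted first-fit); it is just meant to show why a background process can keep commit counts manageable by turning many tiny appends into a few larger files.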
What we see is commits about every 2-3 minutes on average, with some periodic conflicts when writes across regions happen at the same time. The minimum gap between commits is one second, and it isn't uncommon to see commits less than 10 seconds apart.

My interpretation of this is that commit retries for appends are reasonably fast -- fast enough to support streaming writes a few minutes apart. I think these stats mean that we could definitely support structured streaming use cases. And we could also use a table's snapshot history to support reading from an Iceberg table as a streaming source.

On Mon, May 6, 2019 at 4:18 AM Anton Okolnychyi <aokolnyc...@apple.com.invalid> wrote:

> Hi,
>
> I would like to discuss the support for micro-batch streaming in Iceberg.
>
> First of all, do we think micro-batch use-cases are appropriate for
> Iceberg? What do we consider as "slow-moving data"? Do we want to support
> batch intervals of 30s? What about intervals of 1m/2m/5m? The latter seems
> doable as we already have FastAppend and other optimizations in place.
>
> In particular, I am interested in Spark structured streaming. I have a few
> things I want to discuss, but let's confirm it is appropriate for Iceberg.
>
> Thanks,
> Anton

--
Ryan Blue
Software Engineer
Netflix