That’s good news, Ryan. Your observations also align with some benchmarks I performed earlier.

I am also wondering whether it makes sense to have a config that limits the number of snapshots we want to track. This config could be based on the number of snapshots (e.g. keep only the last 10000 snapshots) or based on time (e.g. keep snapshots for the last 7 days). We can implement both, actually. AFAIK, the expiration of snapshots is manual right now. Would it make sense to control this via config options, or do we expect users to do this themselves? There is a first sketch below of what I have in mind.

Let’s talk about providing a sink for structured streaming. Spark provides queryId and epochId/batchId to all sinks, which must ensure that all writes are idempotent, because Spark might try to commit the same batch multiple times. So, we need to know the latest committed batchId for every query. One option is to store this information in the top-level table metadata. However, this breaks time traveling and rollbacks, so we need to have this mapping per snapshot. The snapshot summary seems like a reasonable choice. Would it make sense to do something similar to “total-records” and “total-files” to keep the latest committed batch id for each query? The second sketch below shows the idea.
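To make the expiration config concrete, here is a rough sketch of a maintenance hook. The property names are invented for illustration; only ExpireSnapshots and expireOlderThan are existing API:

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.iceberg.Snapshot;
  import org.apache.iceberg.Table;

  public class SnapshotRetention {
    // hypothetical table property names, for illustration only
    private static final String MAX_AGE_MS = "history.expire.max-snapshot-age-ms";
    private static final String MAX_COUNT = "history.expire.max-snapshot-count";

    public static void enforce(Table table) {
      String maxAge = table.properties().get(MAX_AGE_MS);
      if (maxAge != null) {
        // time-based: expire everything older than now - max age
        long cutoff = System.currentTimeMillis() - Long.parseLong(maxAge);
        table.expireSnapshots().expireOlderThan(cutoff).commit();
      }

      String maxCount = table.properties().get(MAX_COUNT);
      if (maxCount != null) {
        // count-based: table.snapshots() is in commit order, oldest first,
        // so expire everything before the oldest snapshot we want to keep
        // (assuming strictly increasing snapshot timestamps)
        List<Snapshot> snapshots = new ArrayList<>();
        table.snapshots().forEach(snapshots::add);
        int excess = snapshots.size() - Integer.parseInt(maxCount);
        if (excess > 0) {
          long cutoff = snapshots.get(excess).timestampMillis();
          table.expireSnapshots().expireOlderThan(cutoff).commit();
        }
      }
    }
  }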
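And a rough sketch of the summary idea. The summary key format is made up, it assumes we expose a way to attach custom summary properties to a snapshot (an AppendFiles.set-style hook), and it ignores races between concurrent writers for the same query:

  import org.apache.iceberg.AppendFiles;
  import org.apache.iceberg.DataFile;
  import org.apache.iceberg.Snapshot;
  import org.apache.iceberg.Table;

  public class StreamingCommits {
    // hypothetical summary key, one per streaming query
    private static String batchKey(String queryId) {
      return "spark.structured-streaming." + queryId + ".batch-id";
    }

    // commit a micro-batch; skip it if Spark is retrying an already
    // committed batch id (this is what makes the sink idempotent)
    public static void commitBatch(Table table, String queryId, long batchId,
                                   Iterable<DataFile> files) {
      if (batchId <= lastCommittedBatchId(table, queryId)) {
        return;
      }
      AppendFiles append = table.newAppend();
      files.forEach(append::appendFile);
      append.set(batchKey(queryId), String.valueOf(batchId));
      append.commit();
    }

    // walk the snapshot history backwards from the current snapshot, so the
    // answer stays consistent with the table state after a rollback
    public static long lastCommittedBatchId(Table table, String queryId) {
      Snapshot snapshot = table.currentSnapshot();
      while (snapshot != null) {
        String value = snapshot.summary().get(batchKey(queryId));
        if (value != null) {
          return Long.parseLong(value);
        }
        Long parentId = snapshot.parentId();
        snapshot = parentId == null ? null : table.snapshot(parentId);
      }
      return -1L; // nothing committed for this query yet
    }
  }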
Any other ideas are welcome.

Thanks,
Anton

> On 6 May 2019, at 20:09, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> 
> We've been building pipelines that write to Iceberg tables from Flink. Right
> now, we have applications deployed across 3 AWS regions and have them
> committing every 10 minutes. We also have an application that monitors the
> tables and moves files from remote regions into the region where we run our
> Hadoop clusters, and one that is automatically merging small files in the
> background.
> 
> What we see is commits about every 2-3 minutes on average, with some periodic
> conflicts when writes across regions happen at the same time. The minimum
> number of seconds between commits is 1 and it isn't uncommon to see commits
> less than 10 seconds apart. My interpretation of this is that commit retries
> for appends are reasonably fast -- fast enough to support streaming writes
> every few minutes apart.
> 
> I think these stats mean that we could definitely support structured
> streaming use cases. And, we could also use a table's snapshot history to
> support reading from an Iceberg table as a streaming source.
> 
> On Mon, May 6, 2019 at 4:18 AM Anton Okolnychyi
> <aokolnyc...@apple.com.invalid> wrote:
> Hi,
> 
> I would like to discuss the support for micro-batch streaming in Iceberg.
> 
> First of all, do we think micro-batch use-cases are appropriate for Iceberg?
> What do we consider as "slow-moving data"? Do we want to support batch
> intervals of 30s? What about intervals of 1m/2m/5m? The latter seems doable
> as we already have FastAppend and other optimizations in place.
> 
> In particular, I am interested in Spark structured streaming. I have a few
> things I want to discuss, but let's confirm it is appropriate for Iceberg.
> 
> Thanks,
> Anton
> 
> 
> --
> Ryan Blue
> Software Engineer
> Netflix