I would also support adding this to Iceberg itself. I think we have a use case where we can leverage this.
@Ryan, could you also provide more info on the audit process? Thanks, Anton > On 20 Jul 2019, at 04:01, RD <[email protected]> wrote: > > I think this could be useful. When we ingest data from Kafka, we do a > predefined set of checks on the data. We can potentially utilize something > like this to check for sanity before publishing. > > How is the auditing process suppose to find the new snapshot , since it is > not accessible from the table. Is it by convention? > > -R > > On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue <[email protected]> wrote: > Hi everyone, > > At Netflix, we have a pattern for building ETL jobs where we write data, then > audit the result before publishing the data that was written to a final > table. We call this WAP for write, audit, publish. > > We’ve added support in our Iceberg branch. A WAP write creates a new table > snapshot, but doesn’t make that snapshot the current version of the table. > Instead, a separate process audits the new snapshot and updates the table’s > current snapshot when the audits succeed. I wasn’t sure that this would be > useful anywhere else until we talked to another company this week that is > interested in the same thing. So I wanted to check whether this is a good > feature to include in Iceberg itself. > > This works by staging a snapshot. Basically, Spark writes data as expected, > but Iceberg detects that it should not update the table’s current stage. That > happens when there is a Spark property, spark.wap.id <http://spark.wap.id/>, > that indicates the job is a WAP job. Then any table that has WAP enabled by > the table property write.wap.enabled=true will stage the new snapshot instead > of fully committing, with the WAP ID in the snapshot’s metadata. > > Is this something we should open a PR to add to Iceberg? It seems a little > strange to make it appear that a commit has succeeded, but not actually > change a table, which is why we didn’t submit it before now. > > Thanks, > > rb > > -- > Ryan Blue > Software Engineer > Netflix
