I think this could be useful. When we ingest data from Kafka, we run a
predefined set of checks on the data. We could potentially use something
like this to sanity-check the data before publishing.
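
For concreteness, here is a minimal sketch of the kind of check I mean,
in Scala/Spark (the audit function and the event_id column are made-up
names, not anything from Iceberg):

  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions.col

  // Sanity checks of the sort we run on Kafka-ingested data before
  // it is published: the batch must be non-empty and key columns
  // must not contain nulls.
  def audit(df: DataFrame): Unit = {
    require(df.count() > 0, "staged data is empty")
    val nullKeys = df.filter(col("event_id").isNull).count()
    require(nullKeys == 0, s"$nullKeys rows have a null event_id")
  }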

How is the auditing process supposed to find the new snapshot, since it is
not accessible from the table? Is it by convention?

-R

On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Hi everyone,
>
> At Netflix, we have a pattern for building ETL jobs where we write data,
> then audit the result before publishing the data that was written to a
> final table. We call this WAP for write, audit, publish.
>
> We’ve added support in our Iceberg branch. A WAP write creates a new table
> snapshot, but doesn’t make that snapshot the current version of the table.
> Instead, a separate process audits the new snapshot and updates the table’s
> current snapshot when the audits succeed. I wasn’t sure that this would be
> useful anywhere else until we talked to another company this week that is
> interested in the same thing. So I wanted to check whether this is a good
> feature to include in Iceberg itself.
>
> This works by staging a snapshot. Basically, Spark writes data as
> expected, but Iceberg detects that it should not update the table’s current
> state. That happens when there is a Spark property, spark.wap.id, that
> indicates the job is a WAP job. Then any table that has WAP enabled by the
> table property write.wap.enabled=true will stage the new snapshot instead
> of fully committing, with the WAP ID in the snapshot’s metadata.
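>
> In code, a WAP write looks roughly like this (a sketch only; the table
> and job names are illustrative, and the exact write API depends on the
> Spark integration you’re using):
>
>   import org.apache.spark.sql.SparkSession
>
>   val spark = SparkSession.builder().getOrCreate()
>
>   // Mark this job as a WAP job via the Spark property described above.
>   spark.conf.set("spark.wap.id", "nightly-20190719")
>
>   // The target table opts in through the write.wap.enabled=true table
>   // property. With both set, Iceberg stages the new snapshot, tagged
>   // with the WAP ID in its metadata, instead of making it current.
>   val df = spark.table("kafka_staging") // illustrative source of new rows
>   df.write.format("iceberg").mode("append").save("db.events")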
>
> Is this something we should open a PR to add to Iceberg? It seems a little
> strange to make it appear that a commit has succeeded, but not actually
> change a table, which is why we didn’t submit it before now.
>
> Thanks,
>
> rb
> --
> Ryan Blue
> Software Engineer
> Netflix
>
