Audits run against the staged snapshot by setting the snapshot-id read
option to read the WAP snapshot, even though it is not (yet) the current
table state. This is documented in the time travel
<http://iceberg.apache.org/spark/#time-travel> section of the Iceberg site.
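To illustrate the read side, here is a minimal toy sketch of the idea using an in-memory table model rather than the real Iceberg/Spark APIs (Snapshot, Table, and read_table are illustrative names, not Iceberg classes):

```python
from dataclasses import dataclass, field

@dataclass
class Snapshot:
    snapshot_id: int
    files: list
    summary: dict = field(default_factory=dict)

@dataclass
class Table:
    snapshots: list
    current_snapshot_id: int  # the published table state

def read_table(table, snapshot_id=None):
    """Read the current snapshot, or time-travel to a specific snapshot ID."""
    target = snapshot_id if snapshot_id is not None else table.current_snapshot_id
    snap = next(s for s in table.snapshots if s.snapshot_id == target)
    return snap.files

# One published snapshot, plus a staged WAP snapshot that is not current.
published = Snapshot(1, ["a.parquet"])
staged = Snapshot(2, ["a.parquet", "b.parquet"], {"wap.id": "job-42"})
table = Table([published, staged], current_snapshot_id=1)

print(read_table(table))                 # normal readers see only published data
print(read_table(table, snapshot_id=2))  # the audit reads the staged snapshot
```

The point is that a normal read never sees the staged data, while an audit process that knows the snapshot ID can read it before it becomes the current state.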

We added a stageOnly method to SnapshotProducer that adds the snapshot to
table metadata, but does not make it the current table state. That is
called by the Spark writer when there is a WAP ID, and that ID is embedded
in the staged snapshot’s metadata so processes can find it.
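A rough sketch of that flow, again against a toy in-memory model (the real stageOnly operates on Iceberg table metadata via SnapshotProducer; stage_only, find_staged, and publish here are illustrative names):

```python
from dataclasses import dataclass, field

@dataclass
class Snapshot:
    snapshot_id: int
    files: list
    summary: dict = field(default_factory=dict)

@dataclass
class Table:
    snapshots: list = field(default_factory=list)
    current_snapshot_id: int = -1

def stage_only(table, snapshot, wap_id):
    """Add the snapshot to table metadata without making it current."""
    snapshot.summary["wap.id"] = wap_id
    table.snapshots.append(snapshot)

def find_staged(table, wap_id):
    """Locate a staged snapshot by the WAP ID embedded in its metadata."""
    return next(s for s in table.snapshots if s.summary.get("wap.id") == wap_id)

def publish(table, wap_id):
    """After audits pass, make the staged snapshot the current table state."""
    table.current_snapshot_id = find_staged(table, wap_id).snapshot_id

table = Table([Snapshot(1, ["a.parquet"])], current_snapshot_id=1)
stage_only(table, Snapshot(2, ["a.parquet", "b.parquet"]), wap_id="job-42")
assert table.current_snapshot_id == 1  # staging does not change the current state
publish(table, "job-42")
assert table.current_snapshot_id == 2  # audits passed, snapshot is now published
```

Embedding the WAP ID in the snapshot summary is what lets the audit process find the staged snapshot without it being reachable from the current table state.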

I'll add a PR with this code, since there is interest.

rb

On Mon, Jul 22, 2019 at 2:17 AM Anton Okolnychyi <[email protected]>
wrote:

> I would also support adding this to Iceberg itself. I think we have a use
> case where we can leverage this.
>
> @Ryan, could you also provide more info on the audit process?
>
> Thanks,
> Anton
>
> On 20 Jul 2019, at 04:01, RD <[email protected]> wrote:
>
> I think this could be useful. When we ingest data from Kafka, we run a
> predefined set of checks on the data. We could potentially use something
> like this to sanity-check the data before publishing.
>
> How is the auditing process supposed to find the new snapshot, since it is
> not accessible from the table? Is it by convention?
>
> -R
>
> On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue <[email protected]>
> wrote:
>
>> Hi everyone,
>>
>> At Netflix, we have a pattern for building ETL jobs where we write data,
>> then audit the result before publishing the data that was written to a
>> final table. We call this WAP for write, audit, publish.
>>
>> We’ve added support in our Iceberg branch. A WAP write creates a new
>> table snapshot, but doesn’t make that snapshot the current version of the
>> table. Instead, a separate process audits the new snapshot and updates the
>> table’s current snapshot when the audits succeed. I wasn’t sure that this
>> would be useful anywhere else until we talked to another company this week
>> that is interested in the same thing. So I wanted to check whether this is
>> a good feature to include in Iceberg itself.
>>
>> This works by staging a snapshot. Basically, Spark writes data as
>> expected, but Iceberg detects that it should not update the table’s current
>> state. That happens when there is a Spark property, spark.wap.id, that
>> indicates the job is a WAP job. Then any table that has WAP enabled by the
>> table property write.wap.enabled=true will stage the new snapshot
>> instead of fully committing, with the WAP ID in the snapshot’s metadata.
>>
>> Is this something we should open a PR to add to Iceberg? It seems a
>> little strange to make it appear that a commit has succeeded but not
>> actually change the table, which is why we didn’t submit it before now.
>>
>> Thanks,
>>
>> rb
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>

-- 
Ryan Blue
Software Engineer
Netflix
