Ashish, you can use the rollback table operation to set a particular snapshot as the current table state. Like this:
Table table = hiveCatalog.loadTable(name);
table.rollback().toSnapshotId(id).commit();

On Fri, Nov 8, 2019 at 12:52 PM Ashish Mehta <mehta.ashis...@gmail.com> wrote:

> Hi Ryan,
>
> Can you please help point me to the doc where I can find how to publish a
> WAP snapshot? I am able to filter the snapshot based on the wap.id in the
> snapshot's summary, but I am unsure of the official recommendation for
> committing that snapshot. I can think of cherry-picking the appended/deleted
> files, but I don't know whether I would be missing something important.
>
> Thanks,
> -Ashish
>
>
>> ---------- Forwarded message ---------
>> From: Ryan Blue <rb...@netflix.com.invalid>
>> Date: Wed, Jul 31, 2019 at 4:41 PM
>> Subject: Re: [DISCUSS] Write-audit-publish support
>> To: Edgar Rodriguez <edgar.rodrig...@airbnb.com>
>> Cc: Iceberg Dev List <dev@iceberg.apache.org>, Anton Okolnychyi <aokolnyc...@apple.com>
>>
>>
>> Hi everyone, I've added PR #342
>> <https://github.com/apache/incubator-iceberg/pull/342> to the Iceberg
>> repository with our WAP changes. Please have a look if you were interested
>> in this.
>>
>> On Mon, Jul 22, 2019 at 11:05 AM Edgar Rodriguez <edgar.rodrig...@airbnb.com> wrote:
>>
>>> I think this use case is pretty helpful in most data environments; we do
>>> the same sort of stage-check-publish pattern to run quality checks.
>>> One question: if, say, the audit part fails, is there a way to expire
>>> the snapshot, or what would be the workflow that follows?
>>>
>>> Best,
>>> Edgar
>>>
>>> On Mon, Jul 22, 2019 at 9:59 AM Mouli Mukherjee <moulimukher...@gmail.com> wrote:
>>>
>>>> This would be super helpful. We have a similar workflow where we do
>>>> some validation before letting the downstream consume the changes.
>>>>
>>>> Best,
>>>> Mouli
>>>>
>>>> On Mon, Jul 22, 2019 at 9:18 AM Filip <filip....@gmail.com> wrote:
>>>>
>>>>> This definitely sounds interesting. Quick question on whether this
>>>>> presents any impact on the current Upserts spec?
>>>>> Or is it maybe that we are
>>>>> looking to associate this support with append-only commits?
>>>>>
>>>>> On Mon, Jul 22, 2019 at 6:51 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>>
>>>>>> Audits run on the snapshot by setting the snapshot-id read option to
>>>>>> read the WAP snapshot, even though it has not (yet) been made the current
>>>>>> table state. This is documented in the time travel
>>>>>> <http://iceberg.apache.org/spark/#time-travel> section of the
>>>>>> Iceberg site.
>>>>>>
>>>>>> We added a stageOnly method to SnapshotProducer that adds the
>>>>>> snapshot to table metadata but does not make it the current table state.
>>>>>> That is called by the Spark writer when there is a WAP ID, and that ID is
>>>>>> embedded in the staged snapshot’s metadata so processes can find it.
>>>>>>
>>>>>> I'll add a PR with this code, since there is interest.
>>>>>>
>>>>>> rb
>>>>>>
>>>>>> On Mon, Jul 22, 2019 at 2:17 AM Anton Okolnychyi <aokolnyc...@apple.com> wrote:
>>>>>>
>>>>>>> I would also support adding this to Iceberg itself. I think we have
>>>>>>> a use case where we can leverage this.
>>>>>>>
>>>>>>> @Ryan, could you also provide more info on the audit process?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Anton
>>>>>>>
>>>>>>> On 20 Jul 2019, at 04:01, RD <rdsr...@gmail.com> wrote:
>>>>>>>
>>>>>>> I think this could be useful. When we ingest data from Kafka, we do
>>>>>>> a predefined set of checks on the data. We can potentially utilize
>>>>>>> something like this to check for sanity before publishing.
>>>>>>>
>>>>>>> How is the auditing process supposed to find the new snapshot, since
>>>>>>> it is not accessible from the table? Is it by convention?
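[The lookup described above — a process finding a staged snapshot by the WAP ID embedded in its summary metadata — can be sketched as below. This is a minimal, self-contained illustration: the `WapLookup` class and the plain `Map` stand-in for snapshot summaries are hypothetical, not Iceberg API; in Iceberg itself you would iterate the table's snapshots and inspect each snapshot's summary map the same way.]

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of the "find the staged snapshot by convention" step:
// scan snapshot summaries for a "wap.id" entry matching the job's WAP ID.
public class WapLookup {
    static Long findWapSnapshot(Map<Long, Map<String, String>> summaries, String wapId) {
        for (Map.Entry<Long, Map<String, String>> e : summaries.entrySet()) {
            if (wapId.equals(e.getValue().get("wap.id"))) {
                return e.getKey();  // snapshot ID whose summary carries the WAP ID
            }
        }
        return null;  // no staged snapshot for this WAP ID
    }

    public static void main(String[] args) {
        // Hypothetical summaries: snapshot 100 is a normal commit,
        // snapshot 200 was staged by a WAP job with ID "etl-2019-11-08".
        Map<Long, Map<String, String>> summaries = new LinkedHashMap<>();
        summaries.put(100L, Map.of("operation", "append"));
        summaries.put(200L, Map.of("operation", "append", "wap.id", "etl-2019-11-08"));

        System.out.println(findWapSnapshot(summaries, "etl-2019-11-08"));  // prints 200
    }
}
```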
>>>>>>>
>>>>>>> -R
>>>>>>>
>>>>>>> On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> At Netflix, we have a pattern for building ETL jobs where we write
>>>>>>>> data, then audit the result before publishing the data that was written to
>>>>>>>> a final table. We call this WAP, for write, audit, publish.
>>>>>>>>
>>>>>>>> We’ve added support in our Iceberg branch. A WAP write creates a
>>>>>>>> new table snapshot, but doesn’t make that snapshot the current version of
>>>>>>>> the table. Instead, a separate process audits the new snapshot and updates
>>>>>>>> the table’s current snapshot when the audits succeed. I wasn’t sure that
>>>>>>>> this would be useful anywhere else until we talked to another company this
>>>>>>>> week that is interested in the same thing. So I wanted to check whether
>>>>>>>> this is a good feature to include in Iceberg itself.
>>>>>>>>
>>>>>>>> This works by staging a snapshot. Basically, Spark writes data as
>>>>>>>> expected, but Iceberg detects that it should not update the table’s current
>>>>>>>> state. That happens when there is a Spark property, spark.wap.id, that
>>>>>>>> indicates the job is a WAP job. Then any table that has WAP enabled by the
>>>>>>>> table property write.wap.enabled=true will stage the new snapshot instead
>>>>>>>> of fully committing, with the WAP ID in the snapshot’s metadata.
>>>>>>>>
>>>>>>>> Is this something we should open a PR to add to Iceberg? It seems a
>>>>>>>> little strange to make it appear that a commit has succeeded but not
>>>>>>>> actually change the table, which is why we didn’t submit it before now.
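[A sketch of the two settings named in the proposal above. This is not a verified recipe: it assumes an Iceberg `Table` object named `table` and a `SparkSession` named `spark` are already in scope, and the WAP ID string is a placeholder chosen by the job.]

```java
// 1. Opt the table into WAP staging via the table property from the proposal.
//    UpdateProperties is the Iceberg API for changing table properties.
table.updateProperties()
    .set("write.wap.enabled", "true")
    .commit();

// 2. Mark the Spark job as a WAP job. With this property set, a write to a
//    WAP-enabled table stages the new snapshot (recording this ID in the
//    snapshot's summary) instead of making it the current table state.
spark.conf().set("spark.wap.id", "etl-2019-11-08");
```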
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> rb
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Software Engineer
>>>>>>>> Netflix
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>
>>>>>
>>>>> --
>>>>> Filip Bocse
>>>>
>>>
>>> --
>>> Edgar Rodriguez
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>

--
Ryan Blue
Software Engineer
Netflix
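[Putting the reply at the top of the thread together with the wap.id convention discussed here, the publish step could look roughly like this. A sketch only: it assumes a `HiveCatalog` named `hiveCatalog`, a `TableIdentifier` named `name`, and the WAP ID used by the staging job; the API calls are those named in the thread and are not re-verified against a particular Iceberg release.]

```java
// Load the table through the catalog.
Table table = hiveCatalog.loadTable(name);

// Find the staged snapshot carrying our WAP ID in its summary metadata.
Long staged = null;
for (Snapshot snap : table.snapshots()) {
    if ("etl-2019-11-08".equals(snap.summary().get("wap.id"))) {
        staged = snap.snapshotId();
    }
}

// Publish: make the audited snapshot the current table state.
if (staged != null) {
    table.rollback().toSnapshotId(staged).commit();
}
```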