Right now, there isn't a good way to manage multiple pending writes.
Snapshots from each write are created based on the current table state, so
simply moving to one of two pending commits would mean you ignore the
changes in the other pending commit. We've considered adding a
"cherry-pick" operation that can take the changes from one snapshot and
apply them on top of another to solve that problem. If you'd like to
implement that, I'd be happy to review it!
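A cherry-pick operation like that might look roughly like the sketch below. Note this is purely hypothetical: the `cherrypick()` entry point and `ofSnapshotId` method are assumptions about what a future API could look like, since the operation is only proposed above and does not exist yet.

```java
// Hypothetical sketch only -- this API does not exist in Iceberg today.
Table table = hiveCatalog.load(name);

// Re-apply the changes recorded in one staged snapshot on top of the
// current table state, committing the result as a new snapshot.
table.cherrypick()
    .ofSnapshotId(stagedSnapshotId)
    .commit();
```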

On Fri, Nov 8, 2019 at 3:29 PM Ashish Mehta <mehta.ashis...@gmail.com>
wrote:

> Thanks Ryan, that worked out. Since it's a rollback, I wonder how a user
> can stage multiple WAP snapshots and commit them in any order, depending on
> how the audit process works out?
> I wonder whether this expectation goes against the underlying principles of
> Iceberg.
>
> Thanks,
> Ashish
>
> On Fri, Nov 8, 2019 at 2:44 PM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> Ashish, you can use the rollback table operation to set a particular
>> snapshot as the current table state. Like this:
>>
>> Table table = hiveCatalog.load(name);
>> table.rollback().toSnapshotId(id).commit();
>>
>>
>> On Fri, Nov 8, 2019 at 12:52 PM Ashish Mehta <mehta.ashis...@gmail.com>
>> wrote:
>>
>>> Hi Ryan,
>>>
>>> Can you please point me to the doc that describes how to publish a
>>> WAP snapshot? I am able to filter the snapshot based on wap.id in the
>>> snapshot summary, but I don't know the official recommendation for
>>> committing that snapshot. I can think of cherry-picking the appended/deleted
>>> files, but I don't know whether I'd be missing something important with this.
>>>
>>> Thanks,
>>> -Ashish
>>>
>>>
>>>> ---------- Forwarded message ---------
>>>> From: Ryan Blue <rb...@netflix.com.invalid>
>>>> Date: Wed, Jul 31, 2019 at 4:41 PM
>>>> Subject: Re: [DISCUSS] Write-audit-publish support
>>>> To: Edgar Rodriguez <edgar.rodrig...@airbnb.com>
>>>> Cc: Iceberg Dev List <dev@iceberg.apache.org>, Anton Okolnychyi <
>>>> aokolnyc...@apple.com>
>>>>
>>>>
>>>> Hi everyone, I've added PR #342
>>>> <https://github.com/apache/incubator-iceberg/pull/342> to the Iceberg
>>>> repository with our WAP changes. Please have a look if you're interested
>>>> in this.
>>>>
>>>> On Mon, Jul 22, 2019 at 11:05 AM Edgar Rodriguez <
>>>> edgar.rodrig...@airbnb.com> wrote:
>>>>
>>>>> I think this use case is pretty helpful in most data environments; we
>>>>> do the same sort of stage-check-publish pattern to run quality checks.
>>>>> One question: if the audit part fails, is there a way to expire
>>>>> the snapshot, or what would the workflow be that follows?
>>>>>
>>>>> Best,
>>>>> Edgar
>>>>>
>>>>> On Mon, Jul 22, 2019 at 9:59 AM Mouli Mukherjee <
>>>>> moulimukher...@gmail.com> wrote:
>>>>>
>>>>>> This would be super helpful. We have a similar workflow where we do
>>>>>> some validation before letting the downstream consume the changes.
>>>>>>
>>>>>> Best,
>>>>>> Mouli
>>>>>>
>>>>>> On Mon, Jul 22, 2019 at 9:18 AM Filip <filip....@gmail.com> wrote:
>>>>>>
>>>>>>> This definitely sounds interesting. Quick question: does this
>>>>>>> impact the current upserts spec? Or are we looking to associate this
>>>>>>> support with append-only commits?
>>>>>>>
>>>>>>> On Mon, Jul 22, 2019 at 6:51 PM Ryan Blue <rb...@netflix.com.invalid>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Audits run on the snapshot by setting the snapshot-id read option
>>>>>>>> to read the WAP snapshot, even though it is not (yet) the current
>>>>>>>> table state. This is documented in the time travel
>>>>>>>> <http://iceberg.apache.org/spark/#time-travel> section of the
>>>>>>>> Iceberg site.
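For reference, reading a staged snapshot by ID with the snapshot-id read option looks like the following; the table name and snapshot ID are placeholders:

```java
// Read the WAP snapshot directly, even though it is not the current
// table state. "db.table" and the snapshot ID are placeholder values.
Dataset<Row> audited = spark.read()
    .format("iceberg")
    .option("snapshot-id", "10963874102873")
    .load("db.table");
```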
>>>>>>>>
>>>>>>>> We added a stageOnly method to SnapshotProducer that adds the
>>>>>>>> snapshot to table metadata, but does not make it the current table 
>>>>>>>> state.
>>>>>>>> That is called by the Spark writer when there is a WAP ID, and that ID 
>>>>>>>> is
>>>>>>>> embedded in the staged snapshot’s metadata so processes can find it.
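To make the lookup concrete, here is a small self-contained sketch of how an audit process could find a staged snapshot by its WAP ID. Snapshot summaries are modeled as plain maps standing in for Iceberg's Snapshot.summary(), so the class and method names here are illustrative, not the real API.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

public class WapLookup {
    // Given a mapping of snapshot-id -> summary metadata, return the id of
    // the staged snapshot whose summary carries the requested wap.id, if any.
    static Optional<Long> findByWapId(Map<Long, Map<String, String>> summaries,
                                      String wapId) {
        return summaries.entrySet().stream()
            .filter(e -> wapId.equals(e.getValue().get("wap.id")))
            .map(Map.Entry::getKey)
            .findFirst();
    }

    public static void main(String[] args) {
        Map<Long, Map<String, String>> summaries = new LinkedHashMap<>();
        Map<String, String> staged = new HashMap<>();
        staged.put("wap.id", "audit-2019-07-22");
        summaries.put(42L, new HashMap<>()); // a normal committed snapshot
        summaries.put(43L, staged);          // the staged WAP snapshot

        System.out.println(findByWapId(summaries, "audit-2019-07-22").get());
    }
}
```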
>>>>>>>>
>>>>>>>> I'll add a PR with this code, since there is interest.
>>>>>>>>
>>>>>>>> rb
>>>>>>>>
>>>>>>>> On Mon, Jul 22, 2019 at 2:17 AM Anton Okolnychyi <
>>>>>>>> aokolnyc...@apple.com> wrote:
>>>>>>>>
>>>>>>>>> I would also support adding this to Iceberg itself. I think we
>>>>>>>>> have a use case where we can leverage this.
>>>>>>>>>
>>>>>>>>> @Ryan, could you also provide more info on the audit process?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Anton
>>>>>>>>>
>>>>>>>>> On 20 Jul 2019, at 04:01, RD <rdsr...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> I think this could be useful. When we ingest data from Kafka, we
>>>>>>>>> do a predefined set of checks on the data. We can potentially utilize
>>>>>>>>> something like this to check for sanity before publishing.
>>>>>>>>>
>>>>>>>>> How is the auditing process supposed to find the new snapshot,
>>>>>>>>> since it is not accessible from the table? Is it by convention?
>>>>>>>>>
>>>>>>>>> -R
>>>>>>>>>
>>>>>>>>> On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue <
>>>>>>>>> rb...@netflix.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> Hi everyone,
>>>>>>>>>>
>>>>>>>>>> At Netflix, we have a pattern for building ETL jobs where we
>>>>>>>>>> write data, then audit the result before publishing the data that was
>>>>>>>>>> written to a final table. We call this WAP for write, audit, publish.
>>>>>>>>>>
>>>>>>>>>> We’ve added support in our Iceberg branch. A WAP write creates a
>>>>>>>>>> new table snapshot, but doesn’t make that snapshot the current 
>>>>>>>>>> version of
>>>>>>>>>> the table. Instead, a separate process audits the new snapshot and 
>>>>>>>>>> updates
>>>>>>>>>> the table’s current snapshot when the audits succeed. I wasn’t sure 
>>>>>>>>>> that
>>>>>>>>>> this would be useful anywhere else until we talked to another 
>>>>>>>>>> company this
>>>>>>>>>> week that is interested in the same thing. So I wanted to check 
>>>>>>>>>> whether
>>>>>>>>>> this is a good feature to include in Iceberg itself.
>>>>>>>>>>
>>>>>>>>>> This works by staging a snapshot. Basically, Spark writes data as
>>>>>>>>>> expected, but Iceberg detects that it should not update the table’s
>>>>>>>>>> current state. That happens when there is a Spark property, spark.wap.id,
>>>>>>>>>> that indicates the job is a WAP job. Then any table that has WAP enabled by
>>>>>>>>>> the table property write.wap.enabled=true will stage the new
>>>>>>>>>> snapshot instead of fully committing, with the WAP ID in the snapshot’s
>>>>>>>>>> metadata.
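Concretely, the setup described above amounts to two settings; this sketch assumes a standard SparkSession and a loaded Table handle, and the WAP ID value is a placeholder:

```java
// Mark this Spark job as a WAP job; the ID value is a placeholder chosen
// by the pipeline, e.g. a run or audit identifier.
spark.conf().set("spark.wap.id", "audit-2019-07-22");

// Enable WAP staging on the table via the table property.
table.updateProperties()
    .set("write.wap.enabled", "true")
    .commit();
```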
>>>>>>>>>>
>>>>>>>>>> Is this something we should open a PR to add to Iceberg? It seems
>>>>>>>>>> a little strange to make it appear that a commit has succeeded, but 
>>>>>>>>>> not
>>>>>>>>>> actually change a table, which is why we didn’t submit it before now.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> rb
>>>>>>>>>> --
>>>>>>>>>> Ryan Blue
>>>>>>>>>> Software Engineer
>>>>>>>>>> Netflix
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Software Engineer
>>>>>>>> Netflix
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Filip Bocse
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Edgar Rodriguez
>>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
Ryan Blue
Software Engineer
Netflix
