Re: [DISCUSS] Write-audit-publish support

Miao Wang Mon, 11 Nov 2019 12:16:46 -0800

From a timeline perspective, we can’t work on implementing this feature in the 
next couple of months. As a short-term workaround, we have chosen a lock 
mechanism at the application level.


@Anton Okolnychyi <[email protected]>: if you can pick up this feature, that 
would be great!

Thanks!

Miao

From: Ryan Blue <[email protected]>
Reply-To: "[email protected]" <[email protected]>, 
"[email protected]" <[email protected]>
Date: Monday, November 11, 2019 at 11:54 AM
To: Anton Okolnychyi <[email protected]>
Cc: Iceberg Dev List <[email protected]>, Ashish Mehta 
<[email protected]>
Subject: Re: [DISCUSS] Write-audit-publish support

I just had a direct request for this over the weekend, too. I opened #629 Add 
cherry-pick operation <https://github.com/apache/incubator-iceberg/issues/629> 
to track this.

On Mon, Nov 11, 2019 at 1:43 AM Anton Okolnychyi <[email protected]> wrote:
We would be interested in this functionality as well. We have a use case with 
multiple concurrent writers where we wanted to use WAP but couldn’t.


On 9 Nov 2019, at 01:32, Ryan Blue <[email protected]> wrote:

Right now, there isn't a good way to manage multiple pending writes. Snapshots 
from each write are created based on the current table state, so simply moving 
to one of two pending commits would mean you ignore the changes in the other 
pending commit. We've considered adding a "cherry-pick" operation that can take 
the changes from one snapshot and apply them on top of another to solve that 
problem. If you'd like to implement that, I'd be happy to review it!
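
To make the problem concrete, here is a minimal toy model in plain Java. It is 
not the Iceberg API and not the proposed implementation: the class 
CherryPickSketch, the Snap type, and the append and cherryPick helpers are all 
hypothetical, and a snapshot is reduced to the set of file names it exposes. 
The sketch covers appends only; deletes and conflict checks are left out.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Toy model only: these types are hypothetical and are NOT Iceberg classes.
class CherryPickSketch {

    // A snapshot is modeled as the set of data files visible in it, plus the
    // file set of the table state it was written against (its parent).
    static final class Snap {
        final Set<String> parentFiles;
        final Set<String> files;

        Snap(Set<String> parentFiles, Set<String> files) {
            this.parentFiles = parentFiles;
            this.files = files;
        }
    }

    // A pending write: the current table state plus the newly written files.
    static Snap append(Set<String> current, String... newFiles) {
        Set<String> files = new LinkedHashSet<>(current);
        for (String f : newFiles) {
            files.add(f);
        }
        return new Snap(current, files);
    }

    // Cherry-pick: work out what the pending snapshot added relative to its
    // parent, then re-apply only those additions on top of a different base.
    static Set<String> cherryPick(Set<String> base, Snap pending) {
        Set<String> added = new LinkedHashSet<>(pending.files);
        added.removeAll(pending.parentFiles);
        Set<String> result = new LinkedHashSet<>(base);
        result.addAll(added);
        return result;
    }

    public static void main(String[] args) {
        Set<String> current = new LinkedHashSet<>();
        current.add("a.parquet");

        // Two pending writes, both based on the same current state.
        Snap w1 = append(current, "b.parquet");
        Snap w2 = append(current, "c.parquet");

        // Publishing w1 and then simply moving to w2 drops w1's file.
        System.out.println("move to w2:  " + w2.files);
        // Cherry-picking w2 on top of w1 keeps the changes from both writes.
        System.out.println("cherry-pick: " + cherryPick(w1.files, w2));
    }
}
```

In this model, moving the table to the second pending snapshot loses the first 
write because each snapshot only contains its own parent's files, while 
cherry-picking replays the second write's additions on top of the first.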

On Fri, Nov 8, 2019 at 3:29 PM Ashish Mehta <[email protected]> wrote:
Thanks Ryan, that worked. Since it's a rollback, I wonder how a user can stage 
multiple WAP snapshots and commit them in any order, depending on how the 
audit process works out.
I wonder whether this expectation goes against the underlying principles of 
Iceberg.

Thanks,
Ashish

On Fri, Nov 8, 2019 at 2:44 PM Ryan Blue <[email protected]> wrote:

Ashish, you can use the rollback table operation to set a particular snapshot 
as the current table state. Like this:

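// Load the table from the catalog.
Table table = hiveCatalog.load(name);

// Make the staged snapshot (by its ID) the current table state.
table.rollback().toSnapshotId(id).commit();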

On Fri, Nov 8, 2019 at 12:52 PM Ashish Mehta <[email protected]> wrote:
Hi Ryan,

Can you please point me to the doc that explains how to publish a WAP 
snapshot? I am able to filter the snapshots based on wap.id in the snapshot 
summary, but I don't know the officially recommended way to commit that 
snapshot. I can think of cherry-picking the appended/deleted files, but I am 
worried about missing something important with that approach.

Thanks,
-Ashish

---------- Forwarded message ---------
From: Ryan Blue <[email protected]>
Date: Wed, Jul 31, 2019 at 4:41 PM
Subject: Re: [DISCUSS] Write-audit-publish support
To: Edgar Rodriguez <[email protected]>
Cc: Iceberg Dev List <[email protected]>, Anton Okolnychyi <[email protected]>

Hi everyone, I've added PR #342 
<https://github.com/apache/incubator-iceberg/pull/342> to the Iceberg 
repository with our WAP changes. Please have a look if you are interested in 
this.

On Mon, Jul 22, 2019 at 11:05 AM Edgar Rodriguez <[email protected]> wrote:
I think this use case is pretty helpful in most data environments; we do the 
same sort of stage-check-publish pattern to run quality checks.
One question: if the audit part fails, is there a way to expire the snapshot, 
or what would the follow-up workflow be?

Best,
Edgar

On Mon, Jul 22, 2019 at 9:59 AM Mouli Mukherjee <[email protected]> wrote:
This would be super helpful. We have a similar workflow where we do some 
validation before letting the downstream consume the changes.

Best,
Mouli

On Mon, Jul 22, 2019 at 9:18 AM Filip <[email protected]> wrote:
This definitely sounds interesting. Quick question: does this have any impact 
on the current Upserts spec? Or are we looking to support this only for 
append-only commits?

On Mon, Jul 22, 2019 at 6:51 PM Ryan Blue <[email protected]> wrote:

Audits run on the snapshot by setting the snapshot-id read option to read the 
WAP snapshot, even though it has not (yet) become the current table state. This 
is documented in the time travel 
<http://iceberg.apache.org/spark/#time-travel> section of the Iceberg site.

We added a stageOnly method to SnapshotProducer that adds the snapshot to table 
metadata, but does not make it the current table state. That is called by the 
Spark writer when there is a WAP ID, and that ID is embedded in the staged 
snapshot’s metadata so processes can find it.
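
The staging and lookup steps described above can be sketched with a small toy 
model in plain Java. This is not the code from the Netflix branch or the 
Iceberg API: StagedSnapshotSketch, TableMeta, Snap, findByWapId, and publish 
are hypothetical names, and only the "wap.id" summary key is taken from this 
thread.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Toy model only: hypothetical names, not the Iceberg implementation.
class StagedSnapshotSketch {

    static final class Snap {
        final long id;
        final Map<String, String> summary;

        Snap(long id, Map<String, String> summary) {
            this.id = id;
            this.summary = summary;
        }
    }

    static final class TableMeta {
        final List<Snap> snapshots = new ArrayList<>();
        Long currentSnapshotId = null; // null until something is committed

        // Normal commit: record the snapshot and make it the current state.
        void commit(Snap s) {
            snapshots.add(s);
            currentSnapshotId = s.id;
        }

        // Staged commit: record the snapshot but leave the current pointer alone.
        void stageOnly(Snap s) {
            snapshots.add(s);
        }

        // An audit process locates the staged snapshot by the WAP ID in its summary.
        Optional<Snap> findByWapId(String wapId) {
            return snapshots.stream()
                .filter(s -> wapId.equals(s.summary.get("wap.id")))
                .findFirst();
        }

        // Publish by pointing the table at the audited snapshot.
        void publish(long snapshotId) {
            currentSnapshotId = snapshotId;
        }
    }

    public static void main(String[] args) {
        TableMeta table = new TableMeta();
        table.commit(new Snap(1L, new HashMap<>()));

        Map<String, String> summary = new HashMap<>();
        summary.put("wap.id", "job-42"); // example WAP ID, made up for this sketch
        table.stageOnly(new Snap(2L, summary));

        // Readers still see snapshot 1; the audit finds snapshot 2 by its WAP ID.
        Snap staged = table.findByWapId("job-42").orElseThrow(IllegalStateException::new);
        System.out.println("current=" + table.currentSnapshotId + " staged=" + staged.id);

        // After the audit passes, the staged snapshot becomes the table state.
        table.publish(staged.id);
        System.out.println("current=" + table.currentSnapshotId);
    }
}
```

The point of the model is that a staged snapshot is part of the table metadata, 
so it can be read by ID for auditing, while readers of the current state are 
unaffected until the publish step.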

I'll add a PR with this code, since there is interest.

rb

On Mon, Jul 22, 2019 at 2:17 AM Anton Okolnychyi <[email protected]> wrote:
I would also support adding this to Iceberg itself. I think we have a use case 
where we can leverage this.

@Ryan, could you also provide more info on the audit process?

Thanks,
Anton


On 20 Jul 2019, at 04:01, RD <[email protected]> wrote:

I think this could be useful. When we ingest data from Kafka, we do a 
predefined set of checks on the data. We can potentially utilize something like 
this to check for sanity before publishing.

How is the auditing process supposed to find the new snapshot, since it is not 
accessible from the table? Is it by convention?

-R

On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue <[email protected]> wrote:

Hi everyone,

At Netflix, we have a pattern for building ETL jobs where we write data, then 
audit the result before publishing the data that was written to a final table. 
We call this WAP for write, audit, publish.

We’ve added support in our Iceberg branch. A WAP write creates a new table 
snapshot, but doesn’t make that snapshot the current version of the table. 
Instead, a separate process audits the new snapshot and updates the table’s 
current snapshot when the audits succeed. I wasn’t sure that this would be 
useful anywhere else until we talked to another company this week that is 
interested in the same thing. So I wanted to check whether this is a good 
feature to include in Iceberg itself.

This works by staging a snapshot. Basically, Spark writes data as expected, but 
Iceberg detects that it should not update the table’s current state. That 
happens when there is a Spark property, spark.wap.id, that indicates the job is 
a WAP job. Then any table that has WAP enabled by the table property 
write.wap.enabled=true will stage the new snapshot instead of fully committing, 
with the WAP ID in the snapshot’s metadata.
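
As a rough illustration of the rule above, the staging decision can be written 
as a single check in plain Java. The class WapDecisionSketch and the method 
shouldStage are hypothetical, the two configurations are modeled as plain 
maps, and treating a missing write.wap.enabled as false is an assumption of 
this sketch; only the two property names come from the description above.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: hypothetical helper, not the Iceberg or Spark API.
class WapDecisionSketch {

    // Stage the snapshot only when the job carries a WAP ID AND the table
    // has opted in. A missing table property is treated as disabled
    // (an assumption of this sketch).
    static boolean shouldStage(Map<String, String> sparkConf,
                               Map<String, String> tableProps) {
        String wapId = sparkConf.get("spark.wap.id");
        boolean enabled = Boolean.parseBoolean(
            tableProps.getOrDefault("write.wap.enabled", "false"));
        return wapId != null && enabled;
    }

    public static void main(String[] args) {
        Map<String, String> sparkConf = new HashMap<>();
        sparkConf.put("spark.wap.id", "job-42"); // example WAP ID, made up here

        Map<String, String> tableProps = new HashMap<>();
        tableProps.put("write.wap.enabled", "true");

        System.out.println("stage instead of commit: "
            + shouldStage(sparkConf, tableProps));
    }
}
```

Requiring both settings means a WAP job writing to a table that has not opted 
in commits normally, and a regular job writing to a WAP-enabled table is 
unaffected.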

Is this something we should open a PR to add to Iceberg? It seems a little 
strange to make it appear that a commit has succeeded, but not actually change 
a table, which is why we didn’t submit it before now.

Thanks,

rb
--
Ryan Blue
Software Engineer
Netflix



--
Filip Bocse


--
Edgar Rodriguez

