Ashish, you can use the rollback table operation to set a particular snapshot as the current table state. Like this:
Table table = hiveCatalog.loadTable(name);
table.rollback().toSnapshotId(id).commit();

On Fri, Nov 8, 2019 at 12:52 PM Ashish Mehta <mehta.ashis...@gmail.com> wrote:

> Hi Ryan,
>
> Can you please help point me to the doc where I can find how to publish a
> WAP snapshot? I am able to filter the snapshot based on the wap.id in the
> snapshot's summary, but I am unsure of the official recommendation for
> committing that snapshot. I can think of cherry-picking the appended/deleted
> files, but I don't know whether I would be missing something important.
>
> Thanks,
> -Ashish
>
>
>> ---------- Forwarded message ---------
>> From: Ryan Blue <rb...@netflix.com.invalid>
>> Date: Wed, Jul 31, 2019 at 4:41 PM
>> Subject: Re: [DISCUSS] Write-audit-publish support
>> To: Edgar Rodriguez <edgar.rodrig...@airbnb.com>
>> Cc: Iceberg Dev List <dev@iceberg.apache.org>, Anton Okolnychyi <aokolnyc...@apple.com>
>>
>>
>> Hi everyone, I've added PR #342
>> <https://github.com/apache/incubator-iceberg/pull/342> to the Iceberg
>> repository with our WAP changes. Please have a look if you were interested
>> in this.
>>
>> On Mon, Jul 22, 2019 at 11:05 AM Edgar Rodriguez <edgar.rodrig...@airbnb.com> wrote:
>>
>>> I think this use case is pretty helpful in most data environments; we do
>>> the same sort of stage-check-publish pattern to run quality checks.
>>> One question: if, say, the audit part fails, is there a way to expire
>>> the snapshot, or what would be the workflow that follows?
>>>
>>> Best,
>>> Edgar
>>>
>>> On Mon, Jul 22, 2019 at 9:59 AM Mouli Mukherjee <moulimukher...@gmail.com> wrote:
>>>
>>>> This would be super helpful. We have a similar workflow where we do
>>>> some validation before letting the downstream consume the changes.
>>>>
>>>> Best,
>>>> Mouli
>>>>
>>>> On Mon, Jul 22, 2019 at 9:18 AM Filip <filip....@gmail.com> wrote:
>>>>
>>>>> This definitely sounds interesting. Quick question on whether this
>>>>> presents any impact on the current Upserts spec?
>>>>> Or is it maybe that we are
>>>>> looking to associate this support with append-only commits?
>>>>>
>>>>> On Mon, Jul 22, 2019 at 6:51 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>>
>>>>>> Audits run on the snapshot by setting the snapshot-id read option to
>>>>>> read the WAP snapshot, even though it has not (yet) been made the current
>>>>>> table state. This is documented in the time travel
>>>>>> <http://iceberg.apache.org/spark/#time-travel> section of the
>>>>>> Iceberg site.
>>>>>>
>>>>>> We added a stageOnly method to SnapshotProducer that adds the
>>>>>> snapshot to table metadata but does not make it the current table state.
>>>>>> That is called by the Spark writer when there is a WAP ID, and that ID is
>>>>>> embedded in the staged snapshot’s metadata so processes can find it.
>>>>>>
>>>>>> I'll add a PR with this code, since there is interest.
>>>>>>
>>>>>> rb
>>>>>>
>>>>>> On Mon, Jul 22, 2019 at 2:17 AM Anton Okolnychyi <aokolnyc...@apple.com> wrote:
>>>>>>
>>>>>>> I would also support adding this to Iceberg itself. I think we have
>>>>>>> a use case where we can leverage this.
>>>>>>>
>>>>>>> @Ryan, could you also provide more info on the audit process?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Anton
>>>>>>>
>>>>>>> On 20 Jul 2019, at 04:01, RD <rdsr...@gmail.com> wrote:
>>>>>>>
>>>>>>> I think this could be useful. When we ingest data from Kafka, we do
>>>>>>> a predefined set of checks on the data. We can potentially utilize
>>>>>>> something like this to check for sanity before publishing.
>>>>>>>
>>>>>>> How is the auditing process supposed to find the new snapshot, since
>>>>>>> it is not accessible from the table? Is it by convention?
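[The lookup described above — a process finding a staged snapshot by the WAP ID embedded in its summary metadata — can be sketched as below. This is a minimal, self-contained illustration: the `WapLookup` class and the plain `Map` stand-in for snapshot summaries are hypothetical, not Iceberg API; in Iceberg itself you would iterate the table's snapshots and inspect each snapshot's summary map the same way.]

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of the "find the staged snapshot by convention" step:
// scan snapshot summaries for a "wap.id" entry matching the job's WAP ID.
public class WapLookup {
    static Long findWapSnapshot(Map<Long, Map<String, String>> summaries, String wapId) {
        for (Map.Entry<Long, Map<String, String>> e : summaries.entrySet()) {
            if (wapId.equals(e.getValue().get("wap.id"))) {
                return e.getKey();  // snapshot ID whose summary carries the WAP ID
            }
        }
        return null;  // no staged snapshot for this WAP ID
    }

    public static void main(String[] args) {
        // Hypothetical summaries: snapshot 100 is a normal commit,
        // snapshot 200 was staged by a WAP job with ID "etl-2019-11-08".
        Map<Long, Map<String, String>> summaries = new LinkedHashMap<>();
        summaries.put(100L, Map.of("operation", "append"));
        summaries.put(200L, Map.of("operation", "append", "wap.id", "etl-2019-11-08"));

        System.out.println(findWapSnapshot(summaries, "etl-2019-11-08"));  // prints 200
    }
}
```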
>>>>>>>
>>>>>>> -R
>>>>>>>
>>>>>>> On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> At Netflix, we have a pattern for building ETL jobs where we write
>>>>>>>> data, then audit the result before publishing the data that was written to
>>>>>>>> a final table. We call this WAP, for write, audit, publish.
>>>>>>>>
>>>>>>>> We’ve added support in our Iceberg branch. A WAP write creates a
>>>>>>>> new table snapshot, but doesn’t make that snapshot the current version of
>>>>>>>> the table. Instead, a separate process audits the new snapshot and updates
>>>>>>>> the table’s current snapshot when the audits succeed. I wasn’t sure that
>>>>>>>> this would be useful anywhere else until we talked to another company this
>>>>>>>> week that is interested in the same thing. So I wanted to check whether
>>>>>>>> this is a good feature to include in Iceberg itself.
>>>>>>>>
>>>>>>>> This works by staging a snapshot. Basically, Spark writes data as
>>>>>>>> expected, but Iceberg detects that it should not update the table’s current
>>>>>>>> state. That happens when there is a Spark property, spark.wap.id, that
>>>>>>>> indicates the job is a WAP job. Then any table that has WAP enabled by the
>>>>>>>> table property write.wap.enabled=true will stage the new snapshot instead
>>>>>>>> of fully committing, with the WAP ID in the snapshot’s metadata.
>>>>>>>>
>>>>>>>> Is this something we should open a PR to add to Iceberg? It seems a
>>>>>>>> little strange to make it appear that a commit has succeeded but not
>>>>>>>> actually change the table, which is why we didn’t submit it before now.
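[A sketch of the two settings named in the proposal above. This is not a verified recipe: it assumes an Iceberg `Table` object named `table` and a `SparkSession` named `spark` are already in scope, and the WAP ID string is a placeholder chosen by the job.]

```java
// 1. Opt the table into WAP staging via the table property from the proposal.
//    UpdateProperties is the Iceberg API for changing table properties.
table.updateProperties()
    .set("write.wap.enabled", "true")
    .commit();

// 2. Mark the Spark job as a WAP job. With this property set, a write to a
//    WAP-enabled table stages the new snapshot (recording this ID in the
//    snapshot's summary) instead of making it the current table state.
spark.conf().set("spark.wap.id", "etl-2019-11-08");
```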
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> rb
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Software Engineer
>>>>>>>> Netflix
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>
>>>>>
>>>>> --
>>>>> Filip Bocse
>>>>
>>>
>>> --
>>> Edgar Rodriguez
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>

--
Ryan Blue
Software Engineer
Netflix
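[Putting the reply at the top of the thread together with the wap.id convention discussed here, the publish step could look roughly like this. A sketch only: it assumes a `HiveCatalog` named `hiveCatalog`, a `TableIdentifier` named `name`, and the WAP ID used by the staging job; the API calls are those named in the thread and are not re-verified against a particular Iceberg release.]

```java
// Load the table through the catalog.
Table table = hiveCatalog.loadTable(name);

// Find the staged snapshot carrying our WAP ID in its summary metadata.
Long staged = null;
for (Snapshot snap : table.snapshots()) {
    if ("etl-2019-11-08".equals(snap.summary().get("wap.id"))) {
        staged = snap.snapshotId();
    }
}

// Publish: make the audited snapshot the current table state.
if (staged != null) {
    table.rollback().toSnapshotId(staged).commit();
}
```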