Re: Updates/Deletes/Upserts in Iceberg

2019-05-10 Thread Erik Wright
Thanks for putting this forward. Another term for the "lazy" approach would be "merge on read". My team has built something internally that uses merge-on-read but performs an "eager" materialization for publication to Presto. Roughly, we maintain a table metadata file that looks a bit like

Re: Updates/Deletes/Upserts in Iceberg

2019-05-16 Thread Erik Wright
r insert by mapping the incoming >> keys to existing files for updates. At this point, it seems to rely on join. >> >> Is my understanding correct? If so, do we want to consider joins on >> write? We mentioned this technique as one way to ensure the uniqueness of >> natur

Re: Updates/Deletes/Upserts in Iceberg

2019-05-21 Thread Erik Wright
On Thu, May 16, 2019 at 4:13 PM Ryan Blue wrote: > Replies inline. > > On Thu, May 16, 2019 at 10:07 AM Erik Wright > wrote: > >> I would be happy to participate. Iceberg with merge-on-read capabilities >> is a technology choice that my team is actively consider

Re: Updates/Deletes/Upserts in Iceberg

2019-05-21 Thread Erik Wright
oncurrency that are supported (I'm not sure about that, really, just going off of your comment above). > -- > Jacques Nadeau > CTO and Co-Founder, Dremio > > > On Tue, May 21, 2019 at 7:54 AM Erik Wright > wrote: > >> On Thu, May 16, 2019 at 4:13 PM Ryan Blue wrote: >

Re: Updates/Deletes/Upserts in Iceberg

2019-05-22 Thread Erik Wright
> We have two rows with the same natural key and we use that natural key in
> diff files:
> nk | col1 | col2
> 1  | 1    | 1
> 1  | 2    | 2
> Then we have a delete statement:
> DELETE FROM t WHERE col1 = 1

I think this example cuts to the point of the differences of understanding. Does Iceberg want to
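
A minimal sketch of why the quoted example is ambiguous, assuming two hypothetical ways of recording the delete: by natural key, or by (file name, row position). Neither is presented here as Iceberg's actual design; the file name `file-a` and all variable names are invented for illustration.

```python
# Hypothetical data file holding the two rows from the example: (nk, col1, col2).
files = {"file-a": [(1, 1, 1), (1, 2, 2)]}


def matches(row):
    """Predicate for: DELETE FROM t WHERE col1 = 1."""
    return row[1] == 1


# Assumed option 1: the delete is recorded as a set of natural keys.
deleted_keys = {row[0] for rows in files.values() for row in rows if matches(row)}
by_key = [row for rows in files.values() for row in rows if row[0] not in deleted_keys]

# Assumed option 2: the delete is recorded as (file name, row position) pairs.
deleted_positions = {
    (name, pos)
    for name, rows in files.items()
    for pos, row in enumerate(rows)
    if matches(row)
}
by_position = [
    row
    for name, rows in files.items()
    for pos, row in enumerate(rows)
    if (name, pos) not in deleted_positions
]

print(by_key)       # [] : both rows share nk = 1, so both are removed
print(by_position)  # [(1, 2, 2)] : only the row that matched the predicate is removed
```

Under these assumptions the key-based record removes a row the statement never matched, while the position-based record removes exactly the matching row.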

Re: Updates/Deletes/Upserts in Iceberg

2019-05-22 Thread Erik Wright
other constraints (that might not be appropriate to all applications). Myself and Miguel are out on Friday, but Anton should be able to handle the > discussion on our side. > > > Thanks, > Cristian > > > On 22 May 2019, at 17:51, Erik Wright > wrote: > > We have two

Manifest List Files

2019-06-03 Thread Erik Wright
In the process of following up on the "Updates/Deletes/Upserts" thread, I'm re-reading the table spec. I have a question about Manifest List files. If I understand correctly, the manifest list files are separate files that are created prior to attempting to commit a new snapshot. Each snapshot may

Re: Manifest List Files

2019-06-03 Thread Erik Wright
a new PR out to rewrite > manifests to take advantage of this: > https://github.com/apache/incubator-iceberg/pull/200/files > > Does that answer your question? > > On Mon, Jun 3, 2019 at 1:38 PM Erik Wright > wrote: > >> In the process of following up on the "Updates

Re: Manifest List Files

2019-06-03 Thread Erik Wright
ation in the root metadata file. > > On Mon, Jun 3, 2019 at 2:13 PM Erik Wright > wrote: > >> Thanks for the response, Ryan. I can certainly see what the benefits of >> manifest files are. I can see that with potentially long lists of valid >> snapshots, each having long

Re: Updates/Deletes/Upserts in Iceberg

2019-06-07 Thread Erik Wright
>>>>> that uniqueness cannot be enforced in Iceberg. >>>>> >>>>> If uniqueness can’t be enforced in Iceberg, the main choice comes down >>>>> to how we identify rows that are deleted. If we use (filename, position) >>>>> then we kno

Re: Updates/Deletes/Upserts in Iceberg

2019-06-07 Thread Erik Wright
re care in implementation to handle > some corner cases consistently. > > Let me know if I got the gist of your proposal right ? > > Thanks, > Cristian > > > On 7 Jun 2019, at 19:47, Ryan Blue wrote: > > Thanks, Erik! Great to see progress here. I'll set a

Re: Updates/Deletes/Upserts in Iceberg

2019-06-12 Thread Erik Wright
t snapshot). Can you take a stab at responding to the areas of confusion I highlighted above? Let me know if I need to do some more writing/drawing to describe the incremental read concerns. Yes, after another round or two like this it would likely make sense to have another video conference. At the

Re: Updates/Deletes/Upserts in Iceberg

2019-06-17 Thread Erik Wright
> Finally, Iceberg relies on regular metadata maintenance — manifest > compaction — to reduce both write volume and the number of file reads > needed to plan a scan. Snapshots also reuse metadata files from previous > snapshots to reduce write volume. > > On Wed, Jun 12, 201

Re: Updates/Deletes/Upserts in Iceberg

2019-06-17 Thread Erik Wright
applied to the effective dataset, a consumer can read the entire dataset as of any version or incrementally observe changes to the dataset from version to version. And the same API can be used to consume the dataset regardless of which method was used to produce it. Thanks, > Anton > > On 17 Ju

Re: Updates/Deletes/Upserts in Iceberg

2019-06-17 Thread Erik Wright
On Mon, Jun 17, 2019 at 5:51 PM Erik Wright wrote: > That's a really insightful summary of the different proposals that have > been made. My reading of Ryan's suggestion pointed to ways that both > approaches can be supported. The precise mechanisms available > for re

Re: Updates/Deletes/Upserts in Iceberg

2019-06-20 Thread Erik Wright
On Wed, Jun 19, 2019 at 6:39 PM Ryan Blue wrote: > Replies inline. > > On Mon, Jun 17, 2019 at 10:59 AM Erik Wright > erik.wri...@shopify.com.invalid wrote: > > Because snapshots and versions are basically the same i

Re: Updates/Deletes/Upserts in Iceberg

2019-06-20 Thread Erik Wright
> I don’t think that the other approach inhibits aging off data, it just > represents the deletion of the data differently. > > True, but if we can preserve the existing functionality then this is a > quick operation. Usually, these file-level deletes align with partitioning > so

Re: Updates/Deletes/Upserts in Iceberg

2019-06-20 Thread Erik Wright
On Thu, Jun 20, 2019 at 3:26 PM Erik Wright wrote: > > > On Thu, Jun 20, 2019 at 12:57 PM Ryan Blue wrote: > >> Sounds like we’re in agreement on the direction! Let’s have a sync up >> sometime next week to make sure we are agreed and plan some of this work. >> Wh

Re: Updates/Deletes/Upserts in Iceberg

2019-06-21 Thread Erik Wright
With regards to operation values. Currently they are:
- append: data files were added and no files were removed.
- replace: data files were rewritten with the same data; i.e., compaction, changing the data file format, or relocating data files.
- overwrite: data files were deleted and

Re: Updates/Deletes/Upserts in Iceberg

2019-07-03 Thread Erik Wright
ls. I am available on Thursday/Friday this week and would be great >>>> to sync. >>>> >>>> Thanks, >>>> Anton >>>> >>>> On 3 Jul 2019, at 01:29, Ryan Blue wrote: >>>> >>>> Sorry I didn'

Re: Updates/Deletes/Upserts in Iceberg

2019-07-03 Thread Erik Wright
at 2:44 PM Owen O'Malley wrote: > It works for me too. > > .. Owen > > On Jul 3, 2019, at 11:27, Anton Okolnychyi > wrote: > > Works for me too. > > On 3 Jul 2019, at 19:09, Erik Wright > wrote: > > That works for me. > > On Wed, Jul 3, 2019

Re: Row-level delete sync notes - July 2019

2019-08-07 Thread Erik Wright
at 3:47 PM Ryan Blue wrote: > Hi everyone, sorry it took a while for me to get these notes sent out. > Please reply with discussion or corrections. > > *Attendees*: > > Ryan Blue > Anjali Norwood > Jacques Nadeau > Anton Okolnychyi > David Muto > Erik Wright > Owen

Re: Proposal - Priority based commit ordering on partitions

2022-10-13 Thread Erik Wright
> > Yes, the main issue we are trying to solve is the conflicts happening > between maintenance processes and other writes. > Please be a bit more specific. As you highlight in your proposal, appending new data to a dataset can be retried after a conflict. And replacing some files can also be retr

Presto Partitioning question

2018-11-27 Thread Erik Wright
> > *Presto Integration* >> >> If I understand correctly, there is no current support for reading >> Iceberg data directly from Presto. I imagine, however, that as long as your >> partition specifications are restricted to the `identity` transform it >> should be "easy" to consume an Iceberg table

Re: Status of Spark Integration, Questions

2018-11-27 Thread Erik Wright
> *Upserts/Deletes*
>>
>> I have jobs that apply upserts/deletes to datasets. My current approach is:
>>
>>    1. calculate the affected partitions (collected in the Driver)
>>    2. load up the previous versions of all of those partitions as a DataFrame
>>    3. apply the upserts/delete
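
The three steps quoted above can be sketched in plain Python, with dictionaries standing in for DataFrames. This is only an illustration under stated assumptions: the function name, the `(partition, key, row)` layout, and the choice to apply deletes before upserts are all hypothetical, and whatever follows step 3 in the original message is truncated and not modeled here.

```python
def rewrite_partitions(table, upserts, deletes):
    """Return new contents for the affected partitions only.

    table:   {partition: {key: row}}       (hypothetical layout)
    upserts: [(partition, key, row), ...]
    deletes: [(partition, key), ...]
    """
    # Step 1: calculate the affected partitions.
    affected = {p for p, _, _ in upserts} | {p for p, _ in deletes}

    rewritten = {}
    for partition in affected:
        # Step 2: load the previous version of the partition (copied, not mutated).
        current = dict(table.get(partition, {}))

        # Step 3: apply the deletes, then the upserts (ordering is an assumption).
        for p, key in deletes:
            if p == partition:
                current.pop(key, None)
        for p, key, row in upserts:
            if p == partition:
                current[key] = row

        rewritten[partition] = current
    return rewritten


table = {"2018-11-26": {1: "a", 2: "b"}, "2018-11-27": {3: "c"}}
new = rewrite_partitions(
    table,
    upserts=[("2018-11-26", 2, "B"), ("2018-11-26", 4, "d")],
    deletes=[("2018-11-26", 1)],
)
print(new)  # {'2018-11-26': {2: 'B', 4: 'd'}}
```

Only the touched partition is rewritten; the untouched `2018-11-27` partition does not appear in the result, which is the point of computing the affected set first.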

merge-on-read?

2018-11-27 Thread Erik Wright
Has any consideration been given to the possibility of eventual merge-on-read support in the Iceberg table spec?

Re: merge-on-read?

2018-11-28 Thread Erik Wright
in building delete and upsert > > features. Those would create files that track the changes, which would be > > merged at read time to apply them. Is that what you mean? > > > > rb > > > > On Tue, Nov 27, 2018 at 12:26 PM Erik Wright > > wrote: > > > >> Has any consideration been given to the possibility of eventual > >> merge-on-read support in the Iceberg table spec? > >> > > > > > > -- > > Ryan Blue > > Software Engineer > > Netflix > >

Re: merge-on-read?

2018-11-28 Thread Erik Wright
allows the schema of your insert files to be consistent with the dataset schema (with respect to nullability). The delete optimization sounds clever. .. Owen > > On Wed, Nov 28, 2018 at 1:14 PM Erik Wright > wrote: > > > Those are both really neat use cases, but th

Re: merge-on-read?

2018-11-30 Thread Erik Wright
Hi Ryan, Owen, Just following up on this question. Implemented properly, do you see any reason that a series of PRs to implement merge-on-read support wouldn't be welcomed? Thanks, Erik On Wed., Nov. 28, 2018, 5:25 p.m. Erik Wright > > On Wed, Nov 28, 2018 at 4:32 PM Owen O'

Re: merge-on-read?

2018-11-30 Thread Erik Wright
is sort of thing already, so there are other > people that could collaborate on it. > > On Fri, Nov 30, 2018 at 6:23 PM Erik Wright > wrote: > > > Hi Ryan, Owen, > > > > Just following up on this question. Implemented properly, do you see any > >

Re: merge-on-read?

2018-12-07 Thread Erik Wright
nterested in building delete and upsert > > features. Those would create files that track the changes, which would be > > merged at read time to apply them. Is that what you mean? > > > > rb > > > > On Tue, Nov 27, 2018 at 12:26 PM Erik Wright > > wrote: >