Thanks for putting this forward.
Another term for the "lazy" approach would be "merge on read".
My team has built something internally that uses merge-on-read
but uses an "Eager" materialization for publication to Presto. Roughly, we
maintain a table metadata file that looks a bit like
r insert by mapping the incoming
>> keys to existing files for updates. At this point, it seems to rely on join.
>>
>> Is my understanding correct? If so, do we want to consider joins on
>> write? We mentioned this technique as one way to ensure the uniqueness of
>> natur
On Thu, May 16, 2019 at 4:13 PM Ryan Blue wrote:
> Replies inline.
>
> On Thu, May 16, 2019 at 10:07 AM Erik Wright
> wrote:
>
>> I would be happy to participate. Iceberg with merge-on-read capabilities
>> is a technology choice that my team is actively consider
oncurrency that are
supported (I'm not sure about that, really, just going off of your comment
above).
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
>
> On Tue, May 21, 2019 at 7:54 AM Erik Wright
> wrote:
>
>> On Thu, May 16, 2019 at 4:13 PM Ryan Blue wrote:
>
>
> We have two rows with the same natural key and we use that natural key in
> different files:
> nk | col1 | col2
> 1 | 1 | 1
> 1 | 2 | 2
> Then we have a delete statement:
> DELETE FROM t WHERE col1 = 1
I think this example cuts to the heart of the difference in understanding.
Does Iceberg want to
other constraints (that might not be appropriate to all
applications).
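The example above can be sketched in Python to show how the two interpretations diverge. This is a toy model, not anything in Iceberg's API; the row dicts and function names are hypothetical. A predicate-based delete removes only the rows matching the predicate, while a natural-key-based delete removes every row sharing a key with a matched row.

```python
# Two rows sharing natural key nk=1, as in the example above.
rows = [
    {"nk": 1, "col1": 1, "col2": 1},
    {"nk": 1, "col1": 2, "col2": 2},
]

def delete_by_predicate(rows, pred):
    """DELETE FROM t WHERE <pred>: drop only rows matching the predicate."""
    return [r for r in rows if not pred(r)]

def delete_by_natural_key(rows, pred, key="nk"):
    """Key-based delete: drop every row whose natural key matches
    any row selected by the predicate."""
    doomed = {r[key] for r in rows if pred(r)}
    return [r for r in rows if r[key] not in doomed]

pred = lambda r: r["col1"] == 1
# Predicate semantics keep the second row; key semantics drop both.
```

Under predicate semantics one row survives; under natural-key semantics the table ends up empty, which is exactly the ambiguity the example is probing.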
Miguel and I are out on Friday, but Anton should be able to handle the
> discussion on our side.
>
>
> Thanks,
> Cristian
>
>
> On 22 May 2019, at 17:51, Erik Wright
> wrote:
>
> We have two
In the process of following up on the "Updates/Deletes/Upserts" thread, I'm
re-reading the table spec. I have a question about Manifest List files.
If I understand correctly, the manifest list files are separate files that
are created prior to attempting to commit a new snapshot. Each snapshot may
a new PR out to rewrite
> manifests to take advantage of this:
> https://github.com/apache/incubator-iceberg/pull/200/files
>
> Does that answer your question?
>
> On Mon, Jun 3, 2019 at 1:38 PM Erik Wright
> wrote:
>
>> In the process of following up on the "Updates
ation in the root metadata file.
>
> On Mon, Jun 3, 2019 at 2:13 PM Erik Wright
> wrote:
>
>> Thanks for the response, Ryan. I can certainly see the benefits of
>> manifest files are. I can see that with potentially long lists of valid
>> snapshots, each having long
>>>>> that uniqueness cannot be enforced in Iceberg.
>>>>>
>>>>> If uniqueness can’t be enforced in Iceberg, the main choice comes down
>>>>> to how we identify rows that are deleted. If we use (filename, position)
>>>>> then we kno
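A minimal sketch of the (filename, position) approach mentioned above, with hypothetical names and data files modeled as in-memory lists: each delete record is a (filename, position) pair, and the reader skips those positions while scanning.

```python
# Data files: filename -> list of rows (a row's position is its list index).
data_files = {
    "part-00000.parquet": ["a", "b", "c"],
    "part-00001.parquet": ["d", "e"],
}

# A position-delete file: (filename, row position) pairs.
position_deletes = [("part-00000.parquet", 1), ("part-00001.parquet", 0)]

def merged_read(data_files, position_deletes):
    """Merge-on-read: yield every live row, skipping deleted positions."""
    deleted = set(position_deletes)
    for fname, rows in data_files.items():
        for pos, row in enumerate(rows):
            if (fname, pos) not in deleted:
                yield row

# list(merged_read(data_files, position_deletes)) -> ["a", "c", "e"]
```

The appeal of this scheme is that applying a delete needs no key comparison at read time, only a set-membership check on positions.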
re care in implementation to handle
> some corner cases consistently.
>
> Let me know if I got the gist of your proposal right?
>
> Thanks,
> Cristian
>
>
> On 7 Jun 2019, at 19:47, Ryan Blue wrote:
>
> Thanks, Erik! Great to see progress here. I'll set a
t snapshot).
Can you take a stab at responding to the areas of confusion I highlighted
above? Let me know if I need to do some more writing/drawing to describe
the incremental read concerns. Yes, after another round or two like this it
would likely make sense to have another video conference. At the
> Finally, Iceberg relies on regular metadata maintenance — manifest
> compaction — to reduce both write volume and the number of file reads
> needed to plan a scan. Snapshots also reuse metadata files from previous
> snapshots to reduce write volume.
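The manifest-compaction idea can be illustrated with a toy model (not Iceberg's actual metadata classes): many small manifests are rewritten into a few large ones, so scan planning opens fewer files.

```python
def compact_manifests(manifests, target_entries=100):
    """Rewrite many small manifests into fewer large ones.
    Each manifest is modeled as a plain list of data-file entries."""
    all_entries = [e for m in manifests for e in m]
    return [
        all_entries[i:i + target_entries]
        for i in range(0, len(all_entries), target_entries)
    ]

# 10 manifests of 30 entries each compact down to 3 manifests.
small = [[f"file-{m}-{i}" for i in range(30)] for m in range(10)]
compacted = compact_manifests(small)
```

The entry count is unchanged; only the number of metadata files a planner must read shrinks, which is the point of the maintenance step described above.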
>
> On Wed, Jun 12, 201
applied to the effective dataset, a consumer can read the
entire dataset as of any version or incrementally observe changes to the
dataset from version to version. And the same API can be used to consume
the dataset regardless of which method was used to produce it.
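The two consumption modes described above, reading the full dataset as of a version versus incrementally observing changes, can be sketched as follows. This is a toy key-value model with hypothetical function names, not a proposed API.

```python
# Toy versioned dataset: each version is the full state, key -> value.
versions = [
    {"k1": "v1"},                 # version 0
    {"k1": "v1", "k2": "v2"},     # version 1: k2 inserted
    {"k1": "v1b", "k2": "v2"},    # version 2: k1 updated
]

def read_full(versions, v):
    """Read the entire dataset as of version v."""
    return versions[v]

def read_incremental(versions, v_from, v_to):
    """Observe changes between two versions as (key, old, new) tuples;
    old or new is None for inserts and deletes respectively."""
    old, new = versions[v_from], versions[v_to]
    return sorted(
        (k, old.get(k), new.get(k))
        for k in set(old) | set(new)
        if old.get(k) != new.get(k)
    )
```

Either access path yields a consistent view of the same dataset, which is the property the paragraph above is after.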
Thanks,
> Anton
>
> On 17 Ju
On Mon, Jun 17, 2019 at 5:51 PM Erik Wright wrote:
> That's a really insightful summary of the different proposals that have
> been made. My reading of Ryan's suggestion pointed to ways that both
> approaches can be supported. The precise mechanisms available
> for re
On Wed, Jun 19, 2019 at 6:39 PM Ryan Blue wrote:
> Replies inline.
>
> On Mon, Jun 17, 2019 at 10:59 AM Erik Wright
> erik.wri...@shopify.com.invalid
> wrote:
>
> Because snapshots and versions are basically the same i
gt; I don’t think that the other approach inhibits aging off data, it just
> represents the deletion of the data differently.
>
> True, but if we can preserve the existing functionality then this is a
> quick operation. Usually, these file-level deletes align with partitioning
> so
On Thu, Jun 20, 2019 at 3:26 PM Erik Wright wrote:
>
>
> On Thu, Jun 20, 2019 at 12:57 PM Ryan Blue wrote:
>
>> Sounds like we’re in agreement on the direction! Let’s have a sync up
>> sometime next week to make sure we are agreed and plan some of this work.
>> Wh
With regard to the operation values, they are currently:
- append: data files were added and no files were removed.
- replace: data files were rewritten with the same data; i.e.,
compaction, changing the data file format, or relocating data files.
- overwrite: data files were deleted and
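A rough sketch of how a commit might be classified against those operation values; the function and its arguments are hypothetical, modeling a commit as sets of added and removed data files.

```python
def classify_operation(added, removed, rewrites_same_data=False):
    """Classify a commit using the snapshot operation values above.
    `added`/`removed` are sets of data-file names (toy model)."""
    if added and not removed:
        return "append"      # data files added, none removed
    if removed and rewrites_same_data:
        return "replace"     # same data rewritten: compaction, format change
    if removed:
        return "overwrite"   # data files deleted (possibly with replacements)
    return "noop"
```

The interesting case for this thread is which bucket a delta-producing commit would fall into, since it adds files without removing any yet logically changes the data.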
ls. I am available on Thursday/Friday this week and it would be great
>>>> to sync.
>>>>
>>>> Thanks,
>>>> Anton
>>>>
>>>> On 3 Jul 2019, at 01:29, Ryan Blue wrote:
>>>>
>>>> Sorry I didn'
at 2:44 PM Owen O'Malley wrote:
> It works for me too.
>
> .. Owen
>
> On Jul 3, 2019, at 11:27, Anton Okolnychyi
> wrote:
>
> Works for me too.
>
> On 3 Jul 2019, at 19:09, Erik Wright
> wrote:
>
> That works for me.
>
> On Wed, Jul 3, 2019
at 3:47 PM Ryan Blue wrote:
> Hi everyone, sorry it took a while for me to get these notes sent out.
> Please reply with discussion or corrections.
>
> *Attendees*:
>
> Ryan Blue
> Anjali Norwood
> Jacques Nadeau
> Anton Okolnychyi
> David Muto
> Erik Wright
> Owen
>
> Yes, the main issue we are trying to solve is the conflicts happening
> between maintenance processes and other writes.
>
Please be a bit more specific. As you highlight in your proposal, appending
new data to a dataset can be retried after a conflict. And replacing some
files can also be retr
>
> *Presto Integration*
>>
>> If I understand correctly, there is no current support for reading
>> Iceberg data directly from Presto. I imagine, however, that as long as your
>> partition specifications are restricted to the `identity` transform it
>> should be "easy" to consume an Iceberg table
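The reason the `identity` transform is the "easy" case: the partition value is the column value itself, so the layout lines up with Hive-style partitioning that Presto already understood. A toy sketch (hypothetical names):

```python
from collections import defaultdict

def identity_partition(row, column):
    """The `identity` transform: the partition value is simply the
    column value, matching Hive-style directory partitioning."""
    return row[column]

rows = [{"event_date": "2019-06-01", "id": 1},
        {"event_date": "2019-06-02", "id": 2}]

# Group rows by their identity partition value.
partitions = defaultdict(list)
for r in rows:
    partitions[identity_partition(r, "event_date")].append(r)
```

Non-identity transforms (buckets, truncation, date extraction) derive the partition value from the column, which is where a Hive-style consumer would lose the mapping.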
>
> *Upserts/Deletes*
>>
>> I have jobs that apply upserts/deletes to datasets. My current approach
>> is:
>>
>>1. calculate the affected partitions (collected in the Driver)
>>2. load up the previous versions of all of those partitions as a
>>DataFrame
>>3. apply the upserts/delete
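The steps above amount to an eager, copy-on-write upsert. A toy sketch, with a table modeled as partition -> {key: row} rather than real files, and all names hypothetical:

```python
def copy_on_write_upsert(table, upserts, deletes, key="nk", part="p"):
    """Eager (copy-on-write) upsert, following the steps above:
    1. find the partitions touched by the changes,
    2. load the previous rows of those partitions,
    3. apply deletes and upserts, then rewrite those partitions.
    `table` maps partition -> {key: row}; a stand-in for real files."""
    affected = {r[part] for r in upserts} | {r[part] for r in deletes}
    for p in affected:                                     # step 1
        old = dict(table.get(p, {}))                       # step 2
        for r in deletes:
            if r[part] == p:
                old.pop(r[key], None)                      # step 3: delete
        for r in upserts:
            if r[part] == p:
                old[r[key]] = r                            # step 3: upsert
        table[p] = old                                     # rewrite partition
    return table
```

The cost profile is the familiar one: every touched partition is rewritten in full, which is exactly what merge-on-read is meant to avoid.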
Has any consideration been given to the possibility of eventual
merge-on-read support in the Iceberg table spec?
in building delete and upsert
> > features. Those would create files that track the changes, which would be
> > merged at read time to apply them. Is that what you mean?
> >
> > rb
> >
> > On Tue, Nov 27, 2018 at 12:26 PM Erik Wright
> > wrote:
> >
> >> Has any consideration been given to the possibility of eventual
> >> merge-on-read support in the Iceberg table spec?
> >>
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
>
>
allows the
schema of your insert files to be consistent with the dataset schema (with
respect to nullability).
The delete optimization sounds clever.
.. Owen
>
> On Wed, Nov 28, 2018 at 1:14 PM Erik Wright .invalid>
> wrote:
>
> > Those are both really neat use cases, but th
Hi Ryan, Owen,
Just following up on this question. Implemented properly, do you see any
reason that a series of PRs to implement merge-on-read support wouldn't be
welcomed?
Thanks,
Erik
On Wed., Nov. 28, 2018, 5:25 p.m. Erik Wright
>
> On Wed, Nov 28, 2018 at 4:32 PM Owen O'
is sort of thing already, so there are other
> people that could collaborate on it.
>
> On Fri, Nov 30, 2018 at 6:23 PM Erik Wright .invalid>
> wrote:
>
> > Hi Ryan, Owen,
> >
> > Just following up on this question. Implemented properly, do you see any
> &
nterested in building delete and upsert
> > features. Those would create files that track the changes, which would be
> > merged at read time to apply them. Is that what you mean?
> >
> > rb
> >
> > On Tue, Nov 27, 2018 at 12:26 PM Erik Wright
> > wrote:
&