Jan, I think there are a couple of open questions on the doc. Let us discuss them on the doc for a week, then meet again if they are still open?
Thanks,
Walaa.

On Fri, Jun 7, 2024 at 12:27 PM Jan Kaul <[email protected]> wrote:

> No, that's great, thank you. I'm thankful for the input.
>
> Jan
>
> On 07.06.2024 at 17:53, Benny Chow <[email protected]> wrote:
>
> Looks good, Jan. I'm a bit nitpicky about picking good names, so I left some
> comments around that to see what others think.
>
> Thanks
>
> On Fri, Jun 7, 2024 at 2:26 AM Jan Kaul <[email protected]>
> wrote:
>
> Thanks Benny and Walaa for your input. I updated the doc
> <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?usp=sharing>
> to account for the changes as far as I understood. I would appreciate it if
> you had a look and gave me some feedback.
>
> If you have some open comments that are not relevant anymore due to the
> changes, please close them so that we can clean up the comments section a
> bit.
>
> Regards,
>
> Jan
>
> On 07.06.24 08:33, Walaa Eldin Moustafa wrote:
>
> * lineage state JSON structure
>
> On Thu, Jun 6, 2024 at 11:31 PM Walaa Eldin Moustafa <
> [email protected]> wrote:
>
> Hi Benny,
>
> Your understanding is correct.
>
> Another point that we discussed was the type of APIs engines can use to
> conveniently update the storage table with view query results as well as
> set the snapshot summary on the output snapshot (one that was produced by
> the update). We will follow up on that separately.
>
> Jan, do you want to reflect the lineage + state discussion in the doc
> so we can iterate on the lineage JSON structure?
>
> Thanks,
> Walaa.
>
>
> On Thu, Jun 6, 2024 at 9:40 PM Benny Chow <[email protected]> wrote:
>
> I really enjoyed listening to the replay and hearing everyone's feedback!
> I'm in agreement with all 3 consensus items, especially around Dan's idea
> to separate the view's query tree lineage vs materialization's lineage
> state.
>
> I'll summarize my understanding about the distinction and add a few
> comments:
>
> Materialized View's Query Tree Lineage
> - It's basically the SQL representation converted to a distinct list of
>   tables and views.
> - Stored inside view versions so if you change the view SQL, you can
>   include the lineage with that change.
> - Tables support time travel so they can optionally include a ref type and
>   name/timestamp
> - Views would NOT include the version (that's part of the materialization
>   lineage state below)
> - I think we should use fully qualified identifiers here instead of
>   UUIDs. Dropping and re-creating a referenced table or view doesn't break
>   the view SQL so the lineage should not be broken either. I also don't
>   think we can support time travel if we used table UUIDs here.
> - Each table or view can be assigned a unique sequence number. This
>   sequence number is scoped to a single view version.
>
> Materialization Lineage State
> - It's basically a lookup table from the above sequence number to either a
>   table snapshot id or view version that was used at the time of
>   creating/refreshing the storage table. For views, these are nested views
>   within the MV's query tree - not the MV itself.
> - Stored inside the table's snapshot summary
> - Additional property "refresh-version-id" to identify the MV's version.
>
> In order to validate the freshness of a materialization, everything above
> has to be checked against the latest tables and views. This should cover
> all data and query tree changes (that I can think of) such as the "limit
> 100" example I gave in Slack
> <https://apache-iceberg.slack.com/archives/C06LPRD60EL/p1717476837288479?thread_ts=1717173133.294819&cid=C06LPRD60EL>.
>
> Please let me know your thoughts.
>
> Thanks
>
> On Thu, Jun 6, 2024 at 7:53 AM <[email protected]> wrote:
>
> Thanks for hosting, it was a very helpful meeting. I really hope we can do
> more in the future to accelerate consensus on other proposals.
>
> I do encourage anyone on the mailing list to add your comments offline as
> well, especially if you have strong feelings. Iceberg is an open project
> and we realize not everyone can attend virtual meetings and want you to
> know you are welcome.
>
>
> On Jun 6, 2024, at 7:11 AM, Jan Kaul <[email protected]>
> <[email protected]> wrote:
>
> Hi all,
>
> thanks to all of you who attended the meeting yesterday! It was great to
> talk to you and I think we made great progress. For those of you who
> weren't able to attend the meeting, I summarized the main points below:
>
> *Question 1*: Should we store the "storage table pointer" as a view
> property or as an additional field in the view metadata?
>
> We reached consensus to add a *new metadata field* "storage-table" to the view
> version <https://iceberg.apache.org/view-spec/#versions> record that
> stores the identifier of the storage table. The motivation for
> introducing a new field is that this emphasizes that materialized views are
> part of the standard and it enforces a common behavior.
>
> *Question 2*: Where should the lineage-state information be stored?
>
> We reached consensus on storing the lineage-state information in the *snapshot
> summary* of the storage table. The motivation behind this is that the
> table spec should not be concerned with defining view constructs.
>
> *Question 3*: How should the lineage-state information be represented?
>
> We reached consensus on representing the lineage-state in the form of
> nested objects and storing these as a *JSON-encoded string* inside the
> storage table snapshot summary.
>
> Additionally, Dan proposed to introduce a new lineage construct as part of
> the view definition in addition to the lineage-state that is part of the
> storage table. The idea is to separate the concerns. The lineage-state in
> the storage table should only capture the state of the source tables at the
> time of the last refresh, whereas the lineage information in the view
> contains more information about the source tables and is responsible for
> resolving the identifiers. We haven't really decided on how the new lineage
> construct should be represented or integrated into the view metadata.
>
> One point that we didn't really have the time to discuss was Benny's
> comment about also storing the version-id of views in the case that the
> materialized view is referencing a view. I think we should also integrate
> that into the spec.
>
> You can find the recording of the meeting here:
>
> https://drive.google.com/file/d/1DE09tYS28L3xL_NgnM9g0Olbe6aHza5G/view?usp=sharing
>
> Best wishes,
>
> Jan
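
To make the lineage vs. lineage-state split discussed in this thread a bit more concrete, here is a minimal Python sketch of how the two pieces might fit together for a freshness check. It is only an illustration under stated assumptions, not the agreed design: the thread explicitly leaves the lineage representation undecided. Apart from the "refresh-version-id" property named in the thread, every name here (`LineageEntry`, the `"lineage-state"` summary key, `encode_lineage_state`, `is_fresh`, and the shape of the JSON) is hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class LineageEntry:
    """One table or view in the MV's query tree (lives in the view version)."""
    sequence_number: int       # unique only within a single view version
    identifier: str            # fully qualified name rather than a UUID
    kind: str                  # "table" or "view"
    ref: Optional[str] = None  # optional time-travel ref (tables only)


def encode_lineage_state(refresh_version_id: int,
                         state: Dict[int, int]) -> Dict[str, str]:
    """Build snapshot-summary properties for the storage table.

    `state` maps a lineage sequence number to the table snapshot id or
    nested-view version id used at refresh time. The state is stored as a
    JSON-encoded string; the "lineage-state" key name is an assumption.
    """
    return {
        "refresh-version-id": str(refresh_version_id),
        "lineage-state": json.dumps(
            {str(seq): value for seq, value in state.items()}, sort_keys=True
        ),
    }


def is_fresh(lineage: List[LineageEntry],
             summary: Dict[str, str],
             current_version_id: int,
             current_state: Dict[str, int]) -> bool:
    """Compare the recorded state against the latest tables and views.

    `current_state` maps an identifier to its latest snapshot id (tables)
    or version id (views), as resolved by the catalog.
    """
    # A changed MV definition (e.g. an added "limit 100") makes it stale.
    if summary.get("refresh-version-id") != str(current_version_id):
        return False
    recorded = json.loads(summary.get("lineage-state", "{}"))
    for entry in lineage:
        key = str(entry.sequence_number)
        if key not in recorded:
            return False
        if recorded[key] != current_state.get(entry.identifier):
            return False
    return True
```

For example, with a lineage of one table (`db.orders`, sequence number 1) and one nested view (`db.recent_orders`, sequence number 2), a summary built by `encode_lineage_state(3, {1: 1001, 2: 7})` is reported fresh only while the MV is still at version 3, `db.orders` is still at snapshot 1001, and the nested view is still at version 7.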
