I really enjoyed listening to the replay and hearing everyone's feedback!
I agree with all 3 consensus items, and especially with Dan's idea to
separate the view's query tree lineage from the materialization's lineage
state.

I'll summarize my understanding of the distinction and add a few
comments:

Materialized View's Query Tree Lineage
- It's basically the SQL representation converted to a distinct list of
tables and views.
- Stored inside view versions so if you change the view SQL, you can
include the lineage with that change.
- Tables support time travel, so they can optionally include a ref type and
name/timestamp.
- Views would NOT include the version (that's part of the materialization
lineage state below).
- I think we should use fully qualified identifiers here instead of UUIDs.
Dropping and re-creating a referenced table or view doesn't break the view
SQL, so the lineage should not be broken either. I also don't think we can
support time travel if we use table UUIDs here.
- Each table or view can be assigned a unique sequence number.  This
sequence number is scoped to a single view version.
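
To make the shape concrete, here is a rough sketch of how I picture the
query tree lineage. The class and field names (LineageEntry, ref_type,
ref_value) and the example identifiers are my own placeholders, not
anything we agreed on, and treating the same table at two different refs
as two entries is my assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LineageEntry:
    # Hypothetical field names, for illustration only.
    sequence_number: int              # unique only within one view version
    identifier: str                   # fully qualified name, not a UUID
    kind: str                         # "table" or "view"
    ref_type: Optional[str] = None    # tables only (time travel)
    ref_value: Optional[str] = None   # ref name or timestamp


def build_lineage(sources):
    """Turn (identifier, kind, ref_type, ref_value) tuples, in the order
    they appear in the query tree, into a distinct numbered list."""
    seen = {}
    for identifier, kind, ref_type, ref_value in sources:
        key = (identifier, ref_type, ref_value)
        if key not in seen:
            seen[key] = LineageEntry(
                len(seen) + 1, identifier, kind, ref_type, ref_value
            )
    return list(seen.values())


lineage = build_lineage([
    ("prod.db.orders", "table", None, None),
    ("prod.db.orders", "table", None, None),   # duplicate reference
    ("prod.db.recent_orders", "view", None, None),
])
```

Note that the view entry carries no version; that only shows up in the
materialization lineage state.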

Materialization Lineage State
- It's basically a lookup table for the above sequence number to either a
table snapshot id or view version that was used at the time of
creating/refreshing the storage table.  For views, these are nested views
within the MV's query tree - not the MV itself.
- Stored inside the storage table's snapshot summary.
- Additional property "refresh-version-id" to identify the MV's version.
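
As a sketch of what could end up in the snapshot summary: the
"refresh-version-id" property is the one named above, while the
"lineage-state" key and the "snapshot-id" / "version-id" field names are
placeholders I made up. I'm assuming the summary holds string values, so
the lookup table is JSON-encoded as discussed in the meeting.

```python
import json


def build_summary_properties(refresh_version_id, state_by_sequence):
    """Build the summary properties written when the storage table is
    created or refreshed.

    state_by_sequence maps a lineage sequence number to the table
    snapshot id or nested view version that was read.
    """
    lineage_state = {
        str(seq): state for seq, state in sorted(state_by_sequence.items())
    }
    return {
        "refresh-version-id": str(refresh_version_id),
        # Hypothetical key name; the value is a JSON-encoded string.
        "lineage-state": json.dumps(lineage_state, sort_keys=True),
    }


props = build_summary_properties(
    3,
    {
        1: {"snapshot-id": 111},   # a source table
        2: {"version-id": 7},      # a nested view, not the MV itself
    },
)
```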

In order to validate the freshness of a materialization, everything above
has to be checked against the latest tables and views.  This should cover
all data and query tree changes (that I can think of), such as the "limit
100" example I gave in Slack:
<https://apache-iceberg.slack.com/archives/C06LPRD60EL/p1717476837288479?thread_ts=1717173133.294819&cid=C06LPRD60EL>
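
Roughly, I imagine the freshness check looking like the sketch below. The
function name, argument shapes, and example values are hypothetical, and
how the current snapshot ids and view versions get resolved from the
catalog is left out.

```python
def is_fresh(mv_version_id, refresh_version_id,
             lineage, lineage_state, current_state):
    """Return True if the materialization still matches its sources.

    lineage:        {sequence number: fully qualified identifier}
    lineage_state:  {sequence number: snapshot id or view version
                     recorded at refresh time}
    current_state:  {identifier: current snapshot id or view version}
    """
    # The MV's own SQL changed since the last refresh.
    if mv_version_id != refresh_version_id:
        return False
    # The lookup table must cover exactly the entries in the lineage.
    if set(lineage) != set(lineage_state):
        return False
    for seq, identifier in lineage.items():
        # A dropped source resolves to None here and counts as stale.
        if current_state.get(identifier) != lineage_state[seq]:
            return False
    return True


lineage = {1: "prod.db.orders", 2: "prod.db.recent_orders"}
recorded = {1: 111, 2: 7}
current = {"prod.db.orders": 111, "prod.db.recent_orders": 7}
fresh = is_fresh(3, 3, lineage, recorded, current)
```

Because the lineage resolves by identifier, a dropped and re-created
source table simply shows up with a different snapshot id and makes the
materialization stale rather than breaking the lineage.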

Please let me know your thoughts.

Thanks

On Thu, Jun 6, 2024 at 7:53 AM <russell.spit...@gmail.com> wrote:

> Thanks for hosting; it was a very helpful meeting. I really hope we can do
> more in the future to accelerate consensus on other proposals.
>
>
> I do encourage anyone on the mailing list to add comments offline as
> well, especially if you have strong feelings. Iceberg is an open project,
> and we realize not everyone can attend virtual meetings; we want you to
> know you are welcome.
>
>
>
> On Jun 6, 2024, at 7:11 AM, Jan Kaul <jank...@mailbox.org.invalid> wrote:
>
>
> Hi all,
>
> thanks to all of you who attended the meeting yesterday! It was great to
> talk to you and I think we made great progress. For those of you who
> weren't able to attend the meeting, I summarized the main points below:
>
> *Question 1*: Should we store the "storage table pointer" as a view
> property or as an additional field in the view metadata?
>
> We reached consensus to add a *new metadata field* "storage-table" to the view
> version <https://iceberg.apache.org/view-spec/#versions> record that
> stores the identifier of the storage table. The motivation for
> introducing a new field is that this emphasizes that materialized views are
> part of the standard and it enforces a common behavior.
>
> *Question 2*: Where should the lineage-state information be stored?
>
> We reached consensus on storing the lineage-state information in the *snapshot
> summary* of the storage table. The motivation behind this is that the
> table spec should not be concerned with defining view constructs.
>
> *Question 3*: How should the lineage-state information be represented?
>
> We reached consensus on representing the lineage-state in the form of
> nested objects and storing these as a *JSON-encoded string* inside the
> storage table snapshot summary.
>
> Additionally, Dan proposed to introduce a new lineage construct as part of
> the view definition in addition to the lineage-state that is part of the
> storage table. The idea is to separate the concerns. The lineage-state in
> the storage table should only capture the state of the source tables at the
> time of the last refresh, whereas the lineage information in the view
> contains more information about the source tables and is responsible for
> resolving the identifiers. We haven't really decided on how the new lineage
> construct should be represented or integrated into the view metadata.
>
> One point that we didn't really have the time to discuss was Benny's
> comment about also storing the version-id of views in the case that the
> materialized view is referencing a view. I think we should also integrate
> that into the spec.
>
> You can find the recording of the meeting here:
>
>
> https://drive.google.com/file/d/1DE09tYS28L3xL_NgnM9g0Olbe6aHza5G/view?usp=sharing
>
> Best wishes,
>
> Jan
>
>
