Hi Benny,

Your understanding is correct.

Another point that we discussed was the type of APIs engines can use to
conveniently update the storage table with view query results as well as
set the snapshot summary on the output snapshot (one that was produced by
the update). We will follow up on that separately.

Jan, do you want to reflect the lineage + state discussion in the doc so we
can iterate on the lineage JSON structure?
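
To have something concrete to iterate on, here is a rough Python sketch of
how the two pieces in Benny's summary (quoted below) could fit together.
It is only an illustration under assumptions, not a proposal for the spec:
the "lineage-state" summary key, the LineageEntry fields, and the helper
names are all hypothetical. Only "refresh-version-id" and the idea of a
JSON-encoded string in the snapshot summary come from the discussion.

```python
import json
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class LineageEntry:
    """One entry of the view's query tree lineage (hypothetical shape)."""
    sequence_number: int            # scoped to a single view version
    identifier: str                 # fully qualified name, not a UUID
    kind: str                       # "table" or "view"
    ref_type: Optional[str] = None  # optional time-travel ref for tables
    ref_name: Optional[str] = None


def encode_lineage_state(refresh_version_id: int,
                         state: Dict[int, int]) -> Dict[str, str]:
    """Build snapshot summary properties for the storage table.

    `state` maps a lineage sequence number to the table snapshot id or
    view version id used by the refresh. Summary values are strings, so
    the nested state is stored as a JSON-encoded string.
    """
    encoded = json.dumps({str(seq): ref for seq, ref in sorted(state.items())})
    return {
        "refresh-version-id": str(refresh_version_id),
        "lineage-state": encoded,  # hypothetical key name
    }


def is_fresh(summary: Dict[str, str],
             current_view_version_id: int,
             lineage: List[LineageEntry],
             current_ids: Dict[str, int]) -> bool:
    """Check the recorded state against the latest tables and views.

    `current_ids` maps each fully qualified identifier to its current
    snapshot id (tables) or version id (views); a real engine would
    look these up in the catalog.
    """
    # A changed view SQL (new view version) invalidates the materialization.
    if int(summary["refresh-version-id"]) != current_view_version_id:
        return False
    recorded = json.loads(summary["lineage-state"])
    for entry in lineage:
        if recorded.get(str(entry.sequence_number)) != current_ids[entry.identifier]:
            return False
    return True


lineage = [
    LineageEntry(1, "db.orders", "table"),
    LineageEntry(2, "db.recent_orders", "view"),
]
summary = encode_lineage_state(3, {1: 111, 2: 7})
```

With this shape, a refresh would write `summary` onto the output snapshot,
and a reader would call is_fresh() with the current view version and the
current snapshot/version ids before deciding to use the storage table.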

Thanks,
Walaa.


On Thu, Jun 6, 2024 at 9:40 PM Benny Chow <btc...@gmail.com> wrote:

> I really enjoyed listening to the replay and hearing everyone's feedback!
> I'm in agreement with all 3 consensus items, especially Dan's idea to
> separate the view's query tree lineage from the materialization's lineage
> state.
>
> I'll summarize my understanding about the distinction and add a few
> comments:
>
> Materialized View's Query Tree Lineage
> - It's basically the SQL representation converted to a distinct list of
> tables and views.
> - Stored inside view versions so if you change the view SQL, you can
> include the lineage with that change.
> - Tables support time travel so they can optionally include a ref type and
> name/timestamp
> - Views would NOT include the version (that's part of the materialization
> lineage state below)
> - I think we should use fully qualified identifiers here instead of
> UUIDs.  Dropping and re-creating a referenced table or view doesn't break
> the view SQL so the lineage should not be broken either.  I also don't
> think we can support time travel if we used table UUIDs here.
> - Each table or view can be assigned a unique sequence number.  This
> sequence number is scoped to a single view version.
>
> Materialization Lineage State
> - It's basically a lookup table for the above sequence number to either a
> table snapshot id or view version that was used at the time of
> creating/refreshing the storage table.  For views, these are nested views
> within the MV's query tree - not the MV itself.
> - Stored inside the table's snapshot summary
> - Additional property "refresh-version-id" to identify the MV's version.
>
> In order to validate the freshness of a materialization, everything above
> has to be checked against the latest tables and views.  This should cover
> all data and query tree changes (that I can think of) such as the "limit
> 100" example I gave in Slack
> <https://apache-iceberg.slack.com/archives/C06LPRD60EL/p1717476837288479?thread_ts=1717173133.294819&cid=C06LPRD60EL>
> .
>
> Please let me know your thoughts.
>
> Thanks
>
> On Thu, Jun 6, 2024 at 7:53 AM <russell.spit...@gmail.com> wrote:
>
>> Thanks for hosting, it was a very helpful meeting. I really hope we can do
>> more in the future to accelerate consensus on other proposals.
>>
>>
>>  I do encourage anyone on the mailing list to add your comments offline
>> as well, especially if you have strong feelings. Iceberg is an open project
>> and we realize not everyone can attend virtual meetings and want you to
>> know you are welcome.
>>
>>
>>
>> On Jun 6, 2024, at 7:11 AM, Jan Kaul <jank...@mailbox.org.invalid> wrote:
>>
>>
>> Hi all,
>>
>> thanks to all of you who attended the meeting yesterday! It was great to
>> talk to you and I think we made great progress. For those of you who
>> weren't able to attend the meeting, I summarized the main points below:
>>
>> *Question 1*: Should we store the "storage table pointer" as a view
>> property or as an additional field in the view metadata?
>>
>> We reached consensus to add a *new metadata field* "storage-table" to
>> the view version <https://iceberg.apache.org/view-spec/#versions> record
>> that stores the identifier of the storage table. The motivation for
>> introducing a new field is that this emphasizes that materialized views are
>> part of the standard and it enforces a common behavior.
>>
>> *Question 2*: Where should the lineage-state information be stored?
>>
>> We reached consensus on storing the lineage-state information in the
>> *snapshot summary* of the storage table. The motivation behind this is
>> that the table spec should not be concerned with defining view constructs.
>>
>> *Question 3*: How should the lineage-state information be represented?
>>
>> We reached consensus on representing the lineage-state in the form of
>> nested objects and storing these as a *JSON-encoded string* inside the
>> storage table snapshot summary.
>>
>> Additionally, Dan proposed to introduce a new lineage construct as part
>> of the view definition in addition to the lineage-state that is part of the
>> storage table. The idea is to separate the concerns. The lineage-state in
>> the storage table should only capture the state of the source tables at the
>> time of the last refresh, whereas the lineage information in the view
>> contains more information about the source tables and is responsible for
>> resolving the identifiers. We haven't really decided on how the new lineage
>> construct should be represented or integrated into the view metadata.
>>
>> One point that we didn't really have the time to discuss was Benny's
>> comment about also storing the version-id of views in case the
>> materialized view references a view. I think we should also integrate
>> that into the spec.
>>
>> You can find the recording of the meeting here:
>>
>>
>> https://drive.google.com/file/d/1DE09tYS28L3xL_NgnM9g0Olbe6aHza5G/view?usp=sharing
>>
>> Best wishes,
>>
>> Jan
>>
>>
