Thanks Benny and Walaa for your input. I updated the doc
<https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?usp=sharing>
to account for the changes as far as I understood them. I would appreciate
it if you could have a look and give me some feedback.
If you have some open comments that are not relevant anymore due to the
changes, please close them so that we can clean up the comments section
a bit.
Regards,
Jan
On 07.06.24 08:33, Walaa Eldin Moustafa wrote:
* lineage state JSON structure
On Thu, Jun 6, 2024 at 11:31 PM Walaa Eldin Moustafa
<wa.moust...@gmail.com> wrote:
Hi Benny,
Your understanding is correct.
Another point that we discussed was the type of APIs engines can
use to conveniently update the storage table with view query
results as well as set the snapshot summary on the output snapshot
(one that was produced by the update). We will follow up on that
separately.
Jan, do you want to reflect the lineage + state discussion in the
doc so we can iterate on the lineage JSON structure?
Thanks,
Walaa.
On Thu, Jun 6, 2024 at 9:40 PM Benny Chow <btc...@gmail.com> wrote:
I really enjoyed listening to the replay and hearing
everyone's feedback! I'm in agreement with all 3 consensus
items, especially around Dan's idea to separate the view's
query tree lineage vs materialization's lineage state.
I'll summarize my understanding about the distinction and add
a few comments:
Materialized View's Query Tree Lineage
- It's basically the SQL representation converted to a
distinct list of tables and views.
- Stored inside view versions so if you change the view SQL,
you can include the lineage with that change.
- Tables support time travel so they can optionally include a
ref type and name/timestamp
- Views would NOT include the version (that's part of the
materialization lineage state below)
- I think we should use fully qualified identifiers here
instead of UUIDs. Dropping and re-creating a referenced table
or view doesn't break the view SQL so the lineage should not
be broken either. I also don't think we can support time
travel if we used table UUIDs here.
- Each table or view can be assigned a unique sequence
number. This sequence number is scoped to a single view version.
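To make the bullets above concrete, here is a minimal sketch of how such a query tree lineage could be modeled. Everything in it is an assumption for illustration: the class name `LineageEntry`, its field names, the helper `build_lineage`, and the example identifiers are hypothetical and not part of any agreed spec. It also assumes that a time-travel reference to a table counts as a distinct entry from an unpinned reference to the same table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LineageEntry:
    """One source relation in a view version's query tree lineage (hypothetical shape)."""
    sequence_number: int             # unique only within a single view version
    identifier: str                  # fully qualified name, not a UUID
    kind: str                        # "table" or "view"
    ref_type: Optional[str] = None   # tables only, e.g. "branch", "tag", "timestamp"
    ref_value: Optional[str] = None  # branch/tag name or timestamp


def build_lineage(sources):
    """Collapse (identifier, kind, ref_type, ref_value) tuples into a distinct,
    numbered list. Sequence numbers follow first appearance."""
    seen = {}
    for identifier, kind, ref_type, ref_value in sources:
        if kind == "view" and ref_type is not None:
            # Views carry no version here; that belongs to the lineage state.
            raise ValueError("views carry no version in the query tree lineage")
        key = (identifier, ref_type, ref_value)
        if key not in seen:
            seen[key] = LineageEntry(len(seen), identifier, kind, ref_type, ref_value)
    return list(seen.values())


lineage = build_lineage([
    ("prod.sales.orders", "table", None, None),
    ("prod.sales.orders", "table", None, None),          # duplicate, collapsed
    ("prod.sales.orders", "table", "tag", "eoy-2023"),   # assumed to be distinct
    ("prod.sales.active_customers", "view", None, None),
])
```

Because the entries are keyed by identifier rather than UUID, dropping and re-creating `prod.sales.orders` would leave this list unchanged.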
Materialization Lineage State
- It's basically a lookup table for the above sequence number
to either a table snapshot id or view version that was used at
the time of creating/refreshing the storage table. For views,
these are nested views within the MV's query tree - not the MV
itself.
- Stored inside the table's snapshot summary
- Additional property "refresh-version-id" to identify the
MV's version.
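As a rough illustration of the lookup table described above, the following sketch keeps the state in memory. Only the property name "refresh-version-id" comes from this thread; the class `LineageState`, its methods, and the "snapshot-id" / "version-id" tags are hypothetical names chosen for the example, and the ids are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class LineageState:
    """State captured when the storage table was created or refreshed
    (hypothetical shape, not an agreed format)."""
    refresh_version_id: int  # version of the MV itself at refresh time
    # sequence number -> ("snapshot-id" | "version-id", value)
    entries: Dict[int, Tuple[str, int]] = field(default_factory=dict)

    def record_table(self, sequence_number: int, snapshot_id: int) -> None:
        self.entries[sequence_number] = ("snapshot-id", snapshot_id)

    def record_view(self, sequence_number: int, version_id: int) -> None:
        # Only nested views in the MV's query tree, never the MV itself.
        self.entries[sequence_number] = ("version-id", version_id)


state = LineageState(refresh_version_id=4)
state.record_table(0, 8744736658442914487)  # sequence number 0 is a table
state.record_view(1, 2)                     # sequence number 1 is a nested view
```

The sequence numbers are the only link back to the view version's lineage, so the state itself never needs to repeat identifiers.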
In order to validate the freshness of a materialization,
everything above has to be checked against the latest tables
and views. This should cover all data and query tree changes
(that I can think of) such as the "limit 100" example I gave
in Slack
<https://apache-iceberg.slack.com/archives/C06LPRD60EL/p1717476837288479?thread_ts=1717173133.294819&cid=C06LPRD60EL>.
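The freshness check described above could look roughly like the sketch below. It is only an illustration under stated assumptions: the function `is_fresh`, the dictionary layout of `state`, and the `current_state_of` callback (which stands in for a catalog lookup) are hypothetical, and real engines may well apply looser rules.

```python
def is_fresh(lineage, state, current_view_version_id, current_state_of):
    """Return True if the materialization still reflects the latest sources.

    lineage: sequence number -> fully qualified identifier (from the view version)
    state: {"refresh-version-id": int, "entries": {sequence number: recorded id}}
    current_view_version_id: the MV's current version id
    current_state_of: callable, identifier -> current snapshot id or view version id
    """
    # A change to the MV's own SQL produces a new view version.
    if state["refresh-version-id"] != current_view_version_id:
        return False
    # Every source in the query tree must have a recorded state, and vice versa.
    if set(lineage) != set(state["entries"]):
        return False
    # Each source must still be at the snapshot or version seen at refresh time.
    return all(
        current_state_of(lineage[seq]) == recorded
        for seq, recorded in state["entries"].items()
    )


lineage = {0: "prod.sales.orders", 1: "prod.sales.active_customers"}
state = {"refresh-version-id": 4, "entries": {0: 1001, 1: 2}}
latest = {"prod.sales.orders": 1001, "prod.sales.active_customers": 2}

fresh = is_fresh(lineage, state, 4, latest.get)
```

Resolving sources by identifier at check time means a dropped and re-created table is compared by its new snapshot id, which is the behavior argued for above.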
Please let me know your thoughts.
Thanks
On Thu, Jun 6, 2024 at 7:53 AM <russell.spit...@gmail.com> wrote:
Thanks for hosting; it was a very helpful meeting. I really
hope we can do more in the future to accelerate consensus
on other proposals.
I do encourage anyone on the mailing list to add your
comments offline as well, especially if you have strong
feelings. Iceberg is an open project and we realize not
everyone can attend virtual meetings and want you to know
you are welcome.
On Jun 6, 2024, at 7:11 AM, Jan Kaul
<jank...@mailbox.org.invalid> wrote:
Hi all,
Thanks to all of you who attended the meeting yesterday!
It was great to talk to you and I think we made great
progress. For those of you who weren't able to attend the
meeting, I summarized the main points below:
*Question 1*: Should we store the "storage table pointer"
as a view property or as an additional field in the view
metadata?
We reached consensus to add a *new metadata field*
"storage-table" to the view version
<https://iceberg.apache.org/view-spec/#versions> record
that stores the identifier of the storage table. The
motivation for introducing a new field is that this
emphasizes that materialized views are part of the
standard and it enforces a common behavior.
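For illustration, a view version record with the new field might look like the sketch below. The other field names follow the existing view spec; the shape of the "storage-table" value (a namespace plus a name) is purely an assumption, since the identifier encoding was not settled in the meeting, and the helper `is_materialized_view` and all values are hypothetical.

```python
import json

# A view version record, written as a Python dict that maps 1:1 to JSON.
view_version = {
    "version-id": 1,
    "schema-id": 0,
    "timestamp-ms": 1717653600000,
    "summary": {"engine-name": "example-engine"},
    "representations": [
        {
            "type": "sql",
            "sql": "SELECT customer_id, count(*) AS orders "
                   "FROM sales.orders GROUP BY customer_id",
            "dialect": "spark",
        }
    ],
    "default-namespace": ["sales"],
    # New field from the consensus above; the value shape is an assumption.
    "storage-table": {"namespace": ["sales"], "name": "orders_summary_storage"},
}


def is_materialized_view(version):
    """Hypothetical helper: treat a version with a storage table as materialized."""
    return "storage-table" in version


encoded = json.dumps(view_version, indent=2)
```

With a dedicated field, an engine can tell a materialized view from a plain view without relying on property naming conventions.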
*Question 2*: Where should the lineage-state information
be stored?
We reached consensus on storing the lineage-state
information in the *snapshot summary* of the storage
table. The motivation behind this is that the table spec
should not be concerned with defining view constructs.
*Question 3*: How should the lineage-state information be
represented?
We reached consensus on representing the lineage-state in
the form of nested objects and storing these as a
*JSON-encoded string* inside the storage table snapshot
summary.
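Since snapshot summary values are plain strings, nested objects have to be serialized before they are stored. A minimal sketch of that round trip follows; the summary key "lineage-state", the helper names, and the nested layout are assumptions made for the example, not an agreed format.

```python
import json


def encode_lineage_state(summary, lineage_state, key="lineage-state"):
    """Return a copy of the summary with the nested state stored as one JSON string."""
    out = dict(summary)
    out[key] = json.dumps(lineage_state, sort_keys=True, separators=(",", ":"))
    return out


def decode_lineage_state(summary, key="lineage-state"):
    """Parse the nested state back out of the summary, or None if it is absent."""
    raw = summary.get(key)
    return None if raw is None else json.loads(raw)


# JSON object keys are always strings, so sequence numbers are written as strings.
lineage_state = {
    "0": {"snapshot-id": 1001},
    "1": {"version-id": 2},
}

summary = encode_lineage_state({"operation": "overwrite"}, lineage_state)
restored = decode_lineage_state(summary)
```

Keeping the whole structure under a single key leaves the table spec untouched: to the table, the value is just another opaque summary string.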
Additionally, Dan proposed to introduce a new lineage
construct as part of the view definition in addition to
the lineage-state that is part of the storage table. The
idea is to separate the concerns. The lineage-state in
the storage table should only capture the state of the
source tables at the time of the last refresh, whereas
the lineage information in the view contains more
information about the source tables and is responsible
for resolving the identifiers. We haven't really decided
on how the new lineage construct should be represented or
integrated into the view metadata.
One point that we didn't have time to discuss was Benny's
comment about also storing the version-id of views when the
materialized view references a view. I think we should
integrate that into the spec as well.
You can find the recording of the meeting here:
https://drive.google.com/file/d/1DE09tYS28L3xL_NgnM9g0Olbe6aHza5G/view?usp=sharing
Best wishes,
Jan