Great idea, let's try to resolve all issues in the doc by 30th June,
and if there are still some open points we will have another meeting.
Best wishes
Jan
On 20.06.24 19:10, Walaa Eldin Moustafa wrote:
Jan, I think there are a couple of open questions on the doc. Let us
discuss them on the doc for a week, then meet again if they are still
open?
Thanks,
Walaa.
On Fri, Jun 7, 2024 at 12:27 PM Jan Kaul <jank...@mailbox.org.invalid>
wrote:
No, that's great, thank you. I'm thankful for the input.
Jan
On 07.06.2024 17:53, Benny Chow <btc...@gmail.com> wrote:
Looks good, Jan. I'm a bit nitpicky about picking good names, so I
left some comments around that to see what others think.
Thanks
On Fri, Jun 7, 2024 at 2:26 AM Jan Kaul
<jank...@mailbox.org.invalid> wrote:
Thanks Benny and Walaa for your input. I updated the doc
<https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?usp=sharing>
to account for the changes as far as I understood them. I would
appreciate it if you had a look and gave me some feedback.
If you have some open comments that are not relevant
anymore due to the changes, please close them so that we
can clean up the comments section a bit.
Regards,
Jan
On 07.06.24 08:33, Walaa Eldin Moustafa wrote:
* lineage state JSON structure
On Thu, Jun 6, 2024 at 11:31 PM Walaa Eldin Moustafa
<wa.moust...@gmail.com> wrote:
Hi Benny,
Your understanding is correct.
Another point that we discussed was the type of
APIs engines can use to conveniently update the
storage table with view query results as well as
set the snapshot summary on the output snapshot
(one that was produced by the update). We will
follow up on that separately.
Jan, do you want to reflect the lineage + state
discussion in the doc so we can iterate on the
lineage JSON structure?
Thanks,
Walaa.
On Thu, Jun 6, 2024 at 9:40 PM Benny Chow
<btc...@gmail.com> wrote:
I really enjoyed listening to the replay and
hearing everyone's feedback! I'm in agreement
with all 3 consensus items, especially around
Dan's idea to separate the view's query tree
lineage from the materialization's lineage state.
I'll summarize my understanding about the
distinction and add a few comments:
Materialized View's Query Tree Lineage
- It's basically the SQL representation
converted to a distinct list of tables and views.
- Stored inside view versions so if you change
the view SQL, you can include the lineage with
that change.
- Tables support time travel so they can
optionally include a ref type and name/timestamp.
- Views would NOT include the version (that's
part of the materialization lineage state below).
- I think we should use fully qualified
identifiers here instead of UUIDs. Dropping
and re-creating a referenced table or view
doesn't break the view SQL so the lineage
should not be broken either. I also don't
think we can support time travel if we used
table UUIDs here.
- Each table or view can be assigned a unique
sequence number. This sequence number is
scoped to a single view version.
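
As a rough illustration of the points above (a sketch only, not an agreed format): the class name, field names, and example identifiers below are all hypothetical, and treating the same table under two different refs as two distinct entries is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LineageEntry:
    """One source table or view in a view version's query tree lineage."""
    sequence_number: int              # unique only within one view version
    identifier: str                   # fully qualified name, not a UUID
    kind: str                         # "table" or "view"
    ref_type: Optional[str] = None    # tables only, e.g. "branch" or "timestamp"
    ref_value: Optional[str] = None   # ref name or timestamp


def build_lineage(sources):
    """Deduplicate the references found in the SQL and number them."""
    seen = {}
    for identifier, kind, ref_type, ref_value in sources:
        if kind == "view" and ref_type is not None:
            # View versions belong to the materialization lineage state.
            raise ValueError("views carry no ref in the query tree lineage")
        key = (identifier, kind, ref_type, ref_value)
        if key not in seen:
            seen[key] = LineageEntry(len(seen) + 1, identifier, kind,
                                     ref_type, ref_value)
    return list(seen.values())


lineage = build_lineage([
    ("catalog.db.orders", "table", None, None),
    ("catalog.db.customers_v", "view", None, None),
    ("catalog.db.orders", "table", None, None),         # repeated in the SQL
    ("catalog.db.orders", "table", "branch", "audit"),  # same table, other ref
])
```

Because the entries hold names rather than UUIDs, dropping and re-creating catalog.db.orders leaves this list valid.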
Materialization Lineage State
- It's basically a lookup table for the above
sequence number to either a table snapshot id
or view version that was used at the time of
creating/refreshing the storage table. For
views, these are nested views within the MV's
query tree - not the MV itself.
- Stored inside the table's snapshot summary
- Additional property "refresh-version-id" to
identify the MV's version.
In order to validate the freshness of a
materialization, everything above has to be
checked against the latest tables and views.
This should cover all data and query tree
changes (that I can think of) such as the
"limit 100" example I gave in Slack
<https://apache-iceberg.slack.com/archives/C06LPRD60EL/p1717476837288479?thread_ts=1717173133.294819&cid=C06LPRD60EL>.
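
To make the freshness check concrete, here is a minimal sketch. It assumes the lineage state sits in the snapshot summary under a hypothetical "lineage-state" key as a JSON object mapping sequence number to snapshot id or view version id. Only "refresh-version-id" is a name taken from the list above; the function and argument names are made up for illustration.

```python
import json


def is_fresh(summary, mv_version_id, lineage, current_state):
    """Return True if the storage table snapshot still matches its sources.

    summary:        snapshot summary of the storage table (str -> str)
    mv_version_id:  current version id of the materialized view
    lineage:        {sequence number: fully qualified identifier}
    current_state:  {identifier: current snapshot id or view version id}
    """
    # The MV definition itself changed (e.g. a "limit 100" was added).
    if int(summary["refresh-version-id"]) != mv_version_id:
        return False

    recorded = json.loads(summary["lineage-state"])  # JSON keys are strings
    for seq, identifier in lineage.items():
        if identifier not in current_state:
            return False  # source was dropped or cannot be resolved
        if recorded.get(str(seq)) != current_state[identifier]:
            return False  # new table snapshot or new nested view version
    return True


summary = {
    "refresh-version-id": "3",
    "lineage-state": json.dumps({"1": 8801, "2": 5}),
}
lineage = {1: "catalog.db.orders", 2: "catalog.db.customers_v"}
current = {"catalog.db.orders": 8801, "catalog.db.customers_v": 5}
fresh = is_fresh(summary, 3, lineage, current)
```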
Please let me know your thoughts.
Thanks
On Thu, Jun 6, 2024 at 7:53 AM
<russell.spit...@gmail.com> wrote:
Thanks for hosting. It was a very helpful
meeting. I really hope we can do more in
the future to accelerate consensus on
other proposals.
I do encourage anyone on the mailing list
to add your comments offline as well,
especially if you have strong feelings.
Iceberg is an open project. We realize
not everyone can attend virtual meetings,
and we want you to know you are welcome.
On Jun 6, 2024, at 7:11 AM, Jan Kaul
<jank...@mailbox.org.invalid>
wrote:
Hi all,
thanks to all of you who attended the
meeting yesterday! It was great to
talk to you and I think we made great
progress. For those of you who weren't
able to attend the meeting, I
summarized the main points below:
*Question 1*: Should we store the
"storage table pointer" as a view
property or as an additional field in the
view metadata?
We reached consensus to add a *new
metadata field* "storage-table" to the
view version
<https://iceberg.apache.org/view-spec/#versions>
record that stores the identifier of
the storage table. The motivation
for introducing a new field is that
this emphasizes that materialized
views are part of the standard and it
enforces a common behavior.
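
For illustration, a view version record with the new field might look like the sketch below. The existing keys follow the view spec; the shape of the "storage-table" value (a namespace plus a name) and the helper function are assumptions, since the representation of the identifier was not fixed in the meeting.

```python
from typing import Optional

view_version = {
    "version-id": 2,
    "schema-id": 1,
    "timestamp-ms": 1717660800000,
    "summary": {"engine-name": "example-engine"},
    "representations": [
        {
            "type": "sql",
            "sql": "SELECT region, count(*) AS cnt FROM orders GROUP BY region",
            "dialect": "spark",
        }
    ],
    "default-namespace": ["db"],
    # Proposed new field; the identifier shape here is an assumption.
    "storage-table": {"namespace": ["db"], "name": "orders_mv_storage"},
}


def storage_table_of(version: dict) -> Optional[str]:
    """Return the storage table name, or None for a plain view."""
    ident = version.get("storage-table")
    if ident is None:
        return None
    return ".".join([*ident["namespace"], ident["name"]])


storage_table = storage_table_of(view_version)
```

With a dedicated field, every engine can tell a materialized view from a plain view in the same way, instead of relying on a property name that each engine would have to agree on.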
*Question 2*: Where should the
lineage-state information be stored?
We reached consensus on storing the
lineage-state information in the
*snapshot summary* of the storage
table. The motivation behind this is
that the table spec should not be
concerned with defining view constructs.
*Question 3*: How should the
lineage-state information be represented?
We reached consensus on representing
the lineage-state in the form of
nested objects and storing these as a
*JSON-encoded string* inside the
storage table snapshot summary.
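
The snapshot summary is a map from string to string, which is why nested objects have to be serialized. A minimal sketch of the write side follows, assuming a hypothetical "lineage-state" summary key and a made-up nested layout; neither was fixed in the meeting.

```python
import json


def with_lineage_state(summary, lineage_state):
    """Return a copy of the summary with the lineage state added as a string."""
    out = dict(summary)
    # Compact, key-sorted JSON gives a stable string for the same state.
    out["lineage-state"] = json.dumps(
        lineage_state, sort_keys=True, separators=(",", ":")
    )
    return out


# Hypothetical nested layout of the lineage state.
state = {
    "tables": [{"identifier": "db.orders", "snapshot-id": 8801}],
    "views": [{"identifier": "db.customers_v", "version-id": 5}],
}
summary = with_lineage_state({"operation": "overwrite"}, state)
decoded = json.loads(summary["lineage-state"])
```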
Additionally, Dan proposed to
introduce a new lineage construct as
part of the view definition in
addition to the lineage-state that is
part of the storage table. The idea is
to separate the concerns. The
lineage-state in the storage table
should only capture the state of the
source tables at the time of the last
refresh, whereas the lineage
information in the view contains more
information about the source tables
and is responsible for resolving the
identifiers. We haven't really decided
on how the new lineage construct
should be represented or integrated
into the view metadata.
One point that we didn't really have
the time to discuss was Benny's
comment about also storing the version-id
of views when the
materialized view references a
view. I think we should also integrate
that into the spec.
You can find the recording of the
meeting here:
https://drive.google.com/file/d/1DE09tYS28L3xL_NgnM9g0Olbe6aHza5G/view?usp=sharing
Best wishes,
Jan