Re: [DISCUSS] Iceberg Materialzied Views

Walaa Eldin Moustafa Tue, 03 Sep 2024 23:29:08 -0700

Hi Dan, Thanks for the feedback.

For Drop/Recreate, as long as we verify the UUID is still current (i.e.,
the table in the catalog still maintains the same UUID as in the one in the
lineage), we can assume the table is not dropped and recreated. Once it is
dropped or recreated, the UUID in the catalog will diverge from the UUID in
the lineage, and hence we can determine the MV is stale.


Now if we want to support an MV refresh after the Drop/Recreate of child
table/view, we have to either refresh the UUID in the lineage, or create a
new version of the (main) view with updated UUID in the lineage. This
problem is common with other view constructs like schema: when underlying
tables change, parent view schemas become stale, and need to be refreshed,
either in place, or through a new version. I agree we need a wider design
discussion on the behavior of refreshing views (not to be confused with
refreshing materialized views). Whatever solution we adopt for view refresh
should be possible to leverage within MV refresh upon Drop/Recreate or
child tables/views.

Thanks,
Walaa.


On Tue, Sep 3, 2024 at 10:52 PM Jan Kaul <jank...@mailbox.org.invalid>
wrote:

> Hi all,
>
> I would like to mention that Walaa's document doesn't mention any
> downsides of Approach #1 in the comparison.
>
> Approach #1 has downsides when views are referenced in the materialized
> view query. These problems arise because only direct children are stored in
> the lineage. This leads to:
>
> - The lineage of child views has to be expanded everytime the freshness of
> the materialized views is to be determined. Resulting in recursive catalog
> calls at read time
>
> - the lineage field has to be present in child views, otherwise it's not
> possible to obtain the table identifiers.
>
> Thanks,
>
> Jan
>
> Am 04.09.2024 06:28 schrieb Daniel Weeks <dwe...@apache.org>:
>
> I'm generally in favor of approach #1 with UUID in lineage.
>
> I think it's helpful to know if the underlying table changes (e.g.
> identifier remains the same, but the table was changed).  However, I'm not
> sure what the behavior would be in that case.  Any refresh at that point
> would not be able to produce a match between the lineage information and
> the state (without updating the view lineage information).
>
> If we want to support operations like RTAS or Drop/Create on tables
> referenced by a view, #1 might not work.  My understanding is that most
> database systems wouldn't allow that, but it could also be hard to enforce
> without a more complicated catalog implementation.
>
> -Dan
>
> On Tue, Sep 3, 2024 at 11:21 AM Walaa Eldin Moustafa <
> wa.moust...@gmail.com> wrote:
>
> Hi Steven,
>
> Thanks! I see the PR now introduces UUID in the state information. That is
> great progress. I still have slight preference on where to place UUID (in
> lineage vs state), which I summarized in this doc [1] as promised. I wrote
> the doc before UUID was added in the PR, so it still compares with the
> approach that does not leverage UUIDs at all. So in summary the doc
> compares between three approaches:
>
> (1) With UUID in lineage.
> (2) With UUID in state (current approach in the PR).
> (3) Without UUID at all (previous state before the recent PR update).
>
> I have slight preference for (1) as I think it cleaner and useful for
> future applications (as outlined in the doc).
>
> In regards to how to proceed, I think technically we will need a vote.
> Timing of vote depends on when the contributor creating the vote is
> comfortable with starting it.
>
> [1]
> https://docs.google.com/document/d/1ORIaEfoRAV3qBFnIpGsYo8rJur31-1qLB_pHR0po8iU/edit#heading=h.nz6s6ulcd63
>
> Thanks,
> Walaa.
>
>
> On Tue, Sep 3, 2024 at 10:40 AM Steven Wu <stevenz...@gmail.com> wrote:
>
> Walaa, for the listed discussion points, how should we move forward?
> should we have another MV sync meeting?
>
> BTW, Jan's latest spec PR addressed my comment on UUID.
>
> On Mon, Sep 2, 2024 at 4:35 PM Walaa Eldin Moustafa <wa.moust...@gmail.com>
> wrote:
>
> Hi Jan,
>
> I think we need further discussion for a few reasons:
>
> * Semantic issues are significant, especially in a spec. They could create
> gaps that are very hard to fix in the future. I would rather spend more
> time designing them properly than fix the gaps with band-aid solutions in
> the future.
> * Beyond the arguments based on semantics, there are
> technical differences between SQL/catalog table identifier and UUID. We
> need to dive into them and understand which construct is best suited for
> which use cases.
> * After we have asked for further feedback in the previous thread,
> Steven's suggestion incorporated UUIDs [1] (not in the key, but in the
> value, so we should consider that too, and understand the difference
> between using them in the value or in the key).
> * Ryan's suggestion [2] on the MV issue contained UUID in the value (so
> same like the above).
> * The design before the MV community sync up already used the table
> SQL/catalog identifiers as the only way to reference the child tables/views
> in the storage table. During the meeting we agreed to split up this info.
> As a follow-up the split used sequence numbers (so we did go with the
> split, and not using SQL identifiers directly), but we have shown that
> the split using sequence IDs does not work [3].
>
> [1] https://lists.apache.org/thread/9oh2ywlyh3vds74jlvfttxzbcs41fnop
> [2] https://github.com/apache/iceberg/issues/6420#issuecomment-1369280546
> [3]
> https://docs.google.com/document/d/1-OaPqm8ahVT3_OCbVdAPQ_wZ8I3ToeqU3RLUjcyKQM0/edit
>
> Thanks,
> Walaa.
>
>
> On Mon, Sep 2, 2024 at 3:24 AM Jan Kaul <jank...@mailbox.org.invalid>
> wrote:
>
> Hi Walaa,
>
> in my opinion the UUID vs table identifier discussion comes down to the
> question: do we introduce table identifiers in an opaque struct in the
> table snapshot summary or do we accept technical drawbacks?
>
> Using UUIDs with a lineage struct in the view metadata leads to one of the
> following drawbacks:
>
> - If we don't fully expand the query tree in the lineage, we have to
> expand the lineage on every read to determine freshness
>
> - If we fully expand the query tree in the lineage, the view version must
> be updated every time a child view changes
>
> With table identifiers we don't have any technical drawbacks. The fact
> that the query tree has to be expanded for the refresh operation is not a
> drawback, as this also has to be done for UUIDs with a not fully expanded
> query tree.
>
> Again, the question comes down to: do the technical drawbacks outweigh
> introducing table identifiers in the table snapshot summary?
>
> In my opinion yes. And if I understand correctly, most of the other people
> participating in the discussion have a similar stance.
>
> I'm not sure if we really need another document for the discussion.
>
> Best wishes,
>
> Jan
> On 29.08.24 21:10, Walaa Eldin Moustafa wrote:
>
> Hi Jan,
>
> I think we need to close the discussion on the UUID vs table identifier
> options and possibly cast a vote before having a productive discussion on
> the PR. I did not get a chance yet to post the document on the UUID vs
> table identifier discussion. I will do that by next week.
>
> Thanks,
> Walaa.
>
>
> On Thu, Aug 29, 2024 at 5:31 AM Jan Kaul <jank...@mailbox.org.invalid>
> <jank...@mailbox.org.invalid> wrote:
>
> Hi all,
>
> to move the Iceberg Materialzied View Proposal forward, I created a PR
> (https://github.com/apache/iceberg/pull/11041) that adds a section on
> Materialized Views to the View Spec. I hope we can resolve any remaining
> questions there, before we can start the voting process for the Proposal.
>
> It would be great if you could have a look.
>
> Best wishes,
>
> Jan
>
>
>

Re: [DISCUSS] Iceberg Materialzied Views

Reply via email to