Re: [DISCUSS] Iceberg Materialzied Views

Benny Chow Fri, 06 Sep 2024 12:19:47 -0700

Hi Steven

Yes, I definitely think #2 is easier and cleaner for both reader and writer
and that lineage is a separate feature all together.  There's no need to
couple materialization state with view lineage.


The other way to look at helping to decide between the two options is
what is the most performant design that allows the reader to quickly plan a
query to use the materialization?

- Engine A may be able to use the MV without parsing the MV's SQL or any
nested SQL.  It can see there is a materialization table, grab the refresh
state and parallelize the catalog calls to validate the staleness.  With
the refresh start time property I suggested in the PR, that could be used
as an approximation of the staleness without even doing the table level
validity checks.

- Engine B (who is a little more sophisticated) would still expand the MV
SQL and nested SQLs to validate course and fine grained access control
policies before deciding it can proceed to use the materialization.  After
that, it's the same as Engine A in terms of being able to parallelize the
table validity checks.  Also, if the catalog supported bulk API extensions,
then those could be leveraged by passing a batch of identifiers, refs
and/or UUIDs.

Thanks
Benny

On Fri, Sep 6, 2024 at 11:24 AM Steven Wu <stevenz...@gmail.com> wrote:

> If the main goal is to determine if the storage table is fresh or stable.
> Let's look at how the writer and reader would behave.
>
> reader
> * check if there is any change to the definitions/versions of view and
> referenced upstream views .
> * if not, reader can check the refresh state info stored in snapshot
> summary of the storage table
>
> writer
> * fully/recursively resolve the lineage to all direct and indirect source
> tables and views. This is needed as referenced source view may have changed.
> * refresh the storage table (full or incremental) with the states of the
> source tables and views stored in the snapshot summary.
>
> It seems that approach #2 (only refresh-state in snapshot summary) is
> easier for both writer and reader.
>
> Trying to see what benefit would approach #1 (separate lineage in view and
> refresh-state in storage table). Walaa's doc mentioned that lineage info
> may be leveraged for general data lineage purposes. It can also be
> satisfied with one extra hop to the storage snapshot summary. Benny also
> suggested that data lineage should be a separate concern. Otherwise, we
> could argue that regular table and view should also store the lineage
> information.
>
> On Tue, Sep 3, 2024 at 11:29 PM Walaa Eldin Moustafa <
> wa.moust...@gmail.com> wrote:
>
>> Hi Dan, Thanks for the feedback.
>>
>> For Drop/Recreate, as long as we verify the UUID is still current (i.e.,
>> the table in the catalog still maintains the same UUID as in the one in the
>> lineage), we can assume the table is not dropped and recreated. Once it is
>> dropped or recreated, the UUID in the catalog will diverge from the UUID in
>> the lineage, and hence we can determine the MV is stale.
>>
>> Now if we want to support an MV refresh after the Drop/Recreate of child
>> table/view, we have to either refresh the UUID in the lineage, or create a
>> new version of the (main) view with updated UUID in the lineage. This
>> problem is common with other view constructs like schema: when underlying
>> tables change, parent view schemas become stale, and need to be refreshed,
>> either in place, or through a new version. I agree we need a wider design
>> discussion on the behavior of refreshing views (not to be confused with
>> refreshing materialized views). Whatever solution we adopt for view refresh
>> should be possible to leverage within MV refresh upon Drop/Recreate or
>> child tables/views.
>>
>> Thanks,
>> Walaa.
>>
>>
>> On Tue, Sep 3, 2024 at 10:52 PM Jan Kaul <jank...@mailbox.org.invalid>
>> wrote:
>>
>>> Hi all,
>>>
>>> I would like to mention that Walaa's document doesn't mention any
>>> downsides of Approach #1 in the comparison.
>>>
>>> Approach #1 has downsides when views are referenced in the materialized
>>> view query. These problems arise because only direct children are stored in
>>> the lineage. This leads to:
>>>
>>> - The lineage of child views has to be expanded everytime the freshness
>>> of the materialized views is to be determined. Resulting in recursive
>>> catalog calls at read time
>>>
>>> - the lineage field has to be present in child views, otherwise it's not
>>> possible to obtain the table identifiers.
>>>
>>> Thanks,
>>>
>>> Jan
>>>
>>> Am 04.09.2024 06:28 schrieb Daniel Weeks <dwe...@apache.org>:
>>>
>>> I'm generally in favor of approach #1 with UUID in lineage.
>>>
>>> I think it's helpful to know if the underlying table changes (e.g.
>>> identifier remains the same, but the table was changed).  However, I'm not
>>> sure what the behavior would be in that case.  Any refresh at that point
>>> would not be able to produce a match between the lineage information and
>>> the state (without updating the view lineage information).
>>>
>>> If we want to support operations like RTAS or Drop/Create on tables
>>> referenced by a view, #1 might not work.  My understanding is that most
>>> database systems wouldn't allow that, but it could also be hard to enforce
>>> without a more complicated catalog implementation.
>>>
>>> -Dan
>>>
>>> On Tue, Sep 3, 2024 at 11:21 AM Walaa Eldin Moustafa <
>>> wa.moust...@gmail.com> wrote:
>>>
>>> Hi Steven,
>>>
>>> Thanks! I see the PR now introduces UUID in the state information. That
>>> is great progress. I still have slight preference on where to place UUID
>>> (in lineage vs state), which I summarized in this doc [1] as promised. I
>>> wrote the doc before UUID was added in the PR, so it still compares with
>>> the approach that does not leverage UUIDs at all. So in summary the doc
>>> compares between three approaches:
>>>
>>> (1) With UUID in lineage.
>>> (2) With UUID in state (current approach in the PR).
>>> (3) Without UUID at all (previous state before the recent PR update).
>>>
>>> I have slight preference for (1) as I think it cleaner and useful for
>>> future applications (as outlined in the doc).
>>>
>>> In regards to how to proceed, I think technically we will need a vote.
>>> Timing of vote depends on when the contributor creating the vote is
>>> comfortable with starting it.
>>>
>>> [1]
>>> https://docs.google.com/document/d/1ORIaEfoRAV3qBFnIpGsYo8rJur31-1qLB_pHR0po8iU/edit#heading=h.nz6s6ulcd63
>>>
>>> Thanks,
>>> Walaa.
>>>
>>>
>>> On Tue, Sep 3, 2024 at 10:40 AM Steven Wu <stevenz...@gmail.com> wrote:
>>>
>>> Walaa, for the listed discussion points, how should we move forward?
>>> should we have another MV sync meeting?
>>>
>>> BTW, Jan's latest spec PR addressed my comment on UUID.
>>>
>>> On Mon, Sep 2, 2024 at 4:35 PM Walaa Eldin Moustafa <
>>> wa.moust...@gmail.com> wrote:
>>>
>>> Hi Jan,
>>>
>>> I think we need further discussion for a few reasons:
>>>
>>> * Semantic issues are significant, especially in a spec. They could
>>> create gaps that are very hard to fix in the future. I would rather spend
>>> more time designing them properly than fix the gaps with band-aid solutions
>>> in the future.
>>> * Beyond the arguments based on semantics, there are
>>> technical differences between SQL/catalog table identifier and UUID. We
>>> need to dive into them and understand which construct is best suited for
>>> which use cases.
>>> * After we have asked for further feedback in the previous thread,
>>> Steven's suggestion incorporated UUIDs [1] (not in the key, but in the
>>> value, so we should consider that too, and understand the difference
>>> between using them in the value or in the key).
>>> * Ryan's suggestion [2] on the MV issue contained UUID in the value (so
>>> same like the above).
>>> * The design before the MV community sync up already used the table
>>> SQL/catalog identifiers as the only way to reference the child tables/views
>>> in the storage table. During the meeting we agreed to split up this info.
>>> As a follow-up the split used sequence numbers (so we did go with the
>>> split, and not using SQL identifiers directly), but we have shown that
>>> the split using sequence IDs does not work [3].
>>>
>>> [1] https://lists.apache.org/thread/9oh2ywlyh3vds74jlvfttxzbcs41fnop
>>> [2]
>>> https://github.com/apache/iceberg/issues/6420#issuecomment-1369280546
>>> [3]
>>> https://docs.google.com/document/d/1-OaPqm8ahVT3_OCbVdAPQ_wZ8I3ToeqU3RLUjcyKQM0/edit
>>>
>>> Thanks,
>>> Walaa.
>>>
>>>
>>> On Mon, Sep 2, 2024 at 3:24 AM Jan Kaul <jank...@mailbox.org.invalid>
>>> wrote:
>>>
>>> Hi Walaa,
>>>
>>> in my opinion the UUID vs table identifier discussion comes down to the
>>> question: do we introduce table identifiers in an opaque struct in the
>>> table snapshot summary or do we accept technical drawbacks?
>>>
>>> Using UUIDs with a lineage struct in the view metadata leads to one of
>>> the following drawbacks:
>>>
>>> - If we don't fully expand the query tree in the lineage, we have to
>>> expand the lineage on every read to determine freshness
>>>
>>> - If we fully expand the query tree in the lineage, the view version
>>> must be updated every time a child view changes
>>>
>>> With table identifiers we don't have any technical drawbacks. The fact
>>> that the query tree has to be expanded for the refresh operation is not a
>>> drawback, as this also has to be done for UUIDs with a not fully expanded
>>> query tree.
>>>
>>> Again, the question comes down to: do the technical drawbacks outweigh
>>> introducing table identifiers in the table snapshot summary?
>>>
>>> In my opinion yes. And if I understand correctly, most of the other
>>> people participating in the discussion have a similar stance.
>>>
>>> I'm not sure if we really need another document for the discussion.
>>>
>>> Best wishes,
>>>
>>> Jan
>>> On 29.08.24 21:10, Walaa Eldin Moustafa wrote:
>>>
>>> Hi Jan,
>>>
>>> I think we need to close the discussion on the UUID vs table identifier
>>> options and possibly cast a vote before having a productive discussion on
>>> the PR. I did not get a chance yet to post the document on the UUID vs
>>> table identifier discussion. I will do that by next week.
>>>
>>> Thanks,
>>> Walaa.
>>>
>>>
>>> On Thu, Aug 29, 2024 at 5:31 AM Jan Kaul <jank...@mailbox.org.invalid>
>>> <jank...@mailbox.org.invalid> wrote:
>>>
>>> Hi all,
>>>
>>> to move the Iceberg Materialzied View Proposal forward, I created a PR
>>> (https://github.com/apache/iceberg/pull/11041) that adds a section on
>>> Materialized Views to the View Spec. I hope we can resolve any remaining
>>> questions there, before we can start the voting process for the Proposal.
>>>
>>> It would be great if you could have a look.
>>>
>>> Best wishes,
>>>
>>> Jan
>>>
>>>
>>>

Re: [DISCUSS] Iceberg Materialzied Views

Reply via email to