Re: [DISCUSS] Materialized Views: Lineage and State information

Benny Chow Wed, 14 Aug 2024 11:53:16 -0700

I'd like to hear Jan's feedback on using UUID and normalizing the view
lineage.  I'm on board with this change.


I updated the fully spec'd out example using UUID and a normalized view
linage:

https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit#heading=h.o6yn2lnpxowg

Thanks a lot Walaa for driving this discussion.

We need to revisit the issue about identifying the refresh table for
snapshot references.  The refresh table needs to include both UUID and the
"ref".  Because of this, I think we should still include the fully
qualified identifier in the refresh state.  If we do this, then the state
looks a lot like the lineage except that it includes the version property
and is fully denormalized.  Alternatively, the refresh table just has UUID
and an optional "ref' and no identifier.  What do you guys think of this?

Regarding dropping and re-creating a dependent view in the MV, I think we
should keep it flexible for the engine to decide what that means for the
materialization validity.  Once the view is re-created, the version-ids
inside are all new and don't match up with the refresh state.  For REFRESH
VIEW, if the view schema changes in an additive way such as new columns
projected from a SELECT *, then MVs built on top could still be valid.
Again, this would be up to the engine to decide on the behavior.

Thanks
Benny

On Tue, Aug 13, 2024 at 11:47 AM Walaa Eldin Moustafa <wa.moust...@gmail.com>
wrote:

> Hi Everyone,
>
> Just a follow up on this thread. Thanks Benny and Micah for the discussion
> on the doc [1]. We have been converging more on using UUIDs from the
> discussion. The only open question was related to UUIDs (of underlying
> views/tables) being stale upon a REPLACE (or DROP and CREATE) operation on
> those tables/views. The response to that question is:
>
> (1) While stale, those UUIDs are still good for freshness calculations.
> (2) If we really want to refresh them, that might be in scope of a REFRESH
> VIEW operation (not to be confused with REFRESH MV).
>
> (2) seems appropriate for a separate discussion since REFRESH VIEW
> operation behavior can be concerned with other things as well like
> refreshing the schema.
>
> Detailed discussion on this question can be found in this thread in the
> doc [2]. If people are okay with the stance above, we can move forward
> towards a vote on using UUIDs. I will leave a couple of days for the
> discussion to wrap up in this thread/the doc.
>
> [1]
> https://docs.google.com/document/d/1-OaPqm8ahVT3_OCbVdAPQ_wZ8I3ToeqU3RLUjcyKQM0/edit
> [2]
> https://docs.google.com/document/d/1-OaPqm8ahVT3_OCbVdAPQ_wZ8I3ToeqU3RLUjcyKQM0/edit?disco=AAABTXtgBFw
>
> Thanks,
> Walaa.
>
>
> On Thu, Aug 8, 2024 at 4:43 PM Walaa Eldin Moustafa <wa.moust...@gmail.com>
> wrote:
>
>> Thanks Benny! We discussed this option during the meeting but we did not
>> prefer it because we did not want to leak the SQL identifiers to the
>> storage table since SQL identifiers are view concepts and fit better with
>> the view.
>>
>> Thanks,
>> Walaa.
>>
>> On Thu, Aug 8, 2024 at 4:12 PM Benny Chow <btc...@gmail.com> wrote:
>>
>>> Maybe a third option is to decouple the view lineage and materialization
>>> state.
>>>
>>> The view lineage can just list out the SQL identifiers+ref... we can
>>> still decide whether this is just direct children or fully expanded.
>>> The materialization state doesn't have to depend on the view lineage
>>> (through either sequence ID or UUIDs).  It's a direct mapping of SQL
>>> identifiers (possibly with refs) to their corresponding snapshot id or view
>>> versions.
>>>
>>> The materialization state was our original design until Dan brought up
>>> the idea to also capture the view lineage.  So, what I am suggesting is
>>> that we do both but not couple them together through any sort of sequence
>>> ID or UUID.
>>>
>>> Thanks
>>> Benny
>>>
>>> On Thu, Aug 8, 2024 at 2:04 PM Walaa Eldin Moustafa <
>>> wa.moust...@gmail.com> wrote:
>>>
>>>> Hi Everyone,
>>>>
>>>> In the last community sync on Materialized Views [1], we agreed to
>>>> split the information that is used to determine the materialized view
>>>> staleness to two parts: Lineage Information and State Information. We have
>>>> made a lot of progress on representing both but one issue remains open:
>>>>
>>>> Both lineage information and state information express facts about the
>>>> child objects of the MV, and to correlate those facts we need some sort of
>>>> an ID for those objects.
>>>> The open question is whether the ID should be of type sequence ID or
>>>> UUID (i.e., the Iceberg UUID of the object).
>>>>
>>>> I have described the issue in more detail in this doc [2]. I have also
>>>> added some thoughts on the advantages of using UUIDs in the same doc.
>>>> Further, there is a discussion thread on that topic in the MV spec doc [3].
>>>>
>>>> Please feel free to comment on the docs or share your thoughts in the
>>>> thread. After the discussion we can move forward to a vote.
>>>>
>>>> Thanks,
>>>> Walaa.
>>>>
>>>> [1] https://lists.apache.org/thread/bhmxo1w1bdp1p2hh842kpm2gy1g5rscp
>>>> [2]
>>>> https://docs.google.com/document/d/1-OaPqm8ahVT3_OCbVdAPQ_wZ8I3ToeqU3RLUjcyKQM0/edit
>>>> [3]
>>>> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?disco=AAABPUgvras
>>>>
>>>

Re: [DISCUSS] Materialized Views: Lineage and State information

Reply via email to