Re: [DISCUSS] Materialized Views: Lineage and State information

Benny Chow Thu, 15 Aug 2024 15:27:25 -0700

I think given the constraint that catalog lookup has to be by identifier
and not UUID, I'd prefer using identifier in the refresh state.  If we use
identifiers, we can directly parallelize the catalog calls to fetch the
latest state.  If we use UUID, the engine has to go back to the MV and
possibly additional views to reconstruct the lineage map.  It's just a lot
slower and more work for the engine when there is a MV that references a
lot of views (and those views reference additional views).


Thanks
Benny


On Thu, Aug 15, 2024 at 2:14 PM Walaa Eldin Moustafa <wa.moust...@gmail.com>
wrote:

> Thanks Jan, Micah, and Karuppayya for chiming in.
>
> I do not think 3 and 4 are at odds with each other (for example
> maintaining both lineage map and state map through UUID can achieve both).
> Also, I do not think we can drop the lineage map since in many catalogs,
> the only lookup method is by the catalog identifier, and not the UUID.
>
> I think if we go with UUIDs in the state, we should have a lineage map
> (from identifiers to UUIDs) to go with it.
>
> Thanks,
> Walaa.
>
>
> On Thu, Aug 15, 2024 at 1:45 PM karuppayya <karuppayya1...@gmail.com>
> wrote:
>
>> +1 to storing the refresh state as a map of UUIDs to snapshot IDs, and
>> deferring the inclusion of lineage to a future iteration.(like Micha
>> mentioned)
>> This would greatly simplify the current design.
>>
>> Also in terms of identifiers to use(UUID or catalog identifier) for the
>> refresh state
>> We will not be able to fetch the table/View using the UUID alone, for
>> example from Hive based catalog.
>> We do not have the direct mapping between UUID and table/view.
>> Which leaves us only with the catalog identifiers?
>>
>> Thanks & Regards
>> Karuppayya
>>
>>
>> On Thu, Aug 15, 2024 at 9:16 AM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>>
>>> I think it might be worth restating perceived requirements and making
>>> sure there is alignment on them.
>>>
>>> If I am reading correctly, I think the following are perceived
>>> requirements:
>>> 1. An engine must be able to unambiguously detect that an underlying
>>> queried entity has changed or not via metadata to decide if materialized
>>> table data can be used.
>>> 2. The number of sequential catalog reads an engine needs to make to
>>> make use of a materialized table state at read time is minimized.
>>> 3. Engines that don't understand a SQL dialect can still use MV
>>> information if it is not stale.
>>> 4. Table refs (catalog identifiers) should not appear in the
>>> materialized table metadata (i.e. state).
>>> 5. The view part of the MV definition should not need a new revision for
>>> any changes to objects it queries as long as their schemas stay compatible
>>> (only state information on the materialized table need to change).
>>>
>>> In my mind, requirement 1, is the only true requirement.  I think this
>>> necessitates having UUID + snapshot ID as part of the state information
>>> (not necessarily part of the Lineage).  I think it also necessitates having
>>> a denormalized view of all entities that are inputs into the MV in the
>>> state information (a view object might not change but its underlying tables
>>> or views could change and that must be detected).
>>>
>>> Requirements 2 and 5 are somewhat at odds with each other.  If
>>> information is denormalized (fully expanded) in Lineage, it means if table
>>> information is somehow dropped from an intermediate view, one would need to
>>> update the view (or make excess calls to the catalog). In my mind, this
>>> argues for normalization of the lineage stored on the view (with the cost
>>> of potentially 1 additional serial catalog lookup once the state
>>> information is retrieved).
>>>
>>> I think #3 is at odds with #4.  I think #3 is more worthwhile, then
>>> keeping #4 (and as Jan noted #4 adds complexity).
>>>
>>> I think the last remaining question is if lineage serves any purpose.  I
>>> think it is useful for the following reasons:
>>> a)  When there are no intermediate views queried, it allows for fully
>>> parallelized lookup calls to the catalog without having to parse the SQL
>>> statement first
>>> b)  Allows tools that don't need to lookup state information  or parse
>>> SQL but still navigate MV/view trees.
>>>
>>> Both of these seem relatively minor, so lineage could perhaps be left
>>> out in the first iteration.
>>>
>>> As it applies to Jan's questions:
>>>
>>> 1. Should we move the identifiers out of the refresh-state into a new
>>>> lineage record that is stored as part of the view metadata?
>>>
>>> No, I don't think so, I think #5 is a reasonable requirement and I think
>>> this violates it.
>>>
>>>
>>>> 2. If yes, should the lineage in the view be fully expanded?
>>>
>>> No, I think only the state should be fully expanded (for reasons
>>> mentioned above, it potentially requires more updates to the view then
>>> necessary).
>>>
>>>
>>>> 3. What should be used as an identifier in the lineage to reference
>>>> entries in the refresh-state?
>>>
>>>
>>> Catalog identifiers make sense to me.  If we agree requirement #3 is not
>>> a requirement then it seems like this could also be UUIDs.
>>>
>>> Thanks,
>>> Micah
>>>
>>> On Thu, Aug 15, 2024 at 7:57 AM Benny Chow <btc...@gmail.com> wrote:
>>>
>>>> If we go with either UUID or Table Identifier + VersionID/SnapshotId in
>>>> the refresh state, then this list is fully expanded already.  So, to
>>>> validate the freshness of a materialization, the engine doesn't even need
>>>> to look at the view lineage.  IMO, the view lineage is nice to have but not
>>>> a necessary requirement for MVs.  The view lineage makes sharing of views
>>>> between engines without common SQL dialects possible.
>>>>
>>>> Benny
>>>>
>>>> On Thu, Aug 15, 2024 at 12:22 AM Jan Kaul <jank...@mailbox.org.invalid>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I would like to reemphasize the purpose of the refresh-state for
>>>>> materialized views. The purpose is to determine if the precomputed data is
>>>>> fresh, stale or invalid. For that the current snapshot-id of every table 
>>>>> in
>>>>> the query tree has to be fetched from the catalog by using its full
>>>>> identifier and ref. Additionally the refresh state stores the snapshot-id
>>>>> of the last refresh.
>>>>>
>>>>> To summarize: *To determine the freshness of the precomputed data we
>>>>> require the full identifier + ref and snapshot-id of the last refresh for
>>>>> every table in the fully expanded query tree*
>>>>>
>>>>> This is a requirement from how the catalog works and independent from
>>>>> how we design the lineage/refresh state. Additionally we previously agreed
>>>>> that we should be able to obtain the full list of identifiers without
>>>>> needing to parse the SQL definition.
>>>>>
>>>>> Now we are having a discussion in how to store and obtain the fully
>>>>> expanded list of table identifiers and snapshot-ids. To move the 
>>>>> discussion
>>>>> forward I think it would be valuable to answer the following 3 questions:
>>>>>
>>>>> 1. Should we move the identifiers out of the refresh-state into a new
>>>>> lineage record that is stored as part of the view metadata?
>>>>>
>>>>> 2. If yes, should the lineage in the view be fully expanded?
>>>>>
>>>>> 3. What should be used as an identifier in the lineage to reference
>>>>> entries in the refresh-state?
>>>>>
>>>>> 1. Question:
>>>>>
>>>>> We already agreed that this would be a good idea because we wouldn't
>>>>> introduce the identifier concept to the table metadata. However, looking 
>>>>> at
>>>>> the complexity that comes with the alternatives, I would like to keep this
>>>>> question open.
>>>>>
>>>>> 2. Question:
>>>>>
>>>>> I'm against using a not fully expanded lineage in the view struct. To
>>>>> recall we require every identifier in the fully expanded query tree to
>>>>> determine the freshness. Not storing all identifiers in the lineage would
>>>>> mean to recursively call the catalog and expand the query tree at read
>>>>> time. This can lead to a large overhead for determining the refresh state
>>>>> compared to expanding the query tree once at creation time and then 
>>>>> storing
>>>>> the fully expanded lineage.
>>>>>
>>>>> 3. Question:
>>>>>
>>>>> This depends on Question 2.
>>>>>
>>>>> For a not fully expanded lineage, the only options would be uuids or
>>>>> catalog identifiers.
>>>>>
>>>>> For a fully expanded lineage the question isn't all that relevant. The
>>>>> current design specifies that the lineage is a map from an identifier to 
>>>>> an
>>>>> id and the refresh-state is a map from such id to a snapshot-id. For this
>>>>> to work we don't have to specify which kind of identifier has to be used.
>>>>> One query engine could use uuids, the other engine sequence-ids. The
>>>>> important assumption we are making is that every id that is used in the
>>>>> refresh-state has to be defined in the lineage.
>>>>> So the question about using uuids is rather, can the query engine
>>>>> trust that the id defined in the lineage is the uuid of the table.
>>>>>
>>>>>
>>>>> Regarding the complexity that comes from introducing the lineage in
>>>>> the view I would like to revisit question 1. Introducing the lineage in 
>>>>> the
>>>>> view metadata opens up the question of when should the lineage be fully
>>>>> expanded. We see that we have 3 options:
>>>>>
>>>>> 1. Not fully expanded lineage -> Expansion at read time
>>>>>
>>>>> 2. Fully expanded lineage -> Expansion at creation time
>>>>>
>>>>> 3. No lineage (use identifiers in refresh-state) -> Expansion at
>>>>> refresh time
>>>>>
>>>>> As reading is expected to be the most frequent operation I see option
>>>>> 1 as not favorable. As the query engine has to fully expand the query tree
>>>>> for a refresh anyway, I see option 3 as the most natural. For a refresh
>>>>> operation the query engine must understand the SQL dialects of all views 
>>>>> in
>>>>> the query tree and therefore is guaranteed to successfully expand the
>>>>> lineage. This might not be the case at creation time, which makes option 2
>>>>> less favorable.
>>>>>
>>>>> As can be seen, I'm in favor of just storing the refresh-state as a
>>>>> map from identifier to snapshot-id and not using the lineage. I know that
>>>>> this introduces the concept of a catalog identifiers to the table metadata
>>>>> spec, but in my opinion it is by far the simplest option.
>>>>>
>>>>> I'm interested in your opinions.
>>>>>
>>>>> Best wishes,
>>>>>
>>>>> Jan
>>>>> On 14.08.24 22:24, Walaa Eldin Moustafa wrote:
>>>>>
>>>>> Thanks Benny. For refs, I am +1 to represent them as UUID + optional
>>>>> ref, although we can iterate ohe exact JSON structure (e.g., another 
>>>>> option
>>>>> is splitting for (UUID) state from (UUID + ref) state into two separate
>>>>> higher-level fields).
>>>>>
>>>>> Generally agree on REFRESH VIEW strategy could be up to the engine,
>>>>> but it seems like an area where Iceberg could have an opinion/spec on. I
>>>>> will start a separate thread for that.
>>>>>
>>>>> Thanks,
>>>>> Walaa.
>>>>>
>>>>>

Re: [DISCUSS] Materialized Views: Lineage and State information

Reply via email to