Re: [DISCUSS] Materialized Views: Lineage and State information

Benny Chow Mon, 19 Aug 2024 22:54:49 -0700

Hi Walaa, I personally don't see a semantic issue with putting the table
identifiers in the refresh state.  The purpose of the refresh state is to
basically take a snapshot of the table and view versions at the time of
materialization.  Directly using table identifiers seems pretty natural to
me.   So, I'm +1 for:


no lineage
+ refresh-state key = identifier

Thanks
Benny



On Mon, Aug 19, 2024 at 9:19 PM Walaa Eldin Moustafa <[email protected]>
wrote:

> Hi Micah, it is mostly about the typical results of denormalization such
> as data consistency, management complexity, integrity, etc. However, as
> mentioned earlier, the main reason would be the semantic gap around using
> catalog table identifiers as a concept in the table (more specifically
> snapshot summary) spec. Denormalization in this case is a minor issue.
>
> Thanks,
> Walaa.
>
>
> On Mon, Aug 19, 2024 at 4:44 PM Micah Kornfield <[email protected]>
> wrote:
>
>> Thanks Micah, for the latter, I meant the type of denormalization of
>>> repeating a 3-part name as opposed to using an ID.
>>
>>
>> Is the concern here just metadata size or something else?  For size I
>> think if this is really anticipated to be a problem that it is likely for
>> the state map in general, and we could investigate some more sophisticated
>> encodings for the State map even without the overlap. I think maybe this
>> could be handled if it proves to be a problem but hopefully engines are
>> placing a reasonable cap on view depth + number of tables per view which
>> puts an upper bound on overall size.
>>
>> Thanks,
>> Micah
>>
>> On Fri, Aug 16, 2024 at 4:56 PM Walaa Eldin Moustafa <
>> [email protected]> wrote:
>>
>>> Thanks Micah, for the latter, I meant the type of denormalization of
>>> repeating a 3-part name as opposed to using an ID.
>>>
>>> On Fri, Aug 16, 2024 at 4:52 PM Micah Kornfield <[email protected]>
>>> wrote:
>>>
>>>> However, this still does not address the semantic issue which is more
>>>>> fundamental in my opinion. The Iceberg table spec is not aware of catalog
>>>>> table identifiers and this use will be the first break of this 
>>>>> abstraction.
>>>>
>>>>
>>>> IIUC, based on Jan's comments, we are not going to modify the table
>>>> specification.  I thought the state map was effectively opaque metadata
>>>> from the table specification perspective?  If this is the case I feel like
>>>> that is OK and not a blocker, I think by their nature as already discussed
>>>> MVs need catalog information to function properly and the choice to put
>>>> catalog information into the table metadata is pragmatic and preserves
>>>> other desirable properties.  It might be a more important point if we want
>>>> to update the table specification (IMO, I still think it would probably be
>>>> OK).
>>>>
>>>> On a side note, it does not address the denormalization issue either if
>>>>> we ever want to introduce the lineage in the view as a nice-to-have.
>>>>
>>>>
>>>> I think if lineage is introduced to the View metadata, it should only
>>>> hold direct dependencies for the reasons already discussed. IMO, I think
>>>> the potential overlap is OK as they serve two different purposes.
>>>>
>>>> Cheers,
>>>> Micah
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Aug 16, 2024 at 4:27 PM Walaa Eldin Moustafa <
>>>> [email protected]> wrote:
>>>>
>>>>> That is right. I agree that in the case of using catalog identifiers
>>>>> in state information, using them in lineage information would be a
>>>>> nice-to-have and not a requirement.
>>>>>
>>>>> However, this still does not address the semantic issue which is more
>>>>> fundamental in my opinion. The Iceberg table spec is not aware of catalog
>>>>> table identifiers and this use will be the first break of this 
>>>>> abstraction.
>>>>>
>>>>> On a side note, it does not address the denormalization issue either
>>>>> if we ever want to introduce the lineage in the view as a nice-to-have.
>>>>>
>>>>> Thanks,
>>>>> Walaa.
>>>>>
>>>>>
>>>>> On Fri, Aug 16, 2024 at 10:09 AM Jan Kaul <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Walaa,
>>>>>>
>>>>>> I would argue that for the refresh operation the query engine has to
>>>>>> parse the query and then somehow execute it. For a full refresh it will
>>>>>> directly execute the query and for a incremental refresh it will execute 
>>>>>> a
>>>>>> modified version. Therefore it has to fully expand the query tree.
>>>>>>
>>>>>> Best wishes,
>>>>>>
>>>>>> Jan
>>>>>>
>>>>>> Am 16.08.2024 18:13 schrieb Walaa Eldin Moustafa <
>>>>>> [email protected]>:
>>>>>>
>>>>>> Thanks Jan for the summary.
>>>>>>
>>>>>> For this point:
>>>>>>
>>>>>> > For a refresh operation the query engine has to parse the SQL and
>>>>>> fully expand the lineage with it's children anyway.  So the lineage is 
>>>>>> not
>>>>>> strictly required.
>>>>>>
>>>>>> If the lineage is provided at creation time by the respective engine,
>>>>>> the refresh operation does not need to parse the SQL, correct?
>>>>>>
>>>>>> Thanks,
>>>>>> Walaa.
>>>>>>
>>>>>> On Fri, Aug 16, 2024 at 12:24 AM Jan Kaul <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>> As the table I created is not properly shown in the mailing list I'll
>>>>>> reformat the summary of the different drawbacks again:
>>>>>>
>>>>>> Drawbacks of (no lineage, refresh-state key = identifier):
>>>>>>
>>>>>> - introduces catalog identifiers into table metadata (#4)
>>>>>> - query engine has to expand lineage at refresh time (happens anyway)
>>>>>>
>>>>>> Drawbacks of (normalized lineage, refresh-state key = uuid):
>>>>>>
>>>>>> - recursive calls to catalog to expand lineage at read time (#2)
>>>>>> - fragile by requiring child views to have lineage field
>>>>>>
>>>>>> Drawbacks of (denormalized lineage, refresh-state key = uuid):
>>>>>>
>>>>>> - update of materialized view version required if child view is
>>>>>> updated (#5)
>>>>>> On 16.08.24 09:17, Jan Kaul wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Thanks Micah for clearly stating the requirements. I think this gives
>>>>>> better clarity for the discussion.
>>>>>>
>>>>>> It seems like we don't have a solution that satisfies all
>>>>>> requirements at once. So we would need to choose which has the fewest
>>>>>> drawbacks.
>>>>>>
>>>>>> I would like to summarize the different drawbacks that came up in the
>>>>>> discussion.
>>>>>> no lineage
>>>>>> + refresh-state key = identifier
>>>>>> normalized lineage
>>>>>> + refresh-state key = uuid
>>>>>> denormalized lineage
>>>>>> + refresh-state key = uuid
>>>>>> - introduces catalog identifiers into table metadata (#4)
>>>>>> - query engine has to expand lineage at refresh time (happens anyway)
>>>>>> - recursive calls to catalog to expand lineage at read time (#2)
>>>>>> - fragile by requiring child views to have lineage field
>>>>>> - update of materialized view version required if child view is
>>>>>> updated (#5)
>>>>>>
>>>>>> With identifiers as the refresh-state keys, the lineage is not
>>>>>> strictly required and becomes an orthogonal proposal. That's why I left 
>>>>>> it
>>>>>> out if the comparison.
>>>>>>
>>>>>> In my opinion introducing catalog identifiers into the table metadata
>>>>>> (requirement #4) is the least significant drawback as it is not a 
>>>>>> technical
>>>>>> reason but more about semantics. Especially as the identifiers are not
>>>>>> introduced into the table spec but are rather stored in the snapshot
>>>>>> summary. That's why I'm in favor of using the catalog identifiers as the
>>>>>> refresh-state keys.
>>>>>>
>>>>>> Regarding your last point Walaa:
>>>>>>
>>>>>> The option of using catalog identifiers in the state map still
>>>>>> requires keeping lineage information in the view because REFRESH MV needs
>>>>>> the latest fully expanded children (which could have changed from the set
>>>>>> of children currently in the state map), without reparsing the view tree.
>>>>>>
>>>>>> For a refresh operation the query engine has to parse the SQL and
>>>>>> fully expand the lineage with it's children anyway.  So the lineage is 
>>>>>> not
>>>>>> strictly required.
>>>>>>
>>>>>> If I understand correctly, most of you are also in favor of using
>>>>>> catalog identifiers + ref as the refresh-state keys and postponing the
>>>>>> lineage proposal.
>>>>>>
>>>>>> I hope that we can move the discussion forward.
>>>>>>
>>>>>> Jan
>>>>>> On 16.08.24 08:07, Walaa Eldin Moustafa wrote:
>>>>>>
>>>>>> The option of using catalog identifiers in the state map still
>>>>>> requires keeping lineage information in the view because REFRESH MV needs
>>>>>> the latest fully expanded children (which could have changed from the set
>>>>>> of children currently in the state map), without reparsing the view tree.
>>>>>> Therefore, catalog identifiers in the state map, does not eliminate the
>>>>>> need for tracking children in the form of catalog identifiers in the
>>>>>> lineage side (but in this case lineage will be a set instead of just a
>>>>>> map).
>>>>>>
>>>>>> Hence, my concerns with using catalog identifiers (as opposed to
>>>>>> UUIDs) are:
>>>>>> * The fundamental issue where the table spec depends on/refers to the
>>>>>> view spec (because such catalog identifiers are not defined in the table
>>>>>> spec and the only place they have a meaning is in the view spec lineage
>>>>>> information).
>>>>>> * (less fundamental) The denormalization introduced by this
>>>>>> arrangement, where each identifier is 3-parts and all of them repeat in
>>>>>> both lineage info and state map.
>>>>>>
>>>>>> I am not very concerned with recursive expansion (through multiple
>>>>>> calls), as it is always the case with views.
>>>>>>
>>>>>> On a positive note, looks like we agree to move past sequence numbers
>>>>>> :)
>>>>>>
>>>>>> Thanks,
>>>>>> Walaa.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Aug 15, 2024 at 4:07 PM Micah Kornfield <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>> I think given the constraint that catalog lookup has to be by
>>>>>> identifier and not UUID, I'd prefer using identifier in the refresh 
>>>>>> state.
>>>>>> If we use identifiers, we can directly parallelize the catalog calls to
>>>>>> fetch the latest state.  If we use UUID, the engine has to go back to the
>>>>>> MV and possibly additional views to reconstruct the lineage map.  It's 
>>>>>> just
>>>>>> a lot slower and more work for the engine when there is a MV that
>>>>>> references a lot of views (and those views reference additional views).
>>>>>>
>>>>>>
>>>>>> I'm +1 on using catalog identifiers as the key.  As you point out
>>>>>> this is inline with #2 (try to minimize serial catalog lookups) in 
>>>>>> addition
>>>>>> to supporting requirement #3.
>>>>>>
>>>>>> On Thu, Aug 15, 2024 at 3:27 PM Benny Chow <[email protected]> wrote:
>>>>>>
>>>>>> I think given the constraint that catalog lookup has to be by
>>>>>> identifier and not UUID, I'd prefer using identifier in the refresh 
>>>>>> state.
>>>>>> If we use identifiers, we can directly parallelize the catalog calls to
>>>>>> fetch the latest state.  If we use UUID, the engine has to go back to the
>>>>>> MV and possibly additional views to reconstruct the lineage map.  It's 
>>>>>> just
>>>>>> a lot slower and more work for the engine when there is a MV that
>>>>>> references a lot of views (and those views reference additional views).
>>>>>>
>>>>>> Thanks
>>>>>> Benny
>>>>>>
>>>>>>
>>>>>> On Thu, Aug 15, 2024 at 2:14 PM Walaa Eldin Moustafa <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>> Thanks Jan, Micah, and Karuppayya for chiming in.
>>>>>>
>>>>>> I do not think 3 and 4 are at odds with each other (for example
>>>>>> maintaining both lineage map and state map through UUID can achieve 
>>>>>> both).
>>>>>> Also, I do not think we can drop the lineage map since in many catalogs,
>>>>>> the only lookup method is by the catalog identifier, and not the UUID.
>>>>>>
>>>>>> I think if we go with UUIDs in the state, we should have a lineage
>>>>>> map (from identifiers to UUIDs) to go with it.
>>>>>>
>>>>>> Thanks,
>>>>>> Walaa.
>>>>>>
>>>>>>
>>>>>> On Thu, Aug 15, 2024 at 1:45 PM karuppayya <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>> +1 to storing the refresh state as a map of UUIDs to snapshot IDs,
>>>>>> and deferring the inclusion of lineage to a future iteration.(like Micha
>>>>>> mentioned)
>>>>>> This would greatly simplify the current design.
>>>>>>
>>>>>> Also in terms of identifiers to use(UUID or catalog identifier) for
>>>>>> the refresh state
>>>>>> We will not be able to fetch the table/View using the UUID alone, for
>>>>>> example from Hive based catalog.
>>>>>> We do not have the direct mapping between UUID and table/view.
>>>>>> Which leaves us only with the catalog identifiers?
>>>>>>
>>>>>> Thanks & Regards
>>>>>> Karuppayya
>>>>>>
>>>>>>
>>>>>> On Thu, Aug 15, 2024 at 9:16 AM Micah Kornfield <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>> I think it might be worth restating perceived requirements and making
>>>>>> sure there is alignment on them.
>>>>>>
>>>>>> If I am reading correctly, I think the following are perceived
>>>>>> requirements:
>>>>>> 1. An engine must be able to unambiguously detect that an underlying
>>>>>> queried entity has changed or not via metadata to decide if materialized
>>>>>> table data can be used.
>>>>>> 2. The number of sequential catalog reads an engine needs to make to
>>>>>> make use of a materialized table state at read time is minimized.
>>>>>> 3. Engines that don't understand a SQL dialect can still use MV
>>>>>> information if it is not stale.
>>>>>> 4. Table refs (catalog identifiers) should not appear in the
>>>>>> materialized table metadata (i.e. state).
>>>>>> 5. The view part of the MV definition should not need a new revision
>>>>>> for any changes to objects it queries as long as their schemas stay
>>>>>> compatible (only state information on the materialized table need to
>>>>>> change).
>>>>>>
>>>>>> In my mind, requirement 1, is the only true requirement.  I think
>>>>>> this necessitates having UUID + snapshot ID as part of the state
>>>>>> information (not necessarily part of the Lineage).  I think it also
>>>>>> necessitates having a denormalized view of all entities that are inputs
>>>>>> into the MV in the state information (a view object might not change but
>>>>>> its underlying tables or views could change and that must be detected).
>>>>>>
>>>>>> Requirements 2 and 5 are somewhat at odds with each other.  If
>>>>>> information is denormalized (fully expanded) in Lineage, it means if 
>>>>>> table
>>>>>> information is somehow dropped from an intermediate view, one would need 
>>>>>> to
>>>>>> update the view (or make excess calls to the catalog). In my mind, this
>>>>>> argues for normalization of the lineage stored on the view (with the cost
>>>>>> of potentially 1 additional serial catalog lookup once the state
>>>>>> information is retrieved).
>>>>>>
>>>>>> I think #3 is at odds with #4.  I think #3 is more worthwhile, then
>>>>>> keeping #4 (and as Jan noted #4 adds complexity).
>>>>>>
>>>>>> I think the last remaining question is if lineage serves any
>>>>>> purpose.  I think it is useful for the following reasons:
>>>>>> a)  When there are no intermediate views queried, it allows for fully
>>>>>> parallelized lookup calls to the catalog without having to parse the SQL
>>>>>> statement first
>>>>>> b)  Allows tools that don't need to lookup state information  or
>>>>>> parse SQL but still navigate MV/view trees.
>>>>>>
>>>>>> Both of these seem relatively minor, so lineage could perhaps be left
>>>>>> out in the first iteration.
>>>>>>
>>>>>> As it applies to Jan's questions:
>>>>>>
>>>>>> 1. Should we move the identifiers out of the refresh-state into a new
>>>>>> lineage record that is stored as part of the view metadata?
>>>>>>
>>>>>> No, I don't think so, I think #5 is a reasonable requirement and I
>>>>>> think this violates it.
>>>>>>
>>>>>>
>>>>>> 2. If yes, should the lineage in the view be fully expanded?
>>>>>>
>>>>>> No, I think only the state should be fully expanded (for reasons
>>>>>> mentioned above, it potentially requires more updates to the view then
>>>>>> necessary).
>>>>>>
>>>>>>
>>>>>> 3. What should be used as an identifier in the lineage to reference
>>>>>> entries in the refresh-state?
>>>>>>
>>>>>>
>>>>>> Catalog identifiers make sense to me.  If we agree requirement #3 is
>>>>>> not a requirement then it seems like this could also be UUIDs.
>>>>>>
>>>>>> Thanks,
>>>>>> Micah
>>>>>>
>>>>>> On Thu, Aug 15, 2024 at 7:57 AM Benny Chow <[email protected]> wrote:
>>>>>>
>>>>>> If we go with either UUID or Table Identifier + VersionID/SnapshotId
>>>>>> in the refresh state, then this list is fully expanded already.  So, to
>>>>>> validate the freshness of a materialization, the engine doesn't even need
>>>>>> to look at the view lineage.  IMO, the view lineage is nice to have but 
>>>>>> not
>>>>>> a necessary requirement for MVs.  The view lineage makes sharing of views
>>>>>> between engines without common SQL dialects possible.
>>>>>>
>>>>>> Benny
>>>>>>
>>>>>> On Thu, Aug 15, 2024 at 12:22 AM Jan Kaul
>>>>>> <[email protected]> <[email protected]> wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I would like to reemphasize the purpose of the refresh-state for
>>>>>> materialized views. The purpose is to determine if the precomputed data 
>>>>>> is
>>>>>> fresh, stale or invalid. For that the current snapshot-id of every table 
>>>>>> in
>>>>>> the query tree has to be fetched from the catalog by using its full
>>>>>> identifier and ref. Additionally the refresh state stores the snapshot-id
>>>>>> of the last refresh.
>>>>>>
>>>>>> To summarize: *To determine the freshness of the precomputed data we
>>>>>> require the full identifier + ref and snapshot-id of the last refresh for
>>>>>> every table in the fully expanded query tree*
>>>>>>
>>>>>> This is a requirement from how the catalog works and independent from
>>>>>> how we design the lineage/refresh state. Additionally we previously 
>>>>>> agreed
>>>>>> that we should be able to obtain the full list of identifiers without
>>>>>> needing to parse the SQL definition.
>>>>>>
>>>>>> Now we are having a discussion in how to store and obtain the fully
>>>>>> expanded list of table identifiers and snapshot-ids. To move the 
>>>>>> discussion
>>>>>> forward I think it would be valuable to answer the following 3 questions:
>>>>>>
>>>>>> 1. Should we move the identifiers out of the refresh-state into a new
>>>>>> lineage record that is stored as part of the view metadata?
>>>>>>
>>>>>> 2. If yes, should the lineage in the view be fully expanded?
>>>>>>
>>>>>> 3. What should be used as an identifier in the lineage to reference
>>>>>> entries in the refresh-state?
>>>>>>
>>>>>> 1. Question:
>>>>>>
>>>>>> We already agreed that this would be a good idea because we wouldn't
>>>>>> introduce the identifier concept to the table metadata. However, looking 
>>>>>> at
>>>>>> the complexity that comes with the alternatives, I would like to keep 
>>>>>> this
>>>>>> question open.
>>>>>>
>>>>>> 2. Question:
>>>>>>
>>>>>> I'm against using a not fully expanded lineage in the view struct. To
>>>>>> recall we require every identifier in the fully expanded query tree to
>>>>>> determine the freshness. Not storing all identifiers in the lineage would
>>>>>> mean to recursively call the catalog and expand the query tree at read
>>>>>> time. This can lead to a large overhead for determining the refresh state
>>>>>> compared to expanding the query tree once at creation time and then 
>>>>>> storing
>>>>>> the fully expanded lineage.
>>>>>>
>>>>>> 3. Question:
>>>>>>
>>>>>> This depends on Question 2.
>>>>>>
>>>>>> For a not fully expanded lineage, the only options would be uuids or
>>>>>> catalog identifiers.
>>>>>>
>>>>>> For a fully expanded lineage the question isn't all that relevant.
>>>>>> The current design specifies that the lineage is a map from an identifier
>>>>>> to an id and the refresh-state is a map from such id to a snapshot-id. 
>>>>>> For
>>>>>> this to work we don't have to specify which kind of identifier has to be
>>>>>> used. One query engine could use uuids, the other engine sequence-ids. 
>>>>>> The
>>>>>> important assumption we are making is that every id that is used in the
>>>>>> refresh-state has to be defined in the lineage.
>>>>>> So the question about using uuids is rather, can the query engine
>>>>>> trust that the id defined in the lineage is the uuid of the table.
>>>>>>
>>>>>>
>>>>>> Regarding the complexity that comes from introducing the lineage in
>>>>>> the view I would like to revisit question 1. Introducing the lineage in 
>>>>>> the
>>>>>> view metadata opens up the question of when should the lineage be fully
>>>>>> expanded. We see that we have 3 options:
>>>>>>
>>>>>> 1. Not fully expanded lineage -> Expansion at read time
>>>>>>
>>>>>> 2. Fully expanded lineage -> Expansion at creation time
>>>>>>
>>>>>> 3. No lineage (use identifiers in refresh-state) -> Expansion at
>>>>>> refresh time
>>>>>>
>>>>>> As reading is expected to be the most frequent operation I see option
>>>>>> 1 as not favorable. As the query engine has to fully expand the query 
>>>>>> tree
>>>>>> for a refresh anyway, I see option 3 as the most natural. For a refresh
>>>>>> operation the query engine must understand the SQL dialects of all views 
>>>>>> in
>>>>>> the query tree and therefore is guaranteed to successfully expand the
>>>>>> lineage. This might not be the case at creation time, which makes option 
>>>>>> 2
>>>>>> less favorable.
>>>>>>
>>>>>> As can be seen, I'm in favor of just storing the refresh-state as a
>>>>>> map from identifier to snapshot-id and not using the lineage. I know that
>>>>>> this introduces the concept of a catalog identifiers to the table 
>>>>>> metadata
>>>>>> spec, but in my opinion it is by far the simplest option.
>>>>>>
>>>>>> I'm interested in your opinions.
>>>>>>
>>>>>> Best wishes,
>>>>>>
>>>>>> Jan
>>>>>> On 14.08.24 22:24, Walaa Eldin Moustafa wrote:
>>>>>>
>>>>>> Thanks Benny. For refs, I am +1 to represent them as UUID + optional
>>>>>> ref, although we can iterate ohe exact JSON structure (e.g., another 
>>>>>> option
>>>>>> is splitting for (UUID) state from (UUID + ref) state into two separate
>>>>>> higher-level fields).
>>>>>>
>>>>>> Generally agree on REFRESH VIEW strategy could be up to the engine,
>>>>>> but it seems like an area where Iceberg could have an opinion/spec on. I
>>>>>> will start a separate thread for that.
>>>>>>
>>>>>> Thanks,
>>>>>> Walaa.
>>>>>>
>>>>>>
>>>>>>

Re: [DISCUSS] Materialized Views: Lineage and State information

Reply via email to