Thanks Jan, Micah, and Karuppayya for chiming in.

I do not think 3 and 4 are at odds with each other (for example
maintaining both lineage map and state map through UUID can achieve both).
Also, I do not think we can drop the lineage map since in many catalogs,
the only lookup method is by the catalog identifier, and not the UUID.

I think if we go with UUIDs in the state, we should have a lineage map
(from identifiers to UUIDs) to go with it.

Thanks,
Walaa.


On Thu, Aug 15, 2024 at 1:45 PM karuppayya <karuppayya1...@gmail.com> wrote:

> +1 to storing the refresh state as a map of UUIDs to snapshot IDs, and
> deferring the inclusion of lineage to a future iteration.(like Micha
> mentioned)
> This would greatly simplify the current design.
>
> Also in terms of identifiers to use(UUID or catalog identifier) for the
> refresh state
> We will not be able to fetch the table/View using the UUID alone, for
> example from Hive based catalog.
> We do not have the direct mapping between UUID and table/view.
> Which leaves us only with the catalog identifiers?
>
> Thanks & Regards
> Karuppayya
>
>
> On Thu, Aug 15, 2024 at 9:16 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> I think it might be worth restating perceived requirements and making
>> sure there is alignment on them.
>>
>> If I am reading correctly, I think the following are perceived
>> requirements:
>> 1. An engine must be able to unambiguously detect that an underlying
>> queried entity has changed or not via metadata to decide if materialized
>> table data can be used.
>> 2. The number of sequential catalog reads an engine needs to make to make
>> use of a materialized table state at read time is minimized.
>> 3. Engines that don't understand a SQL dialect can still use MV
>> information if it is not stale.
>> 4. Table refs (catalog identifiers) should not appear in the materialized
>> table metadata (i.e. state).
>> 5. The view part of the MV definition should not need a new revision for
>> any changes to objects it queries as long as their schemas stay compatible
>> (only state information on the materialized table need to change).
>>
>> In my mind, requirement 1, is the only true requirement.  I think this
>> necessitates having UUID + snapshot ID as part of the state information
>> (not necessarily part of the Lineage).  I think it also necessitates having
>> a denormalized view of all entities that are inputs into the MV in the
>> state information (a view object might not change but its underlying tables
>> or views could change and that must be detected).
>>
>> Requirements 2 and 5 are somewhat at odds with each other.  If
>> information is denormalized (fully expanded) in Lineage, it means if table
>> information is somehow dropped from an intermediate view, one would need to
>> update the view (or make excess calls to the catalog). In my mind, this
>> argues for normalization of the lineage stored on the view (with the cost
>> of potentially 1 additional serial catalog lookup once the state
>> information is retrieved).
>>
>> I think #3 is at odds with #4.  I think #3 is more worthwhile, then
>> keeping #4 (and as Jan noted #4 adds complexity).
>>
>> I think the last remaining question is if lineage serves any purpose.  I
>> think it is useful for the following reasons:
>> a)  When there are no intermediate views queried, it allows for fully
>> parallelized lookup calls to the catalog without having to parse the SQL
>> statement first
>> b)  Allows tools that don't need to lookup state information  or parse
>> SQL but still navigate MV/view trees.
>>
>> Both of these seem relatively minor, so lineage could perhaps be left out
>> in the first iteration.
>>
>> As it applies to Jan's questions:
>>
>> 1. Should we move the identifiers out of the refresh-state into a new
>>> lineage record that is stored as part of the view metadata?
>>
>> No, I don't think so, I think #5 is a reasonable requirement and I think
>> this violates it.
>>
>>
>>> 2. If yes, should the lineage in the view be fully expanded?
>>
>> No, I think only the state should be fully expanded (for reasons
>> mentioned above, it potentially requires more updates to the view then
>> necessary).
>>
>>
>>> 3. What should be used as an identifier in the lineage to reference
>>> entries in the refresh-state?
>>
>>
>> Catalog identifiers make sense to me.  If we agree requirement #3 is not
>> a requirement then it seems like this could also be UUIDs.
>>
>> Thanks,
>> Micah
>>
>> On Thu, Aug 15, 2024 at 7:57 AM Benny Chow <btc...@gmail.com> wrote:
>>
>>> If we go with either UUID or Table Identifier + VersionID/SnapshotId in
>>> the refresh state, then this list is fully expanded already.  So, to
>>> validate the freshness of a materialization, the engine doesn't even need
>>> to look at the view lineage.  IMO, the view lineage is nice to have but not
>>> a necessary requirement for MVs.  The view lineage makes sharing of views
>>> between engines without common SQL dialects possible.
>>>
>>> Benny
>>>
>>> On Thu, Aug 15, 2024 at 12:22 AM Jan Kaul <jank...@mailbox.org.invalid>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I would like to reemphasize the purpose of the refresh-state for
>>>> materialized views. The purpose is to determine if the precomputed data is
>>>> fresh, stale or invalid. For that the current snapshot-id of every table in
>>>> the query tree has to be fetched from the catalog by using its full
>>>> identifier and ref. Additionally the refresh state stores the snapshot-id
>>>> of the last refresh.
>>>>
>>>> To summarize: *To determine the freshness of the precomputed data we
>>>> require the full identifier + ref and snapshot-id of the last refresh for
>>>> every table in the fully expanded query tree*
>>>>
>>>> This is a requirement from how the catalog works and independent from
>>>> how we design the lineage/refresh state. Additionally we previously agreed
>>>> that we should be able to obtain the full list of identifiers without
>>>> needing to parse the SQL definition.
>>>>
>>>> Now we are having a discussion in how to store and obtain the fully
>>>> expanded list of table identifiers and snapshot-ids. To move the discussion
>>>> forward I think it would be valuable to answer the following 3 questions:
>>>>
>>>> 1. Should we move the identifiers out of the refresh-state into a new
>>>> lineage record that is stored as part of the view metadata?
>>>>
>>>> 2. If yes, should the lineage in the view be fully expanded?
>>>>
>>>> 3. What should be used as an identifier in the lineage to reference
>>>> entries in the refresh-state?
>>>>
>>>> 1. Question:
>>>>
>>>> We already agreed that this would be a good idea because we wouldn't
>>>> introduce the identifier concept to the table metadata. However, looking at
>>>> the complexity that comes with the alternatives, I would like to keep this
>>>> question open.
>>>>
>>>> 2. Question:
>>>>
>>>> I'm against using a not fully expanded lineage in the view struct. To
>>>> recall we require every identifier in the fully expanded query tree to
>>>> determine the freshness. Not storing all identifiers in the lineage would
>>>> mean to recursively call the catalog and expand the query tree at read
>>>> time. This can lead to a large overhead for determining the refresh state
>>>> compared to expanding the query tree once at creation time and then storing
>>>> the fully expanded lineage.
>>>>
>>>> 3. Question:
>>>>
>>>> This depends on Question 2.
>>>>
>>>> For a not fully expanded lineage, the only options would be uuids or
>>>> catalog identifiers.
>>>>
>>>> For a fully expanded lineage the question isn't all that relevant. The
>>>> current design specifies that the lineage is a map from an identifier to an
>>>> id and the refresh-state is a map from such id to a snapshot-id. For this
>>>> to work we don't have to specify which kind of identifier has to be used.
>>>> One query engine could use uuids, the other engine sequence-ids. The
>>>> important assumption we are making is that every id that is used in the
>>>> refresh-state has to be defined in the lineage.
>>>> So the question about using uuids is rather, can the query engine trust
>>>> that the id defined in the lineage is the uuid of the table.
>>>>
>>>>
>>>> Regarding the complexity that comes from introducing the lineage in the
>>>> view I would like to revisit question 1. Introducing the lineage in the
>>>> view metadata opens up the question of when should the lineage be fully
>>>> expanded. We see that we have 3 options:
>>>>
>>>> 1. Not fully expanded lineage -> Expansion at read time
>>>>
>>>> 2. Fully expanded lineage -> Expansion at creation time
>>>>
>>>> 3. No lineage (use identifiers in refresh-state) -> Expansion at
>>>> refresh time
>>>>
>>>> As reading is expected to be the most frequent operation I see option 1
>>>> as not favorable. As the query engine has to fully expand the query tree
>>>> for a refresh anyway, I see option 3 as the most natural. For a refresh
>>>> operation the query engine must understand the SQL dialects of all views in
>>>> the query tree and therefore is guaranteed to successfully expand the
>>>> lineage. This might not be the case at creation time, which makes option 2
>>>> less favorable.
>>>>
>>>> As can be seen, I'm in favor of just storing the refresh-state as a map
>>>> from identifier to snapshot-id and not using the lineage. I know that this
>>>> introduces the concept of a catalog identifiers to the table metadata spec,
>>>> but in my opinion it is by far the simplest option.
>>>>
>>>> I'm interested in your opinions.
>>>>
>>>> Best wishes,
>>>>
>>>> Jan
>>>> On 14.08.24 22:24, Walaa Eldin Moustafa wrote:
>>>>
>>>> Thanks Benny. For refs, I am +1 to represent them as UUID + optional
>>>> ref, although we can iterate ohe exact JSON structure (e.g., another option
>>>> is splitting for (UUID) state from (UUID + ref) state into two separate
>>>> higher-level fields).
>>>>
>>>> Generally agree on REFRESH VIEW strategy could be up to the engine, but
>>>> it seems like an area where Iceberg could have an opinion/spec on. I will
>>>> start a separate thread for that.
>>>>
>>>> Thanks,
>>>> Walaa.
>>>>
>>>>

Reply via email to