Thanks Jan, Micah, and Karuppayya for chiming in. I do not think 3 and 4 are at odds with each other (for example maintaining both lineage map and state map through UUID can achieve both). Also, I do not think we can drop the lineage map since in many catalogs, the only lookup method is by the catalog identifier, and not the UUID.
I think if we go with UUIDs in the state, we should have a lineage map (from identifiers to UUIDs) to go with it. Thanks, Walaa. On Thu, Aug 15, 2024 at 1:45 PM karuppayya <karuppayya1...@gmail.com> wrote: > +1 to storing the refresh state as a map of UUIDs to snapshot IDs, and > deferring the inclusion of lineage to a future iteration.(like Micha > mentioned) > This would greatly simplify the current design. > > Also in terms of identifiers to use(UUID or catalog identifier) for the > refresh state > We will not be able to fetch the table/View using the UUID alone, for > example from Hive based catalog. > We do not have the direct mapping between UUID and table/view. > Which leaves us only with the catalog identifiers? > > Thanks & Regards > Karuppayya > > > On Thu, Aug 15, 2024 at 9:16 AM Micah Kornfield <emkornfi...@gmail.com> > wrote: > >> I think it might be worth restating perceived requirements and making >> sure there is alignment on them. >> >> If I am reading correctly, I think the following are perceived >> requirements: >> 1. An engine must be able to unambiguously detect that an underlying >> queried entity has changed or not via metadata to decide if materialized >> table data can be used. >> 2. The number of sequential catalog reads an engine needs to make to make >> use of a materialized table state at read time is minimized. >> 3. Engines that don't understand a SQL dialect can still use MV >> information if it is not stale. >> 4. Table refs (catalog identifiers) should not appear in the materialized >> table metadata (i.e. state). >> 5. The view part of the MV definition should not need a new revision for >> any changes to objects it queries as long as their schemas stay compatible >> (only state information on the materialized table need to change). >> >> In my mind, requirement 1, is the only true requirement. I think this >> necessitates having UUID + snapshot ID as part of the state information >> (not necessarily part of the Lineage). I think it also necessitates having >> a denormalized view of all entities that are inputs into the MV in the >> state information (a view object might not change but its underlying tables >> or views could change and that must be detected). >> >> Requirements 2 and 5 are somewhat at odds with each other. If >> information is denormalized (fully expanded) in Lineage, it means if table >> information is somehow dropped from an intermediate view, one would need to >> update the view (or make excess calls to the catalog). In my mind, this >> argues for normalization of the lineage stored on the view (with the cost >> of potentially 1 additional serial catalog lookup once the state >> information is retrieved). >> >> I think #3 is at odds with #4. I think #3 is more worthwhile, then >> keeping #4 (and as Jan noted #4 adds complexity). >> >> I think the last remaining question is if lineage serves any purpose. I >> think it is useful for the following reasons: >> a) When there are no intermediate views queried, it allows for fully >> parallelized lookup calls to the catalog without having to parse the SQL >> statement first >> b) Allows tools that don't need to lookup state information or parse >> SQL but still navigate MV/view trees. >> >> Both of these seem relatively minor, so lineage could perhaps be left out >> in the first iteration. >> >> As it applies to Jan's questions: >> >> 1. Should we move the identifiers out of the refresh-state into a new >>> lineage record that is stored as part of the view metadata? >> >> No, I don't think so, I think #5 is a reasonable requirement and I think >> this violates it. >> >> >>> 2. If yes, should the lineage in the view be fully expanded? >> >> No, I think only the state should be fully expanded (for reasons >> mentioned above, it potentially requires more updates to the view then >> necessary). >> >> >>> 3. What should be used as an identifier in the lineage to reference >>> entries in the refresh-state? >> >> >> Catalog identifiers make sense to me. If we agree requirement #3 is not >> a requirement then it seems like this could also be UUIDs. >> >> Thanks, >> Micah >> >> On Thu, Aug 15, 2024 at 7:57 AM Benny Chow <btc...@gmail.com> wrote: >> >>> If we go with either UUID or Table Identifier + VersionID/SnapshotId in >>> the refresh state, then this list is fully expanded already. So, to >>> validate the freshness of a materialization, the engine doesn't even need >>> to look at the view lineage. IMO, the view lineage is nice to have but not >>> a necessary requirement for MVs. The view lineage makes sharing of views >>> between engines without common SQL dialects possible. >>> >>> Benny >>> >>> On Thu, Aug 15, 2024 at 12:22 AM Jan Kaul <jank...@mailbox.org.invalid> >>> wrote: >>> >>>> Hi all, >>>> >>>> I would like to reemphasize the purpose of the refresh-state for >>>> materialized views. The purpose is to determine if the precomputed data is >>>> fresh, stale or invalid. For that the current snapshot-id of every table in >>>> the query tree has to be fetched from the catalog by using its full >>>> identifier and ref. Additionally the refresh state stores the snapshot-id >>>> of the last refresh. >>>> >>>> To summarize: *To determine the freshness of the precomputed data we >>>> require the full identifier + ref and snapshot-id of the last refresh for >>>> every table in the fully expanded query tree* >>>> >>>> This is a requirement from how the catalog works and independent from >>>> how we design the lineage/refresh state. Additionally we previously agreed >>>> that we should be able to obtain the full list of identifiers without >>>> needing to parse the SQL definition. >>>> >>>> Now we are having a discussion in how to store and obtain the fully >>>> expanded list of table identifiers and snapshot-ids. To move the discussion >>>> forward I think it would be valuable to answer the following 3 questions: >>>> >>>> 1. Should we move the identifiers out of the refresh-state into a new >>>> lineage record that is stored as part of the view metadata? >>>> >>>> 2. If yes, should the lineage in the view be fully expanded? >>>> >>>> 3. What should be used as an identifier in the lineage to reference >>>> entries in the refresh-state? >>>> >>>> 1. Question: >>>> >>>> We already agreed that this would be a good idea because we wouldn't >>>> introduce the identifier concept to the table metadata. However, looking at >>>> the complexity that comes with the alternatives, I would like to keep this >>>> question open. >>>> >>>> 2. Question: >>>> >>>> I'm against using a not fully expanded lineage in the view struct. To >>>> recall we require every identifier in the fully expanded query tree to >>>> determine the freshness. Not storing all identifiers in the lineage would >>>> mean to recursively call the catalog and expand the query tree at read >>>> time. This can lead to a large overhead for determining the refresh state >>>> compared to expanding the query tree once at creation time and then storing >>>> the fully expanded lineage. >>>> >>>> 3. Question: >>>> >>>> This depends on Question 2. >>>> >>>> For a not fully expanded lineage, the only options would be uuids or >>>> catalog identifiers. >>>> >>>> For a fully expanded lineage the question isn't all that relevant. The >>>> current design specifies that the lineage is a map from an identifier to an >>>> id and the refresh-state is a map from such id to a snapshot-id. For this >>>> to work we don't have to specify which kind of identifier has to be used. >>>> One query engine could use uuids, the other engine sequence-ids. The >>>> important assumption we are making is that every id that is used in the >>>> refresh-state has to be defined in the lineage. >>>> So the question about using uuids is rather, can the query engine trust >>>> that the id defined in the lineage is the uuid of the table. >>>> >>>> >>>> Regarding the complexity that comes from introducing the lineage in the >>>> view I would like to revisit question 1. Introducing the lineage in the >>>> view metadata opens up the question of when should the lineage be fully >>>> expanded. We see that we have 3 options: >>>> >>>> 1. Not fully expanded lineage -> Expansion at read time >>>> >>>> 2. Fully expanded lineage -> Expansion at creation time >>>> >>>> 3. No lineage (use identifiers in refresh-state) -> Expansion at >>>> refresh time >>>> >>>> As reading is expected to be the most frequent operation I see option 1 >>>> as not favorable. As the query engine has to fully expand the query tree >>>> for a refresh anyway, I see option 3 as the most natural. For a refresh >>>> operation the query engine must understand the SQL dialects of all views in >>>> the query tree and therefore is guaranteed to successfully expand the >>>> lineage. This might not be the case at creation time, which makes option 2 >>>> less favorable. >>>> >>>> As can be seen, I'm in favor of just storing the refresh-state as a map >>>> from identifier to snapshot-id and not using the lineage. I know that this >>>> introduces the concept of a catalog identifiers to the table metadata spec, >>>> but in my opinion it is by far the simplest option. >>>> >>>> I'm interested in your opinions. >>>> >>>> Best wishes, >>>> >>>> Jan >>>> On 14.08.24 22:24, Walaa Eldin Moustafa wrote: >>>> >>>> Thanks Benny. For refs, I am +1 to represent them as UUID + optional >>>> ref, although we can iterate ohe exact JSON structure (e.g., another option >>>> is splitting for (UUID) state from (UUID + ref) state into two separate >>>> higher-level fields). >>>> >>>> Generally agree on REFRESH VIEW strategy could be up to the engine, but >>>> it seems like an area where Iceberg could have an opinion/spec on. I will >>>> start a separate thread for that. >>>> >>>> Thanks, >>>> Walaa. >>>> >>>>