Hi Walaa, I personally don't see a semantic issue with putting the table identifiers in the refresh state. The purpose of the refresh state is to basically take a snapshot of the table and view versions at the time of materialization. Directly using table identifiers seems pretty natural to me. So, I'm +1 for:
no lineage + refresh-state key = identifier Thanks Benny On Mon, Aug 19, 2024 at 9:19 PM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote: > Hi Micah, it is mostly about the typical results of denormalization such > as data consistency, management complexity, integrity, etc. However, as > mentioned earlier, the main reason would be the semantic gap around using > catalog table identifiers as a concept in the table (more specifically > snapshot summary) spec. Denormalization in this case is a minor issue. > > Thanks, > Walaa. > > > On Mon, Aug 19, 2024 at 4:44 PM Micah Kornfield <emkornfi...@gmail.com> > wrote: > >> Thanks Micah, for the latter, I meant the type of denormalization of >>> repeating a 3-part name as opposed to using an ID. >> >> >> Is the concern here just metadata size or something else? For size I >> think if this is really anticipated to be a problem that it is likely for >> the state map in general, and we could investigate some more sophisticated >> encodings for the State map even without the overlap. I think maybe this >> could be handled if it proves to be a problem but hopefully engines are >> placing a reasonable cap on view depth + number of tables per view which >> puts an upper bound on overall size. >> >> Thanks, >> Micah >> >> On Fri, Aug 16, 2024 at 4:56 PM Walaa Eldin Moustafa < >> wa.moust...@gmail.com> wrote: >> >>> Thanks Micah, for the latter, I meant the type of denormalization of >>> repeating a 3-part name as opposed to using an ID. >>> >>> On Fri, Aug 16, 2024 at 4:52 PM Micah Kornfield <emkornfi...@gmail.com> >>> wrote: >>> >>>> However, this still does not address the semantic issue which is more >>>>> fundamental in my opinion. The Iceberg table spec is not aware of catalog >>>>> table identifiers and this use will be the first break of this >>>>> abstraction. >>>> >>>> >>>> IIUC, based on Jan's comments, we are not going to modify the table >>>> specification. I thought the state map was effectively opaque metadata >>>> from the table specification perspective? If this is the case I feel like >>>> that is OK and not a blocker, I think by their nature as already discussed >>>> MVs need catalog information to function properly and the choice to put >>>> catalog information into the table metadata is pragmatic and preserves >>>> other desirable properties. It might be a more important point if we want >>>> to update the table specification (IMO, I still think it would probably be >>>> OK). >>>> >>>> On a side note, it does not address the denormalization issue either if >>>>> we ever want to introduce the lineage in the view as a nice-to-have. >>>> >>>> >>>> I think if lineage is introduced to the View metadata, it should only >>>> hold direct dependencies for the reasons already discussed. IMO, I think >>>> the potential overlap is OK as they serve two different purposes. >>>> >>>> Cheers, >>>> Micah >>>> >>>> >>>> >>>> >>>> >>>> On Fri, Aug 16, 2024 at 4:27 PM Walaa Eldin Moustafa < >>>> wa.moust...@gmail.com> wrote: >>>> >>>>> That is right. I agree that in the case of using catalog identifiers >>>>> in state information, using them in lineage information would be a >>>>> nice-to-have and not a requirement. >>>>> >>>>> However, this still does not address the semantic issue which is more >>>>> fundamental in my opinion. The Iceberg table spec is not aware of catalog >>>>> table identifiers and this use will be the first break of this >>>>> abstraction. >>>>> >>>>> On a side note, it does not address the denormalization issue either >>>>> if we ever want to introduce the lineage in the view as a nice-to-have. >>>>> >>>>> Thanks, >>>>> Walaa. >>>>> >>>>> >>>>> On Fri, Aug 16, 2024 at 10:09 AM Jan Kaul <jank...@mailbox.org.invalid> >>>>> wrote: >>>>> >>>>>> Hi Walaa, >>>>>> >>>>>> I would argue that for the refresh operation the query engine has to >>>>>> parse the query and then somehow execute it. For a full refresh it will >>>>>> directly execute the query and for a incremental refresh it will execute >>>>>> a >>>>>> modified version. Therefore it has to fully expand the query tree. >>>>>> >>>>>> Best wishes, >>>>>> >>>>>> Jan >>>>>> >>>>>> Am 16.08.2024 18:13 schrieb Walaa Eldin Moustafa < >>>>>> wa.moust...@gmail.com>: >>>>>> >>>>>> Thanks Jan for the summary. >>>>>> >>>>>> For this point: >>>>>> >>>>>> > For a refresh operation the query engine has to parse the SQL and >>>>>> fully expand the lineage with it's children anyway. So the lineage is >>>>>> not >>>>>> strictly required. >>>>>> >>>>>> If the lineage is provided at creation time by the respective engine, >>>>>> the refresh operation does not need to parse the SQL, correct? >>>>>> >>>>>> Thanks, >>>>>> Walaa. >>>>>> >>>>>> On Fri, Aug 16, 2024 at 12:24 AM Jan Kaul <jank...@mailbox.org.invalid> >>>>>> wrote: >>>>>> >>>>>> As the table I created is not properly shown in the mailing list I'll >>>>>> reformat the summary of the different drawbacks again: >>>>>> >>>>>> Drawbacks of (no lineage, refresh-state key = identifier): >>>>>> >>>>>> - introduces catalog identifiers into table metadata (#4) >>>>>> - query engine has to expand lineage at refresh time (happens anyway) >>>>>> >>>>>> Drawbacks of (normalized lineage, refresh-state key = uuid): >>>>>> >>>>>> - recursive calls to catalog to expand lineage at read time (#2) >>>>>> - fragile by requiring child views to have lineage field >>>>>> >>>>>> Drawbacks of (denormalized lineage, refresh-state key = uuid): >>>>>> >>>>>> - update of materialized view version required if child view is >>>>>> updated (#5) >>>>>> On 16.08.24 09:17, Jan Kaul wrote: >>>>>> >>>>>> Hi, >>>>>> >>>>>> Thanks Micah for clearly stating the requirements. I think this gives >>>>>> better clarity for the discussion. >>>>>> >>>>>> It seems like we don't have a solution that satisfies all >>>>>> requirements at once. So we would need to choose which has the fewest >>>>>> drawbacks. >>>>>> >>>>>> I would like to summarize the different drawbacks that came up in the >>>>>> discussion. >>>>>> no lineage >>>>>> + refresh-state key = identifier >>>>>> normalized lineage >>>>>> + refresh-state key = uuid >>>>>> denormalized lineage >>>>>> + refresh-state key = uuid >>>>>> - introduces catalog identifiers into table metadata (#4) >>>>>> - query engine has to expand lineage at refresh time (happens anyway) >>>>>> - recursive calls to catalog to expand lineage at read time (#2) >>>>>> - fragile by requiring child views to have lineage field >>>>>> - update of materialized view version required if child view is >>>>>> updated (#5) >>>>>> >>>>>> With identifiers as the refresh-state keys, the lineage is not >>>>>> strictly required and becomes an orthogonal proposal. That's why I left >>>>>> it >>>>>> out if the comparison. >>>>>> >>>>>> In my opinion introducing catalog identifiers into the table metadata >>>>>> (requirement #4) is the least significant drawback as it is not a >>>>>> technical >>>>>> reason but more about semantics. Especially as the identifiers are not >>>>>> introduced into the table spec but are rather stored in the snapshot >>>>>> summary. That's why I'm in favor of using the catalog identifiers as the >>>>>> refresh-state keys. >>>>>> >>>>>> Regarding your last point Walaa: >>>>>> >>>>>> The option of using catalog identifiers in the state map still >>>>>> requires keeping lineage information in the view because REFRESH MV needs >>>>>> the latest fully expanded children (which could have changed from the set >>>>>> of children currently in the state map), without reparsing the view tree. >>>>>> >>>>>> For a refresh operation the query engine has to parse the SQL and >>>>>> fully expand the lineage with it's children anyway. So the lineage is >>>>>> not >>>>>> strictly required. >>>>>> >>>>>> If I understand correctly, most of you are also in favor of using >>>>>> catalog identifiers + ref as the refresh-state keys and postponing the >>>>>> lineage proposal. >>>>>> >>>>>> I hope that we can move the discussion forward. >>>>>> >>>>>> Jan >>>>>> On 16.08.24 08:07, Walaa Eldin Moustafa wrote: >>>>>> >>>>>> The option of using catalog identifiers in the state map still >>>>>> requires keeping lineage information in the view because REFRESH MV needs >>>>>> the latest fully expanded children (which could have changed from the set >>>>>> of children currently in the state map), without reparsing the view tree. >>>>>> Therefore, catalog identifiers in the state map, does not eliminate the >>>>>> need for tracking children in the form of catalog identifiers in the >>>>>> lineage side (but in this case lineage will be a set instead of just a >>>>>> map). >>>>>> >>>>>> Hence, my concerns with using catalog identifiers (as opposed to >>>>>> UUIDs) are: >>>>>> * The fundamental issue where the table spec depends on/refers to the >>>>>> view spec (because such catalog identifiers are not defined in the table >>>>>> spec and the only place they have a meaning is in the view spec lineage >>>>>> information). >>>>>> * (less fundamental) The denormalization introduced by this >>>>>> arrangement, where each identifier is 3-parts and all of them repeat in >>>>>> both lineage info and state map. >>>>>> >>>>>> I am not very concerned with recursive expansion (through multiple >>>>>> calls), as it is always the case with views. >>>>>> >>>>>> On a positive note, looks like we agree to move past sequence numbers >>>>>> :) >>>>>> >>>>>> Thanks, >>>>>> Walaa. >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On Thu, Aug 15, 2024 at 4:07 PM Micah Kornfield < >>>>>> emkornfi...@gmail.com> wrote: >>>>>> >>>>>> I think given the constraint that catalog lookup has to be by >>>>>> identifier and not UUID, I'd prefer using identifier in the refresh >>>>>> state. >>>>>> If we use identifiers, we can directly parallelize the catalog calls to >>>>>> fetch the latest state. If we use UUID, the engine has to go back to the >>>>>> MV and possibly additional views to reconstruct the lineage map. It's >>>>>> just >>>>>> a lot slower and more work for the engine when there is a MV that >>>>>> references a lot of views (and those views reference additional views). >>>>>> >>>>>> >>>>>> I'm +1 on using catalog identifiers as the key. As you point out >>>>>> this is inline with #2 (try to minimize serial catalog lookups) in >>>>>> addition >>>>>> to supporting requirement #3. >>>>>> >>>>>> On Thu, Aug 15, 2024 at 3:27 PM Benny Chow <btc...@gmail.com> wrote: >>>>>> >>>>>> I think given the constraint that catalog lookup has to be by >>>>>> identifier and not UUID, I'd prefer using identifier in the refresh >>>>>> state. >>>>>> If we use identifiers, we can directly parallelize the catalog calls to >>>>>> fetch the latest state. If we use UUID, the engine has to go back to the >>>>>> MV and possibly additional views to reconstruct the lineage map. It's >>>>>> just >>>>>> a lot slower and more work for the engine when there is a MV that >>>>>> references a lot of views (and those views reference additional views). >>>>>> >>>>>> Thanks >>>>>> Benny >>>>>> >>>>>> >>>>>> On Thu, Aug 15, 2024 at 2:14 PM Walaa Eldin Moustafa < >>>>>> wa.moust...@gmail.com> wrote: >>>>>> >>>>>> Thanks Jan, Micah, and Karuppayya for chiming in. >>>>>> >>>>>> I do not think 3 and 4 are at odds with each other (for example >>>>>> maintaining both lineage map and state map through UUID can achieve >>>>>> both). >>>>>> Also, I do not think we can drop the lineage map since in many catalogs, >>>>>> the only lookup method is by the catalog identifier, and not the UUID. >>>>>> >>>>>> I think if we go with UUIDs in the state, we should have a lineage >>>>>> map (from identifiers to UUIDs) to go with it. >>>>>> >>>>>> Thanks, >>>>>> Walaa. >>>>>> >>>>>> >>>>>> On Thu, Aug 15, 2024 at 1:45 PM karuppayya <karuppayya1...@gmail.com> >>>>>> wrote: >>>>>> >>>>>> +1 to storing the refresh state as a map of UUIDs to snapshot IDs, >>>>>> and deferring the inclusion of lineage to a future iteration.(like Micha >>>>>> mentioned) >>>>>> This would greatly simplify the current design. >>>>>> >>>>>> Also in terms of identifiers to use(UUID or catalog identifier) for >>>>>> the refresh state >>>>>> We will not be able to fetch the table/View using the UUID alone, for >>>>>> example from Hive based catalog. >>>>>> We do not have the direct mapping between UUID and table/view. >>>>>> Which leaves us only with the catalog identifiers? >>>>>> >>>>>> Thanks & Regards >>>>>> Karuppayya >>>>>> >>>>>> >>>>>> On Thu, Aug 15, 2024 at 9:16 AM Micah Kornfield < >>>>>> emkornfi...@gmail.com> wrote: >>>>>> >>>>>> I think it might be worth restating perceived requirements and making >>>>>> sure there is alignment on them. >>>>>> >>>>>> If I am reading correctly, I think the following are perceived >>>>>> requirements: >>>>>> 1. An engine must be able to unambiguously detect that an underlying >>>>>> queried entity has changed or not via metadata to decide if materialized >>>>>> table data can be used. >>>>>> 2. The number of sequential catalog reads an engine needs to make to >>>>>> make use of a materialized table state at read time is minimized. >>>>>> 3. Engines that don't understand a SQL dialect can still use MV >>>>>> information if it is not stale. >>>>>> 4. Table refs (catalog identifiers) should not appear in the >>>>>> materialized table metadata (i.e. state). >>>>>> 5. The view part of the MV definition should not need a new revision >>>>>> for any changes to objects it queries as long as their schemas stay >>>>>> compatible (only state information on the materialized table need to >>>>>> change). >>>>>> >>>>>> In my mind, requirement 1, is the only true requirement. I think >>>>>> this necessitates having UUID + snapshot ID as part of the state >>>>>> information (not necessarily part of the Lineage). I think it also >>>>>> necessitates having a denormalized view of all entities that are inputs >>>>>> into the MV in the state information (a view object might not change but >>>>>> its underlying tables or views could change and that must be detected). >>>>>> >>>>>> Requirements 2 and 5 are somewhat at odds with each other. If >>>>>> information is denormalized (fully expanded) in Lineage, it means if >>>>>> table >>>>>> information is somehow dropped from an intermediate view, one would need >>>>>> to >>>>>> update the view (or make excess calls to the catalog). In my mind, this >>>>>> argues for normalization of the lineage stored on the view (with the cost >>>>>> of potentially 1 additional serial catalog lookup once the state >>>>>> information is retrieved). >>>>>> >>>>>> I think #3 is at odds with #4. I think #3 is more worthwhile, then >>>>>> keeping #4 (and as Jan noted #4 adds complexity). >>>>>> >>>>>> I think the last remaining question is if lineage serves any >>>>>> purpose. I think it is useful for the following reasons: >>>>>> a) When there are no intermediate views queried, it allows for fully >>>>>> parallelized lookup calls to the catalog without having to parse the SQL >>>>>> statement first >>>>>> b) Allows tools that don't need to lookup state information or >>>>>> parse SQL but still navigate MV/view trees. >>>>>> >>>>>> Both of these seem relatively minor, so lineage could perhaps be left >>>>>> out in the first iteration. >>>>>> >>>>>> As it applies to Jan's questions: >>>>>> >>>>>> 1. Should we move the identifiers out of the refresh-state into a new >>>>>> lineage record that is stored as part of the view metadata? >>>>>> >>>>>> No, I don't think so, I think #5 is a reasonable requirement and I >>>>>> think this violates it. >>>>>> >>>>>> >>>>>> 2. If yes, should the lineage in the view be fully expanded? >>>>>> >>>>>> No, I think only the state should be fully expanded (for reasons >>>>>> mentioned above, it potentially requires more updates to the view then >>>>>> necessary). >>>>>> >>>>>> >>>>>> 3. What should be used as an identifier in the lineage to reference >>>>>> entries in the refresh-state? >>>>>> >>>>>> >>>>>> Catalog identifiers make sense to me. If we agree requirement #3 is >>>>>> not a requirement then it seems like this could also be UUIDs. >>>>>> >>>>>> Thanks, >>>>>> Micah >>>>>> >>>>>> On Thu, Aug 15, 2024 at 7:57 AM Benny Chow <btc...@gmail.com> wrote: >>>>>> >>>>>> If we go with either UUID or Table Identifier + VersionID/SnapshotId >>>>>> in the refresh state, then this list is fully expanded already. So, to >>>>>> validate the freshness of a materialization, the engine doesn't even need >>>>>> to look at the view lineage. IMO, the view lineage is nice to have but >>>>>> not >>>>>> a necessary requirement for MVs. The view lineage makes sharing of views >>>>>> between engines without common SQL dialects possible. >>>>>> >>>>>> Benny >>>>>> >>>>>> On Thu, Aug 15, 2024 at 12:22 AM Jan Kaul >>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid> wrote: >>>>>> >>>>>> Hi all, >>>>>> >>>>>> I would like to reemphasize the purpose of the refresh-state for >>>>>> materialized views. The purpose is to determine if the precomputed data >>>>>> is >>>>>> fresh, stale or invalid. For that the current snapshot-id of every table >>>>>> in >>>>>> the query tree has to be fetched from the catalog by using its full >>>>>> identifier and ref. Additionally the refresh state stores the snapshot-id >>>>>> of the last refresh. >>>>>> >>>>>> To summarize: *To determine the freshness of the precomputed data we >>>>>> require the full identifier + ref and snapshot-id of the last refresh for >>>>>> every table in the fully expanded query tree* >>>>>> >>>>>> This is a requirement from how the catalog works and independent from >>>>>> how we design the lineage/refresh state. Additionally we previously >>>>>> agreed >>>>>> that we should be able to obtain the full list of identifiers without >>>>>> needing to parse the SQL definition. >>>>>> >>>>>> Now we are having a discussion in how to store and obtain the fully >>>>>> expanded list of table identifiers and snapshot-ids. To move the >>>>>> discussion >>>>>> forward I think it would be valuable to answer the following 3 questions: >>>>>> >>>>>> 1. Should we move the identifiers out of the refresh-state into a new >>>>>> lineage record that is stored as part of the view metadata? >>>>>> >>>>>> 2. If yes, should the lineage in the view be fully expanded? >>>>>> >>>>>> 3. What should be used as an identifier in the lineage to reference >>>>>> entries in the refresh-state? >>>>>> >>>>>> 1. Question: >>>>>> >>>>>> We already agreed that this would be a good idea because we wouldn't >>>>>> introduce the identifier concept to the table metadata. However, looking >>>>>> at >>>>>> the complexity that comes with the alternatives, I would like to keep >>>>>> this >>>>>> question open. >>>>>> >>>>>> 2. Question: >>>>>> >>>>>> I'm against using a not fully expanded lineage in the view struct. To >>>>>> recall we require every identifier in the fully expanded query tree to >>>>>> determine the freshness. Not storing all identifiers in the lineage would >>>>>> mean to recursively call the catalog and expand the query tree at read >>>>>> time. This can lead to a large overhead for determining the refresh state >>>>>> compared to expanding the query tree once at creation time and then >>>>>> storing >>>>>> the fully expanded lineage. >>>>>> >>>>>> 3. Question: >>>>>> >>>>>> This depends on Question 2. >>>>>> >>>>>> For a not fully expanded lineage, the only options would be uuids or >>>>>> catalog identifiers. >>>>>> >>>>>> For a fully expanded lineage the question isn't all that relevant. >>>>>> The current design specifies that the lineage is a map from an identifier >>>>>> to an id and the refresh-state is a map from such id to a snapshot-id. >>>>>> For >>>>>> this to work we don't have to specify which kind of identifier has to be >>>>>> used. One query engine could use uuids, the other engine sequence-ids. >>>>>> The >>>>>> important assumption we are making is that every id that is used in the >>>>>> refresh-state has to be defined in the lineage. >>>>>> So the question about using uuids is rather, can the query engine >>>>>> trust that the id defined in the lineage is the uuid of the table. >>>>>> >>>>>> >>>>>> Regarding the complexity that comes from introducing the lineage in >>>>>> the view I would like to revisit question 1. Introducing the lineage in >>>>>> the >>>>>> view metadata opens up the question of when should the lineage be fully >>>>>> expanded. We see that we have 3 options: >>>>>> >>>>>> 1. Not fully expanded lineage -> Expansion at read time >>>>>> >>>>>> 2. Fully expanded lineage -> Expansion at creation time >>>>>> >>>>>> 3. No lineage (use identifiers in refresh-state) -> Expansion at >>>>>> refresh time >>>>>> >>>>>> As reading is expected to be the most frequent operation I see option >>>>>> 1 as not favorable. As the query engine has to fully expand the query >>>>>> tree >>>>>> for a refresh anyway, I see option 3 as the most natural. For a refresh >>>>>> operation the query engine must understand the SQL dialects of all views >>>>>> in >>>>>> the query tree and therefore is guaranteed to successfully expand the >>>>>> lineage. This might not be the case at creation time, which makes option >>>>>> 2 >>>>>> less favorable. >>>>>> >>>>>> As can be seen, I'm in favor of just storing the refresh-state as a >>>>>> map from identifier to snapshot-id and not using the lineage. I know that >>>>>> this introduces the concept of a catalog identifiers to the table >>>>>> metadata >>>>>> spec, but in my opinion it is by far the simplest option. >>>>>> >>>>>> I'm interested in your opinions. >>>>>> >>>>>> Best wishes, >>>>>> >>>>>> Jan >>>>>> On 14.08.24 22:24, Walaa Eldin Moustafa wrote: >>>>>> >>>>>> Thanks Benny. For refs, I am +1 to represent them as UUID + optional >>>>>> ref, although we can iterate ohe exact JSON structure (e.g., another >>>>>> option >>>>>> is splitting for (UUID) state from (UUID + ref) state into two separate >>>>>> higher-level fields). >>>>>> >>>>>> Generally agree on REFRESH VIEW strategy could be up to the engine, >>>>>> but it seems like an area where Iceberg could have an opinion/spec on. I >>>>>> will start a separate thread for that. >>>>>> >>>>>> Thanks, >>>>>> Walaa. >>>>>> >>>>>> >>>>>>