Hi Szheon, While I fully share your concern of abusing table properties, we took the approach of option 1 and run it in production for several years:
- the feature was still evolving - quick and simple implementation - table properties are simple enough and not confusing - haven't seen an urgent need to convert the properties to metadata fields and add API - do not wish on-disk changes (requiring lengthy tedious migration) That said, I am open to codifying the mv metadata into api and spec, with the following considerations - mv metadata has reached maturity and consensus (could be just a core portion) - when mv metadata becomes too complex - wish to support use cases that are quicker to adopt API changes (than engines), e.g., using Iceberg library to manipulate MVs, or parsing metadata files directly - Spark view catalog API can evolve separately from Iceberg API and spec changes Thanks all for the great discussion! On Fri, May 10, 2024 at 10:48 PM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote: > Hi Szheon, > > Thanks for the follow-up. It is possible some of the concerns were > referring to the backend catalogs, but it is all connected. My main > personal concern is from the engine connector APIs point of view, but I > share the concern about the catalogs. > > I think everyone's concern is not about the complexity* per* backend > catalog/engine catalog API (in which case adding new metadata to > existing objects could require less code), but rather about the *number* of > APIs and implementations that need to change (in which case both new > metadata to existing objects and new objects altogether introduce equal > complexity). > > Thanks, > Walaa. > > > On Fri, May 10, 2024 at 10:31 AM Szehon Ho <szehon.apa...@gmail.com> > wrote: > >> Hi Walaa >> >> OK thanks for confirming. I am still not 100% in agreement, my >> understanding of the rationale for separate Table/View objects in the >> comment that you linked: >> >> I think the biggest problem with this is that we would need to modify >>> every catalog to support this combination and that would be really >>> difficult. >> >> >> is about JavaCatalogs /REST Catalog needing to to support creating , >> persisting, and loading a MaterializedView object, which is much more >> complex. See HiveView PR for example : >> https://github.com/apache/iceberg/pull/9852 We would have to do the >> same exercise for persisting MV. >> >> In our case though, there's not much complexity regardless of approach >> ('properties' or new metadata fields), in terms of Java Catalog/REST >> Catalog. It's mostly pass-through to storage. Looks like you are >> referring to Spark's View model in terms of complexity, which may be a >> different story, but not sure if it is a good rationale to make Iceberg to >> use 'properties' . >> >> 'properties' is for read/write configurations, not to save metadatas. >> To me, its also brittle to save important metadata, as it's not in the >> defined schema. >> >> A string to string map of table properties. This is used to control >>> settings that affect reading and writing and is not intended to be used for >>> arbitrary metadata. For example, commit.retry.num-retries is used to >>> control the number of commit retries. >> >> >> On the other hand, the Draft Spec suggests to save `lineage` as a modeled >> field on the Storage Table's snapshot metadata. This allows you to 'time >> travel', 'branch', and have this metadata life cycle integrated via normal >> snapshots lifecycle operations. >> >> So that's my rationale. Not sure if we can come to an agreement over >> email though, and may need others to chime in as well. >> >> Thanks >> Szehon >> >> >> >> >> On Thu, May 9, 2024 at 11:58 PM Walaa Eldin Moustafa < >> wa.moust...@gmail.com> wrote: >> >>> Hi Szehon, >>> >>> Yes, you are reading the PR correctly, and interpreting the meaning of >>> properties correctly. I think the reply you pasted from Ryan refers to the >>> same concept as well. >>> >>> For the initial Google doc and the issue (by the way it is an issue, not >>> a PR), yes both are proposing new metadata fields. >>> >>> The references I made to the modeling doc [1, 2] are reasons why new >>> APIs are not desired. The cons/concerns applicable to new MV metadata apply >>> by extension to new table and view metadata fields. >>> >>> The reason why new metadata adds complexity is that this new metadata >>> needs to be propagated to the engine API. For example, here is the ViewInfo >>> [3] class in the Spark catalog, which is used in view methods like >>> createView. Its fields correspond with the Iceberg metadata. Adding new >>> Iceberg fields should be accompanied with new fields in the engine >>> catalog/connector APIs, which was a major reason for rejecting the combined >>> MV object model as well. >>> >>> [1] >>> https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1&disco=AAABK7e3QB4 >>> [2] >>> https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1&disco=AAABIonvCGE >>> [3] >>> https://github.com/apache/spark/blob/2df494fd4e4e64b9357307fb0c5e8fc1b7491ac3/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/ViewInfo.java#L45 >>> >>> Thanks, >>> Walaa. >>> >>> On Thu, May 9, 2024 at 11:30 PM Szehon Ho <szehon.apa...@gmail.com> >>> wrote: >>> >>>> Hi Walaa >>>> >>>> As there may be confusion in the word 'properties', I want to double >>>> check if we are talking about the same thing here. >>>> >>>> I am reading your PR as adding lineage metadata as new key/value pair >>>> under the storage Table's 'properties' field: >>>> https://github.com/apache/iceberg/blob/main/format/spec.md?plain=1#L677 >>>> >>>> *optional* *optional* *properties* A string to string map of table >>>> properties. This is used to control settings that affect reading and >>>> writing and is not intended to be used for arbitrary metadata. For example, >>>> commit.retry.num-retries is used to control the number of commit >>>> retries. >>>> and adding Storage Table pointer as key/value pair in the View's >>>> 'properties' field: >>>> https://github.com/apache/iceberg/blob/main/format/view-spec.md?plain=1#L65 >>>> >>>> *optional* properties A string to string map of view properties [2] >>>> Is that correct? >>>> >>>> On the other hand, I was talking about adding this metadata as actual >>>> fields, as is described in the Draft Spec of the Design Doc >>>> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A >>>> and >>>> first PR https://github.com/apache/iceberg/issues/6420 . >>>> >>>> Do you mean, the vote means we cannot model new fields like >>>> 'materialization' and 'lineage' as was proposed there ? If that is the >>>> interpretation, I am not sure I agree. I dont fully see how new fields >>>> adds more catalog implementation complexity over new key/value properties? >>>> To me, the vote seemed to just rule out using a combined catalog object >>>> (MaterializedView) in favor of re-using the Table and View metadata models, >>>> not to prevent change to the Table and View model. >>>> >>>> Thanks >>>> Szehon >>>> >>>> >>>> On Thu, May 9, 2024 at 10:05 PM Walaa Eldin Moustafa < >>>> wa.moust...@gmail.com> wrote: >>>> >>>>> Hi Szehon, >>>>> >>>>> I think choosing separate view + table objects precludes us from >>>>> adding new metadata to table and view metadata. Here is one relevant >>>>> comment [1] from Ryan on the modeling doc, where his point is that we want >>>>> to avoid introducing new APIs since it requires updating every catalog, >>>>> and >>>>> (quoting) even now, we have few implementations that support views because >>>>> of the problems updating back ends. Therefore, one of the major reasons to >>>>> avoid a new model with new metadata is to avoid adding new metadata, which >>>>> introduces this complexity. Here is another similar comment from Renjie >>>>> [2] >>>>> on the cons listed for the combined object approach. >>>>> >>>>> Even Ryan's point on the MV issue that you referenced reads to me as >>>>> he is supportive of the property model. Here are some quotes: >>>>> >>>>> > We would still want some MV metadata in table *properties*. >>>>> >>>>> > I recommend instead reusing the existing snapshot metadata structure >>>>> to store what you need as snapshot *properties*. >>>>> >>>>> > First, I think we want to avoid keeping much state information in >>>>> complex table *properties*. >>>>> >>>>> Again, here, he is supportive of table properties, but wants to make >>>>> sure that the information is simple. >>>>> >>>>> > We may want additional metadata as well, like a UUID to ensure we >>>>> have the right view. I don't think we have a UUID in the view spec yet, >>>>> but >>>>> we could add one. >>>>> >>>>> Here, he is very specific when it comes to new metadata fields, and >>>>> explicitly calls it out. That is the only new metadata field in that reply >>>>> and by now it is already supported. It is also not MV-specific. >>>>> >>>>> Hope this addresses your question on the property vs new metadata >>>>> model. >>>>> >>>>> [1] >>>>> https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1&disco=AAABK7e3QB4 >>>>> [2] >>>>> https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1&disco=AAABIonvCGE >>>>> >>>>> Thanks, >>>>> Walaa. >>>>> >>>>> >>>>> On Thu, May 9, 2024 at 5:49 PM Szehon Ho <szehon.apa...@gmail.com> >>>>> wrote: >>>>> >>>>>> Hi Walaa, >>>>>> >>>>>> I agree, I definitely do not want yet another pr/doc where discussion >>>>>> happens. as its already quite spread out :) But did not want to clarify >>>>>> some points before we get started on the discussion on your PR. >>>>>> >>>>>> With reusing the table and view objects, we are not changing the >>>>>>> existing metadata of either table or view spec but rather introduce new >>>>>>> properties and formalize them to express materialized views >>>>>>> >>>>>> >>>>>> On this point, I am not 100% sure that choosing to represent a >>>>>> MaterializedView as a separate View + Table object precludes us from >>>>>> adding >>>>>> to metadata of Table or View as the Draft Spec suggested, though. I >>>>>> think >>>>>> this point was discussed in Jan's initial PR with a good point from Ryan: >>>>>> https://github.com/apache/iceberg/issues/6420#issuecomment-1369280546 >>>>>> that >>>>>> using Table Properties to track lineage is fairly brittle, and having it >>>>>> formalized in the Iceberg metadata is cleaner, and that was thus the >>>>>> direction of the Draft Spec in the design doc. What do people think? >>>>>> >>>>>> Thanks >>>>>> Szehon >>>>>> >>>>>> >>>>>> >>>>>> On Thu, May 9, 2024 at 5:35 PM Walaa Eldin Moustafa < >>>>>> wa.moust...@gmail.com> wrote: >>>>>> >>>>>>> Thanks Szehon. >>>>>>> >>>>>>> The reason for the difference is that the proposal in the Google doc >>>>>>> is based on a new MV model, hence, new metadata fields and a new >>>>>>> metadata >>>>>>> model were being introduced (with types, optionality, etc). With reusing >>>>>>> the table and view objects, we are not changing the existing metadata of >>>>>>> either table or view spec but rather introduce new properties and >>>>>>> formalize >>>>>>> them to express materialized views. This would be the answer to most of >>>>>>> the >>>>>>> questions you posted on the PR (besides some naming questions, which I >>>>>>> think should be straightforward). >>>>>>> >>>>>>> With that fundamental difference, we cannot lift and shift what is >>>>>>> in the doc to any PR. Further, having consensus on separate table and >>>>>>> view >>>>>>> objects contradicts with the point being made on having consensus on the >>>>>>> doc. We might have had agreements on some elements, but definitely not >>>>>>> on >>>>>>> the whole doc, proven by the follow ups (also as a community, not >>>>>>> individuals). >>>>>>> >>>>>>> Therefore: we need a new space to discuss the separate table and >>>>>>> view properties. >>>>>>> >>>>>>> Is the question whether to: >>>>>>> 1- Create a new doc >>>>>>> 2- Create a new PR? >>>>>>> >>>>>>> I feel a PR is the most effective way, especially given the fact >>>>>>> that we discussed the topic a lot by now. If we agree, we can continue >>>>>>> the >>>>>>> discussion on the PR, else, we can create a doc. >>>>>>> >>>>>>> Thanks, >>>>>>> Walaa. >>>>>>> >>>>>>> >>>>>>> On Thu, May 9, 2024 at 4:39 PM Szehon Ho <szehon.apa...@gmail.com> >>>>>>> wrote: >>>>>>> >>>>>>>> Thanks Walaa for driving it forward, looking forward to thinking >>>>>>>> about implementation of Materialized Views. >>>>>>>> >>>>>>>> I see Jan's point, the PR spec change is similar but does not seem >>>>>>>> to be completely aligned with the Draft Spec in the design doc: >>>>>>>> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/ >>>>>>>> . I left my comments on PR of those sections with the links to the >>>>>>>> difference. I think most of those Draft Spec proposal is still >>>>>>>> applicable >>>>>>>> after the decision to have separate Table and View objects It will be >>>>>>>> interesting to at least see drill a bit further why we did not choose >>>>>>>> the >>>>>>>> approach in the Draft Spec and chose another way. >>>>>>>> >>>>>>>> Thanks >>>>>>>> Szehon >>>>>>>> >>>>>>>> On Wed, May 8, 2024 at 4:48 AM Jan Kaul <jank...@mailbox.org.invalid> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Well, everybody that actively contributed to the discussion on the >>>>>>>>> original google doc was in consensus. That's why I brought up the >>>>>>>>> topic at >>>>>>>>> the Community Sync on the 2024-02-14 ( >>>>>>>>> https://youtu.be/uAQVGd5zV4I?t=890) to raise the awareness of the >>>>>>>>> broader community. After which the discussion about the storage model >>>>>>>>> started. I don't think that the discussion about a single aspect of a >>>>>>>>> proposal should invalidate all other aspects of the proposal. >>>>>>>>> >>>>>>>>> Regardless, the state of the proposal from the original google doc >>>>>>>>> contains a lot of valuable contributions from Micah, Szehon, Jack, >>>>>>>>> Dan, >>>>>>>>> yourself and others and it should at least provide the basis for any >>>>>>>>> further discussion. I don't think it's effective to start with a >>>>>>>>> completely >>>>>>>>> different design because we are bound to have the same discussions >>>>>>>>> all over >>>>>>>>> again. >>>>>>>>> >>>>>>>>> Thanks, Jan >>>>>>>>> On 08.05.24 12:11, Walaa Eldin Moustafa wrote: >>>>>>>>> >>>>>>>>> The only consensus the community had was on the object model >>>>>>>>> through the most recent voting thread [1]. This kind of consensus was >>>>>>>>> not >>>>>>>>> present during the doc discussions, and this should be evident from >>>>>>>>> the >>>>>>>>> fact the last doc state listed 5 alternatives with no particular >>>>>>>>> conclusion. I am not quite sure what type of consensus we are >>>>>>>>> referring to >>>>>>>>> here given all the follow up discussions, alternatives, etc. >>>>>>>>> >>>>>>>>> Due to the separate object model, the PR is fundamentally >>>>>>>>> different from the doc in the sense it does not propose a new metadata >>>>>>>>> model but rather formalizes some new table and view properties >>>>>>>>> related to >>>>>>>>> MVs. That is also one reason there are no repeated discussions. That >>>>>>>>> said, >>>>>>>>> if you feel there is a repeated discussion (which I do not see so >>>>>>>>> far), it >>>>>>>>> would be best to link the relevant discussion from the doc in a >>>>>>>>> comment. >>>>>>>>> >>>>>>>>> Happy to move the discussion elsewhere if there is >>>>>>>>> sufficient support for this idea, but as things stand, I do not see >>>>>>>>> this as >>>>>>>>> an efficient way to make progress. It sounds we have been >>>>>>>>> re-emphasizing >>>>>>>>> the same points in the last two replies, so I will let others chime >>>>>>>>> in at >>>>>>>>> this point. >>>>>>>>> >>>>>>>>> [1] >>>>>>>>> https://lists.apache.org/thread/rotmqzmwk5jrcsyxhzjhrvcjs5v3yjcc >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> Walaa. >>>>>>>>> >>>>>>>>> >>>>>>>>> On Wed, May 8, 2024 at 2:31 AM Jan Kaul >>>>>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid> wrote: >>>>>>>>> >>>>>>>>>> The original google doc >>>>>>>>>> <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?usp=sharing> >>>>>>>>>> discussed multiple aspects of the Materialized View spec. One was the >>>>>>>>>> storage model while others were related to the metadata. After we >>>>>>>>>> (Micah, >>>>>>>>>> Szehon, you, me) reached consensus in the google doc, Jack raised his >>>>>>>>>> concern about the storage model and the long discussion about the >>>>>>>>>> storage >>>>>>>>>> model started. Now we truly reached consensus about the storage >>>>>>>>>> model, >>>>>>>>>> which is now also reflected in the google doc. All other aspects >>>>>>>>>> from the >>>>>>>>>> google doc about the metadata weren't questioned and still represent >>>>>>>>>> the >>>>>>>>>> consensus. >>>>>>>>>> >>>>>>>>>> I would like to *avoid repeating the discussions* in your PR >>>>>>>>>> that we already had in the google doc. Especially since we reached >>>>>>>>>> consensus which took a considerable amount of time. >>>>>>>>>> >>>>>>>>>> Thanks, Jan >>>>>>>>>> On 08.05.24 10:21, Walaa Eldin Moustafa wrote: >>>>>>>>>> >>>>>>>>>> Thanks Jan. I think we moved on to more alignment steps beyond >>>>>>>>>> that doc a while ago. After that doc, we have discussed the topic >>>>>>>>>> further >>>>>>>>>> in 2 dev list threads and one more doc >>>>>>>>>> <https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1> >>>>>>>>>> (with strictly two options for the storage model to consider). >>>>>>>>>> Moreover, >>>>>>>>>> the original doc grew to 14 pages long with one section comparing 5 >>>>>>>>>> design >>>>>>>>>> alternatives, which made things harder to reach consensus. The lack >>>>>>>>>> of >>>>>>>>>> consensus is what partly led up to the subsequent discussions and >>>>>>>>>> call for >>>>>>>>>> a more focused approach to reach consensus. If we already have a >>>>>>>>>> consensus >>>>>>>>>> on the storage model (separate tables and views), I think we should >>>>>>>>>> take >>>>>>>>>> things further and have continued focused discussions on the specific >>>>>>>>>> metadata in the form of a PR. I have included all previous >>>>>>>>>> discussions >>>>>>>>>> including the original doc and issue as references in the PR >>>>>>>>>> description. >>>>>>>>>> Please let me know if this works. Happy to hear others' thoughts on >>>>>>>>>> the >>>>>>>>>> best way to move forward. >>>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> Walaa. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Wed, May 8, 2024 at 12:56 AM Jan Kaul >>>>>>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Thanks Walaa for trying to move things along. However I don't >>>>>>>>>>> think it's a good idea to start a separate discussion about the >>>>>>>>>>> metadata >>>>>>>>>>> for materialized views because we already had this discussion and >>>>>>>>>>> reached >>>>>>>>>>> consensus in this google doc: >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?usp=sharing >>>>>>>>>>> >>>>>>>>>>> Once the draft is finalized we can adopt the PR to reflect the >>>>>>>>>>> consensus from the google doc. >>>>>>>>>>> >>>>>>>>>>> Best wishes, >>>>>>>>>>> >>>>>>>>>>> Jan >>>>>>>>>>> On 07.05.24 19:11, Walaa Eldin Moustafa wrote: >>>>>>>>>>> >>>>>>>>>>> Thanks Steven. I feel it is needed so the MV spec is not >>>>>>>>>>> scattered across the table and view spec pages. We may add a >>>>>>>>>>> reference in >>>>>>>>>>> each respective properties section. >>>>>>>>>>> >>>>>>>>>>> On Tue, May 7, 2024 at 10:04 AM Steven Wu <stevenz...@gmail.com> >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> Walaa, thanks for initiating the next step. >>>>>>>>>>>> >>>>>>>>>>>> With the agreed model of separate view and storage table, I am >>>>>>>>>>>> wondering if a separate materialized view spec page is needed. >>>>>>>>>>>> E.g., the >>>>>>>>>>>> new view metadata (view-materialized and view-storage-table) is >>>>>>>>>>>> probably >>>>>>>>>>>> good to be added to the view page directly to avoid information >>>>>>>>>>>> scattering. >>>>>>>>>>>> The same can be said about the storage table metadata. >>>>>>>>>>>> >>>>>>>>>>>> We may keep the separate materialized view page to document >>>>>>>>>>>> motivation, freshness semantics, etc.. >>>>>>>>>>>> >>>>>>>>>>>> On Mon, May 6, 2024 at 10:58 PM Walaa Eldin Moustafa < >>>>>>>>>>>> wa.moust...@gmail.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hi Everyone, >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks again for participating in the modeling discussion [1]. >>>>>>>>>>>>> Since the outcome of this discussion was to model materialized >>>>>>>>>>>>> views as >>>>>>>>>>>>> separate objects, an Iceberg view and a table, I think the next >>>>>>>>>>>>> step >>>>>>>>>>>>> should be discussing the metadata details for each object. I have >>>>>>>>>>>>> created a >>>>>>>>>>>>> PR https://github.com/apache/iceberg/pull/10280 with an >>>>>>>>>>>>> initial spec improvement. Please feel free to review it and leave >>>>>>>>>>>>> feedback >>>>>>>>>>>>> there. >>>>>>>>>>>>> >>>>>>>>>>>>> [1] >>>>>>>>>>>>> https://lists.apache.org/thread/rotmqzmwk5jrcsyxhzjhrvcjs5v3yjcc >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks, >>>>>>>>>>>>> Walaa. >>>>>>>>>>>>> >>>>>>>>>>>>> -- John Zhuge