Hi Walaa I left comments in your spec PR: https://github.com/apache/iceberg/pull/10280#pullrequestreview-2061922169 My last question about use cases was really about incremental refresh with aggregates. But I think this might be too complicated to try to model/discuss now and so I agree with Micah's comment about doing it in a future iteration.
Hi Jan, Regarding storing the identifiers, I like the idea too. Dremio's query engine supports MVs on sources besides Iceberg tables. Here's everything that's in a single lineage entry: https://github.com/dremio/dremio-oss/blob/master/services/accelerator/src/main/protobuf/reflection.proto#L80 The lineage is stored as a graph and not a list of entries. I think for what we are trying to achieve, it's more practical to limit the lineage to Iceberg sources. Thanks Benny On Wed, May 15, 2024 at 12:06 AM Jan Kaul <jank...@mailbox.org.invalid> wrote: > I agree with Szehon and Benny that storing the lineage information as > multiple table properties is too brittle, especially for many source tables > (base tables). I would prefer to have the whole lineage information in one > entry as it is more concise. This is also how Trino has been doing it, as > you can see here > <https://github.com/trinodb/trino/blob/212455d3e1d393f58cbc395d2b9da47ed8f23dd8/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java#L2915> > . > > As we've discussed in the google doc > <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit#heading=h.60qmzug7bzxc>: > it is helpful to also store the table identifiers of the source tables to > enable clients to determine the freshness of the MV that don't understand > the SQL dialect of the MV definition, like other query engines, BI tools > and Dataframe libraries. This is also how Trino is doing it. That's why we > chose the design in the google doc. > > Storing the storage table identifier as a property might work. > > Thanks, Jan > On 15.05.24 02:38, Walaa Eldin Moustafa wrote: > > Thanks Benny. My specific thoughts about the spec and the properties are > captured in the spec PR https://github.com/apache/iceberg/pull/10280. The > spec is also implemented in the Spark implementation PR > https://github.com/apache/iceberg/pull/9830, and I believe this follows > the same nature of how the information was captured in Netflix's > implementation with Spark, and Trino implementation (prior to formalizing > through that spec), both of which have been used reliably for years. I > think it also aligns with Ryan's feedback here > https://github.com/apache/iceberg/issues/6420#issuecomment-1369280546 which > indicated the usage of properties. > > The reasons for choosing properties: > * Not every table is a storage table and not every view is a materialized > view. I feel exposing the info as top level metadata is an overkill for the > original object type. > * The properties are simple. They contain either single snapshot ID each, > or single view version each, or lastly, the storage table identifier. > Engines can use them without issues (also as shown in the implementation). > * To be meaningful, the metadata fields should be captured in the engine > API as well, which is an effort that has to be pursued outside the Iceberg > community. Until engine APIs evolve, we would have to define a mapping > between Iceberg metadata fields and engine properties (only current place > in engine side to capture this info). This requires an additional spec on > its own, and it will introduce complexities. Hence it is always cleaner to > map Iceberg properties to engine properties and Iceberg metadata to > designated engine APIs. Mixing and matching (e.g., Iceberg metadata fields > as engine properties) just for the lack of other cleaner options does not > sound like a good idea in both short and long term. > > Let me know your thoughts. > > Thanks, > Walaa. > > > > On Tue, May 14, 2024 at 5:12 PM Benny Chow <btc...@gmail.com> wrote: > >> I agree with Szheon here. I think storing the materialization lineage as >> a bunch of properties is brittle. This lineage information is needed by >> engines to validate the staleness of a materialization and also to perform >> full or incremental refreshes. There’s a lot to capture here. >> >> Maybe we should drill down into the use cases first - such as incremental >> refresh with aggregates? (Pick a harder one first 😀) >> >> I left a few comments about this in the doc. I wonder what are your >> thoughts here Walaa? >> >> Thanks >> >> On May 14, 2024, at 4:20 PM, Walaa Eldin Moustafa <wa.moust...@gmail.com> >> wrote: >> >> >> Thanks John. The current metadata does not sound complex. We need to >> track the underlying table snapshot IDs as well as the view version ID. I >> agree as long as it is simple and before this feature fully matures, we >> should track it in properties. >> >> One important factor for me (apart from the API effort, especially on the >> engine side), is that not each table is an MV storage table. Surfacing >> MV-specific storage table properties as first class table metadata sounds >> to impose this metadata on every table, when it is not required for normal >> table operation (yes, they can be optional, but it does not sound like the >> use case warrants exposing them as metadata fields yet). >> >> Similarly, since not every view is a materialized view, it sounds >> reasonable to keep MV-specific data in properties. >> >> UUID (for views), on the other hand, is common to all views, hence it >> made sense to add it as a top level field. >> >> Thanks, >> Walaa. >> >> >> On Tue, May 14, 2024 at 1:01 PM John Zhuge <jzh...@apache.org> wrote: >> >>> Hi Szheon, >>> >>> While I fully share your concern of abusing table properties, we took >>> the approach of option 1 and run it in production for several years: >>> >>> - the feature was still evolving >>> - quick and simple implementation >>> - table properties are simple enough and not confusing >>> - haven't seen an urgent need to convert the properties to metadata >>> fields and add API >>> - do not wish on-disk changes (requiring lengthy tedious migration) >>> >>> >>> That said, I am open to codifying the mv metadata into api and spec, >>> with the following considerations >>> >>> - mv metadata has reached maturity and consensus (could be just a >>> core portion) >>> - when mv metadata becomes too complex >>> - wish to support use cases that are quicker to adopt API changes >>> (than engines), e.g., using Iceberg library to manipulate MVs, or parsing >>> metadata files directly >>> - Spark view catalog API can evolve separately from Iceberg API and >>> spec changes >>> >>> >>> Thanks all for the great discussion! >>> >>> On Fri, May 10, 2024 at 10:48 PM Walaa Eldin Moustafa < >>> wa.moust...@gmail.com> wrote: >>> >>>> Hi Szheon, >>>> >>>> Thanks for the follow-up. It is possible some of the concerns were >>>> referring to the backend catalogs, but it is all connected. My main >>>> personal concern is from the engine connector APIs point of view, but I >>>> share the concern about the catalogs. >>>> >>>> I think everyone's concern is not about the complexity* per* backend >>>> catalog/engine catalog API (in which case adding new metadata to >>>> existing objects could require less code), but rather about the >>>> *number* of APIs and implementations that need to change (in >>>> which case both new metadata to existing objects and new objects altogether >>>> introduce equal complexity). >>>> >>>> Thanks, >>>> Walaa. >>>> >>>> >>>> On Fri, May 10, 2024 at 10:31 AM Szehon Ho <szehon.apa...@gmail.com> >>>> wrote: >>>> >>>>> Hi Walaa >>>>> >>>>> OK thanks for confirming. I am still not 100% in agreement, my >>>>> understanding of the rationale for separate Table/View objects in the >>>>> comment that you linked: >>>>> >>>>> I think the biggest problem with this is that we would need to modify >>>>>> every catalog to support this combination and that would be really >>>>>> difficult. >>>>> >>>>> >>>>> is about JavaCatalogs /REST Catalog needing to to support creating , >>>>> persisting, and loading a MaterializedView object, which is much more >>>>> complex. See HiveView PR for example : >>>>> https://github.com/apache/iceberg/pull/9852 We would have to do the >>>>> same exercise for persisting MV. >>>>> >>>>> In our case though, there's not much complexity regardless of approach >>>>> ('properties' or new metadata fields), in terms of Java Catalog/REST >>>>> Catalog. It's mostly pass-through to storage. Looks like you are >>>>> referring to Spark's View model in terms of complexity, which may be a >>>>> different story, but not sure if it is a good rationale to make Iceberg to >>>>> use 'properties' . >>>>> >>>>> 'properties' is for read/write configurations, not to save >>>>> metadatas. To me, its also brittle to save important metadata, as it's >>>>> not >>>>> in the defined schema. >>>>> >>>>> A string to string map of table properties. This is used to control >>>>>> settings that affect reading and writing and is not intended to be used >>>>>> for >>>>>> arbitrary metadata. For example, commit.retry.num-retries is used >>>>>> to control the number of commit retries. >>>>> >>>>> >>>>> On the other hand, the Draft Spec suggests to save `lineage` as a >>>>> modeled field on the Storage Table's snapshot metadata. This allows you >>>>> to >>>>> 'time travel', 'branch', and have this metadata life cycle integrated via >>>>> normal snapshots lifecycle operations. >>>>> >>>>> So that's my rationale. Not sure if we can come to an agreement over >>>>> email though, and may need others to chime in as well. >>>>> >>>>> Thanks >>>>> Szehon >>>>> >>>>> >>>>> >>>>> >>>>> On Thu, May 9, 2024 at 11:58 PM Walaa Eldin Moustafa < >>>>> wa.moust...@gmail.com> wrote: >>>>> >>>>>> Hi Szehon, >>>>>> >>>>>> Yes, you are reading the PR correctly, and interpreting the meaning >>>>>> of properties correctly. I think the reply you pasted from Ryan refers to >>>>>> the same concept as well. >>>>>> >>>>>> For the initial Google doc and the issue (by the way it is an issue, >>>>>> not a PR), yes both are proposing new metadata fields. >>>>>> >>>>>> The references I made to the modeling doc [1, 2] are reasons why new >>>>>> APIs are not desired. The cons/concerns applicable to new MV metadata >>>>>> apply >>>>>> by extension to new table and view metadata fields. >>>>>> >>>>>> The reason why new metadata adds complexity is that this new metadata >>>>>> needs to be propagated to the engine API. For example, here is the >>>>>> ViewInfo >>>>>> [3] class in the Spark catalog, which is used in view methods like >>>>>> createView. Its fields correspond with the Iceberg metadata. Adding new >>>>>> Iceberg fields should be accompanied with new fields in the engine >>>>>> catalog/connector APIs, which was a major reason for rejecting the >>>>>> combined >>>>>> MV object model as well. >>>>>> >>>>>> [1] >>>>>> https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1&disco=AAABK7e3QB4 >>>>>> [2] >>>>>> https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1&disco=AAABIonvCGE >>>>>> [3] >>>>>> https://github.com/apache/spark/blob/2df494fd4e4e64b9357307fb0c5e8fc1b7491ac3/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/ViewInfo.java#L45 >>>>>> >>>>>> Thanks, >>>>>> Walaa. >>>>>> >>>>>> On Thu, May 9, 2024 at 11:30 PM Szehon Ho <szehon.apa...@gmail.com> >>>>>> wrote: >>>>>> >>>>>>> Hi Walaa >>>>>>> >>>>>>> As there may be confusion in the word 'properties', I want to double >>>>>>> check if we are talking about the same thing here. >>>>>>> >>>>>>> I am reading your PR as adding lineage metadata as new key/value >>>>>>> pair under the storage Table's 'properties' field: >>>>>>> https://github.com/apache/iceberg/blob/main/format/spec.md?plain=1#L677 >>>>>>> >>>>>>> *optional* *optional* *properties* A string to string map of table >>>>>>> properties. This is used to control settings that affect reading and >>>>>>> writing and is not intended to be used for arbitrary metadata. For >>>>>>> example, >>>>>>> commit.retry.num-retries is used to control the number of commit >>>>>>> retries. >>>>>>> and adding Storage Table pointer as key/value pair in the View's >>>>>>> 'properties' field: >>>>>>> https://github.com/apache/iceberg/blob/main/format/view-spec.md?plain=1#L65 >>>>>>> >>>>>>> *optional* properties A string to string map of view properties [2] >>>>>>> Is that correct? >>>>>>> >>>>>>> On the other hand, I was talking about adding this metadata as >>>>>>> actual fields, as is described in the Draft Spec of the Design Doc >>>>>>> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A >>>>>>> and >>>>>>> first PR https://github.com/apache/iceberg/issues/6420 . >>>>>>> >>>>>>> Do you mean, the vote means we cannot model new fields like >>>>>>> 'materialization' and 'lineage' as was proposed there ? If that is >>>>>>> the >>>>>>> interpretation, I am not sure I agree. I dont fully see how new fields >>>>>>> adds more catalog implementation complexity over new key/value >>>>>>> properties? >>>>>>> To me, the vote seemed to just rule out using a combined catalog object >>>>>>> (MaterializedView) in favor of re-using the Table and View metadata >>>>>>> models, >>>>>>> not to prevent change to the Table and View model. >>>>>>> >>>>>>> Thanks >>>>>>> Szehon >>>>>>> >>>>>>> >>>>>>> On Thu, May 9, 2024 at 10:05 PM Walaa Eldin Moustafa < >>>>>>> wa.moust...@gmail.com> wrote: >>>>>>> >>>>>>>> Hi Szehon, >>>>>>>> >>>>>>>> I think choosing separate view + table objects precludes us from >>>>>>>> adding new metadata to table and view metadata. Here is one relevant >>>>>>>> comment [1] from Ryan on the modeling doc, where his point is that we >>>>>>>> want >>>>>>>> to avoid introducing new APIs since it requires updating every >>>>>>>> catalog, and >>>>>>>> (quoting) even now, we have few implementations that support views >>>>>>>> because >>>>>>>> of the problems updating back ends. Therefore, one of the major >>>>>>>> reasons to >>>>>>>> avoid a new model with new metadata is to avoid adding new metadata, >>>>>>>> which >>>>>>>> introduces this complexity. Here is another similar comment from >>>>>>>> Renjie [2] >>>>>>>> on the cons listed for the combined object approach. >>>>>>>> >>>>>>>> Even Ryan's point on the MV issue that you referenced reads to me >>>>>>>> as he is supportive of the property model. Here are some quotes: >>>>>>>> >>>>>>>> > We would still want some MV metadata in table *properties*. >>>>>>>> >>>>>>>> > I recommend instead reusing the existing snapshot metadata >>>>>>>> structure to store what you need as snapshot *properties*. >>>>>>>> >>>>>>>> > First, I think we want to avoid keeping much state information in >>>>>>>> complex table *properties*. >>>>>>>> >>>>>>>> Again, here, he is supportive of table properties, but wants to >>>>>>>> make sure that the information is simple. >>>>>>>> >>>>>>>> > We may want additional metadata as well, like a UUID to ensure we >>>>>>>> have the right view. I don't think we have a UUID in the view spec >>>>>>>> yet, but >>>>>>>> we could add one. >>>>>>>> >>>>>>>> Here, he is very specific when it comes to new metadata fields, and >>>>>>>> explicitly calls it out. That is the only new metadata field in that >>>>>>>> reply >>>>>>>> and by now it is already supported. It is also not MV-specific. >>>>>>>> >>>>>>>> Hope this addresses your question on the property vs new metadata >>>>>>>> model. >>>>>>>> >>>>>>>> [1] >>>>>>>> https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1&disco=AAABK7e3QB4 >>>>>>>> [2] >>>>>>>> https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1&disco=AAABIonvCGE >>>>>>>> >>>>>>>> Thanks, >>>>>>>> Walaa. >>>>>>>> >>>>>>>> >>>>>>>> On Thu, May 9, 2024 at 5:49 PM Szehon Ho <szehon.apa...@gmail.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Hi Walaa, >>>>>>>>> >>>>>>>>> I agree, I definitely do not want yet another pr/doc where >>>>>>>>> discussion happens. as its already quite spread out :) But did not >>>>>>>>> want to >>>>>>>>> clarify some points before we get started on the discussion on your >>>>>>>>> PR. >>>>>>>>> >>>>>>>>> With reusing the table and view objects, we are not changing the >>>>>>>>>> existing metadata of either table or view spec but rather introduce >>>>>>>>>> new >>>>>>>>>> properties and formalize them to express materialized views >>>>>>>>>> >>>>>>>>> >>>>>>>>> On this point, I am not 100% sure that choosing to represent a >>>>>>>>> MaterializedView as a separate View + Table object precludes us from >>>>>>>>> adding >>>>>>>>> to metadata of Table or View as the Draft Spec suggested, though. I >>>>>>>>> think >>>>>>>>> this point was discussed in Jan's initial PR with a good point from >>>>>>>>> Ryan: >>>>>>>>> https://github.com/apache/iceberg/issues/6420#issuecomment-1369280546 >>>>>>>>> that >>>>>>>>> using Table Properties to track lineage is fairly brittle, and having >>>>>>>>> it >>>>>>>>> formalized in the Iceberg metadata is cleaner, and that was thus the >>>>>>>>> direction of the Draft Spec in the design doc. What do people think? >>>>>>>>> >>>>>>>>> Thanks >>>>>>>>> Szehon >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Thu, May 9, 2024 at 5:35 PM Walaa Eldin Moustafa < >>>>>>>>> wa.moust...@gmail.com> wrote: >>>>>>>>> >>>>>>>>>> Thanks Szehon. >>>>>>>>>> >>>>>>>>>> The reason for the difference is that the proposal in the Google >>>>>>>>>> doc is based on a new MV model, hence, new metadata fields and a new >>>>>>>>>> metadata model were being introduced (with types, optionality, etc). >>>>>>>>>> With >>>>>>>>>> reusing the table and view objects, we are not changing the existing >>>>>>>>>> metadata of either table or view spec but rather introduce new >>>>>>>>>> properties >>>>>>>>>> and formalize them to express materialized views. This would be the >>>>>>>>>> answer >>>>>>>>>> to most of the questions you posted on the PR (besides some naming >>>>>>>>>> questions, which I think should be straightforward). >>>>>>>>>> >>>>>>>>>> With that fundamental difference, we cannot lift and shift what >>>>>>>>>> is in the doc to any PR. Further, having consensus on separate table >>>>>>>>>> and >>>>>>>>>> view objects contradicts with the point being made on having >>>>>>>>>> consensus on >>>>>>>>>> the doc. We might have had agreements on some elements, but >>>>>>>>>> definitely not >>>>>>>>>> on the whole doc, proven by the follow ups (also as a community, not >>>>>>>>>> individuals). >>>>>>>>>> >>>>>>>>>> Therefore: we need a new space to discuss the separate table and >>>>>>>>>> view properties. >>>>>>>>>> >>>>>>>>>> Is the question whether to: >>>>>>>>>> 1- Create a new doc >>>>>>>>>> 2- Create a new PR? >>>>>>>>>> >>>>>>>>>> I feel a PR is the most effective way, especially given the fact >>>>>>>>>> that we discussed the topic a lot by now. If we agree, we can >>>>>>>>>> continue the >>>>>>>>>> discussion on the PR, else, we can create a doc. >>>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> Walaa. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Thu, May 9, 2024 at 4:39 PM Szehon Ho <szehon.apa...@gmail.com> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Thanks Walaa for driving it forward, looking forward to thinking >>>>>>>>>>> about implementation of Materialized Views. >>>>>>>>>>> >>>>>>>>>>> I see Jan's point, the PR spec change is similar but does not >>>>>>>>>>> seem to be completely aligned with the Draft Spec in the design doc: >>>>>>>>>>> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/ >>>>>>>>>>> . I left my comments on PR of those sections with the links to the >>>>>>>>>>> difference. I think most of those Draft Spec proposal is still >>>>>>>>>>> applicable >>>>>>>>>>> after the decision to have separate Table and View objects It will >>>>>>>>>>> be >>>>>>>>>>> interesting to at least see drill a bit further why we did not >>>>>>>>>>> choose the >>>>>>>>>>> approach in the Draft Spec and chose another way. >>>>>>>>>>> >>>>>>>>>>> Thanks >>>>>>>>>>> Szehon >>>>>>>>>>> >>>>>>>>>>> On Wed, May 8, 2024 at 4:48 AM Jan Kaul >>>>>>>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid> >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> Well, everybody that actively contributed to the discussion on >>>>>>>>>>>> the original google doc was in consensus. That's why I brought up >>>>>>>>>>>> the topic >>>>>>>>>>>> at the Community Sync on the 2024-02-14 ( >>>>>>>>>>>> https://youtu.be/uAQVGd5zV4I?t=890) to raise the awareness of >>>>>>>>>>>> the broader community. After which the discussion about the >>>>>>>>>>>> storage model >>>>>>>>>>>> started. I don't think that the discussion about a single aspect >>>>>>>>>>>> of a >>>>>>>>>>>> proposal should invalidate all other aspects of the proposal. >>>>>>>>>>>> >>>>>>>>>>>> Regardless, the state of the proposal from the original google >>>>>>>>>>>> doc contains a lot of valuable contributions from Micah, Szehon, >>>>>>>>>>>> Jack, Dan, >>>>>>>>>>>> yourself and others and it should at least provide the basis for >>>>>>>>>>>> any >>>>>>>>>>>> further discussion. I don't think it's effective to start with a >>>>>>>>>>>> completely >>>>>>>>>>>> different design because we are bound to have the same discussions >>>>>>>>>>>> all over >>>>>>>>>>>> again. >>>>>>>>>>>> >>>>>>>>>>>> Thanks, Jan >>>>>>>>>>>> On 08.05.24 12:11, Walaa Eldin Moustafa wrote: >>>>>>>>>>>> >>>>>>>>>>>> The only consensus the community had was on the object model >>>>>>>>>>>> through the most recent voting thread [1]. This kind of consensus >>>>>>>>>>>> was not >>>>>>>>>>>> present during the doc discussions, and this should be evident >>>>>>>>>>>> from the >>>>>>>>>>>> fact the last doc state listed 5 alternatives with no particular >>>>>>>>>>>> conclusion. I am not quite sure what type of consensus we are >>>>>>>>>>>> referring to >>>>>>>>>>>> here given all the follow up discussions, alternatives, etc. >>>>>>>>>>>> >>>>>>>>>>>> Due to the separate object model, the PR is fundamentally >>>>>>>>>>>> different from the doc in the sense it does not propose a new >>>>>>>>>>>> metadata >>>>>>>>>>>> model but rather formalizes some new table and view properties >>>>>>>>>>>> related to >>>>>>>>>>>> MVs. That is also one reason there are no repeated discussions. >>>>>>>>>>>> That said, >>>>>>>>>>>> if you feel there is a repeated discussion (which I do not see so >>>>>>>>>>>> far), it >>>>>>>>>>>> would be best to link the relevant discussion from the doc in a >>>>>>>>>>>> comment. >>>>>>>>>>>> >>>>>>>>>>>> Happy to move the discussion elsewhere if there is >>>>>>>>>>>> sufficient support for this idea, but as things stand, I do not >>>>>>>>>>>> see this as >>>>>>>>>>>> an efficient way to make progress. It sounds we have been >>>>>>>>>>>> re-emphasizing >>>>>>>>>>>> the same points in the last two replies, so I will let others >>>>>>>>>>>> chime in at >>>>>>>>>>>> this point. >>>>>>>>>>>> >>>>>>>>>>>> [1] >>>>>>>>>>>> https://lists.apache.org/thread/rotmqzmwk5jrcsyxhzjhrvcjs5v3yjcc >>>>>>>>>>>> >>>>>>>>>>>> Thanks, >>>>>>>>>>>> Walaa. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Wed, May 8, 2024 at 2:31 AM Jan Kaul >>>>>>>>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid> >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> The original google doc >>>>>>>>>>>>> <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?usp=sharing> >>>>>>>>>>>>> discussed multiple aspects of the Materialized View spec. One was >>>>>>>>>>>>> the >>>>>>>>>>>>> storage model while others were related to the metadata. After we >>>>>>>>>>>>> (Micah, >>>>>>>>>>>>> Szehon, you, me) reached consensus in the google doc, Jack raised >>>>>>>>>>>>> his >>>>>>>>>>>>> concern about the storage model and the long discussion about the >>>>>>>>>>>>> storage >>>>>>>>>>>>> model started. Now we truly reached consensus about the storage >>>>>>>>>>>>> model, >>>>>>>>>>>>> which is now also reflected in the google doc. All other aspects >>>>>>>>>>>>> from the >>>>>>>>>>>>> google doc about the metadata weren't questioned and still >>>>>>>>>>>>> represent the >>>>>>>>>>>>> consensus. >>>>>>>>>>>>> >>>>>>>>>>>>> I would like to *avoid repeating the discussions* in your PR >>>>>>>>>>>>> that we already had in the google doc. Especially since we reached >>>>>>>>>>>>> consensus which took a considerable amount of time. >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks, Jan >>>>>>>>>>>>> On 08.05.24 10:21, Walaa Eldin Moustafa wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks Jan. I think we moved on to more alignment steps beyond >>>>>>>>>>>>> that doc a while ago. After that doc, we have discussed the topic >>>>>>>>>>>>> further >>>>>>>>>>>>> in 2 dev list threads and one more doc >>>>>>>>>>>>> <https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1> >>>>>>>>>>>>> (with strictly two options for the storage model to consider). >>>>>>>>>>>>> Moreover, >>>>>>>>>>>>> the original doc grew to 14 pages long with one section comparing >>>>>>>>>>>>> 5 design >>>>>>>>>>>>> alternatives, which made things harder to reach consensus. The >>>>>>>>>>>>> lack of >>>>>>>>>>>>> consensus is what partly led up to the subsequent discussions and >>>>>>>>>>>>> call for >>>>>>>>>>>>> a more focused approach to reach consensus. If we already have a >>>>>>>>>>>>> consensus >>>>>>>>>>>>> on the storage model (separate tables and views), I think we >>>>>>>>>>>>> should take >>>>>>>>>>>>> things further and have continued focused discussions on the >>>>>>>>>>>>> specific >>>>>>>>>>>>> metadata in the form of a PR. I have included all previous >>>>>>>>>>>>> discussions >>>>>>>>>>>>> including the original doc and issue as references in the PR >>>>>>>>>>>>> description. >>>>>>>>>>>>> Please let me know if this works. Happy to hear others' thoughts >>>>>>>>>>>>> on the >>>>>>>>>>>>> best way to move forward. >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks, >>>>>>>>>>>>> Walaa. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Wed, May 8, 2024 at 12:56 AM Jan Kaul >>>>>>>>>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid> >>>>>>>>>>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks Walaa for trying to move things along. However I don't >>>>>>>>>>>>>> think it's a good idea to start a separate discussion about the >>>>>>>>>>>>>> metadata >>>>>>>>>>>>>> for materialized views because we already had this discussion >>>>>>>>>>>>>> and reached >>>>>>>>>>>>>> consensus in this google doc: >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?usp=sharing >>>>>>>>>>>>>> >>>>>>>>>>>>>> Once the draft is finalized we can adopt the PR to reflect >>>>>>>>>>>>>> the consensus from the google doc. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Best wishes, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Jan >>>>>>>>>>>>>> On 07.05.24 19:11, Walaa Eldin Moustafa wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks Steven. I feel it is needed so the MV spec is not >>>>>>>>>>>>>> scattered across the table and view spec pages. We may add a >>>>>>>>>>>>>> reference in >>>>>>>>>>>>>> each respective properties section. >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Tue, May 7, 2024 at 10:04 AM Steven Wu < >>>>>>>>>>>>>> stevenz...@gmail.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Walaa, thanks for initiating the next step. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> With the agreed model of separate view and storage table, I >>>>>>>>>>>>>>> am wondering if a separate materialized view spec page is >>>>>>>>>>>>>>> needed. E.g., the >>>>>>>>>>>>>>> new view metadata (view-materialized and view-storage-table) is >>>>>>>>>>>>>>> probably >>>>>>>>>>>>>>> good to be added to the view page directly to avoid information >>>>>>>>>>>>>>> scattering. >>>>>>>>>>>>>>> The same can be said about the storage table metadata. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> We may keep the separate materialized view page to document >>>>>>>>>>>>>>> motivation, freshness semantics, etc.. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Mon, May 6, 2024 at 10:58 PM Walaa Eldin Moustafa < >>>>>>>>>>>>>>> wa.moust...@gmail.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hi Everyone, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thanks again for participating in the modeling discussion >>>>>>>>>>>>>>>> [1]. Since the outcome of this discussion was to model >>>>>>>>>>>>>>>> materialized views >>>>>>>>>>>>>>>> as separate objects, an Iceberg view and a table, I think the >>>>>>>>>>>>>>>> next step >>>>>>>>>>>>>>>> should be discussing the metadata details for each object. I >>>>>>>>>>>>>>>> have created a >>>>>>>>>>>>>>>> PR https://github.com/apache/iceberg/pull/10280 with an >>>>>>>>>>>>>>>> initial spec improvement. Please feel free to review it and >>>>>>>>>>>>>>>> leave feedback >>>>>>>>>>>>>>>> there. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> [1] >>>>>>>>>>>>>>>> https://lists.apache.org/thread/rotmqzmwk5jrcsyxhzjhrvcjs5v3yjcc >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>> Walaa. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>> >>> -- >>> John Zhuge >>> >>