Re: Materialized Views: Next Steps

Walaa Eldin Moustafa Thu, 16 May 2024 15:57:19 -0700

Hi Benny, I have responded to the comment.

I would suggest that we use this thread to evaluate properties model vs top
level metadata model (to avoid discussion drift).


If we have feedback on the actual properties used in the properties model
as defined in the PR, we can have the discussion there.

THanks,
Walaa.


On Thu, May 16, 2024 at 3:22 PM Benny Chow <btc...@gmail.com> wrote:

> Hi Walaa
>
> I left comments in your spec PR:
> https://github.com/apache/iceberg/pull/10280#pullrequestreview-2061922169
>  My last question about use cases was really about incremental refresh with
> aggregates.  But I think this might be too complicated to try to
> model/discuss now and so I agree with Micah's comment about doing it in a
> future iteration.
>
> Hi Jan,
>
> Regarding storing the identifiers, I like the idea too.  Dremio's query
> engine supports MVs on sources besides Iceberg tables.  Here's everything
> that's in a single lineage entry:
> https://github.com/dremio/dremio-oss/blob/master/services/accelerator/src/main/protobuf/reflection.proto#L80
> The lineage is stored as a graph and not a list of entries.  I think for
> what we are trying to achieve, it's more practical to limit the lineage to
> Iceberg sources.
>
> Thanks
> Benny
>
>
>
> On Wed, May 15, 2024 at 12:06 AM Jan Kaul <jank...@mailbox.org.invalid>
> wrote:
>
>> I agree with Szehon and Benny that storing the lineage information as
>> multiple table properties is too brittle, especially for many source tables
>> (base tables). I would prefer to have the whole lineage information in one
>> entry as it is more concise. This is also how Trino has been doing it, as
>> you can see here
>> <https://github.com/trinodb/trino/blob/212455d3e1d393f58cbc395d2b9da47ed8f23dd8/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java#L2915>
>> .
>>
>> As we've discussed in the google doc
>> <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit#heading=h.60qmzug7bzxc>:
>> it is helpful to also store the table identifiers of the source tables to
>> enable clients to determine the freshness of the MV that don't understand
>> the SQL dialect of the MV definition, like other query engines, BI tools
>> and Dataframe libraries. This is also how Trino is doing it. That's why we
>> chose the design in the google doc.
>>
>> Storing the storage table identifier as a property might work.
>>
>> Thanks, Jan
>> On 15.05.24 02:38, Walaa Eldin Moustafa wrote:
>>
>> Thanks Benny. My specific thoughts about the spec and the properties are
>> captured in the spec PR https://github.com/apache/iceberg/pull/10280.
>> The spec is also implemented in the Spark implementation PR
>> https://github.com/apache/iceberg/pull/9830, and I believe this follows
>> the same nature of how the information was captured in Netflix's
>> implementation with Spark, and Trino implementation (prior to formalizing
>> through that spec), both of which have been used reliably for years. I
>> think it also aligns with Ryan's feedback here
>> https://github.com/apache/iceberg/issues/6420#issuecomment-1369280546 which
>> indicated the usage of properties.
>>
>> The reasons for choosing properties:
>> * Not every table is a storage table and not every view is a materialized
>> view. I feel exposing the info as top level metadata is an overkill for the
>> original object type.
>> * The properties are simple. They contain either single snapshot ID each,
>> or single view version each, or lastly, the storage table identifier.
>> Engines can use them without issues (also as shown in the implementation).
>> * To be meaningful, the metadata fields should be captured in the engine
>> API as well, which is an effort that has to be pursued outside the Iceberg
>> community. Until engine APIs evolve, we would have to define a mapping
>> between Iceberg metadata fields and engine properties (only current place
>> in engine side to capture this info). This requires an additional spec on
>> its own, and it will introduce complexities. Hence it is always cleaner to
>> map Iceberg properties to engine properties and Iceberg metadata to
>> designated engine APIs. Mixing and matching (e.g., Iceberg metadata fields
>> as engine properties) just for the lack of other cleaner options does not
>> sound like a good idea in both short and long term.
>>
>> Let me know your thoughts.
>>
>> Thanks,
>> Walaa.
>>
>>
>>
>> On Tue, May 14, 2024 at 5:12 PM Benny Chow <btc...@gmail.com> wrote:
>>
>>> I agree with Szheon here.  I think storing the materialization lineage
>>> as a bunch of properties is brittle.  This lineage information is needed by
>>> engines to validate the staleness of a materialization and also to perform
>>> full or incremental refreshes.  There’s a lot to capture here.
>>>
>>> Maybe we should drill down into the use cases first - such as
>>> incremental refresh with aggregates?  (Pick a harder one first 😀)
>>>
>>> I left a few comments about this in the doc.  I wonder what are your
>>> thoughts here Walaa?
>>>
>>> Thanks
>>>
>>> On May 14, 2024, at 4:20 PM, Walaa Eldin Moustafa <wa.moust...@gmail.com>
>>> wrote:
>>>
>>> 
>>> Thanks John. The current metadata does not sound complex. We need to
>>> track the underlying table snapshot IDs as well as the view version ID. I
>>> agree as long as it is simple and before this feature fully matures, we
>>> should track it in properties.
>>>
>>> One important factor for me (apart from the API effort, especially on
>>> the engine side), is that not each table is an MV storage table. Surfacing
>>> MV-specific storage table properties as first class table metadata sounds
>>> to impose this metadata on every table, when it is not required for normal
>>> table operation (yes, they can be optional, but it does not sound like the
>>> use case warrants exposing them as metadata fields yet).
>>>
>>> Similarly, since not every view is a materialized view, it sounds
>>> reasonable to keep MV-specific data in properties.
>>>
>>> UUID (for views), on the other hand, is common to all views, hence it
>>> made sense to add it as a top level field.
>>>
>>> Thanks,
>>> Walaa.
>>>
>>>
>>> On Tue, May 14, 2024 at 1:01 PM John Zhuge <jzh...@apache.org> wrote:
>>>
>>>> Hi Szheon,
>>>>
>>>> While I fully share your concern of abusing table properties, we took
>>>> the approach of option 1 and run it in production for several years:
>>>>
>>>>    - the feature was still evolving
>>>>    - quick and simple implementation
>>>>    - table properties are simple enough and not confusing
>>>>    - haven't seen an urgent need to convert the properties to metadata
>>>>    fields and add API
>>>>    - do not wish on-disk changes (requiring lengthy tedious migration)
>>>>
>>>>
>>>> That said, I am open to codifying the mv metadata into api and spec,
>>>> with the following considerations
>>>>
>>>>    - mv metadata has reached maturity and consensus (could be just a
>>>>    core portion)
>>>>    - when mv metadata becomes too complex
>>>>    - wish to support use cases that are quicker to adopt API changes
>>>>    (than engines), e.g., using Iceberg library to manipulate MVs, or 
>>>> parsing
>>>>    metadata files directly
>>>>    - Spark view catalog API can evolve separately from Iceberg API and
>>>>    spec changes
>>>>
>>>>
>>>> Thanks all for the great discussion!
>>>>
>>>> On Fri, May 10, 2024 at 10:48 PM Walaa Eldin Moustafa <
>>>> wa.moust...@gmail.com> wrote:
>>>>
>>>>> Hi Szheon,
>>>>>
>>>>> Thanks for the follow-up. It is possible some of the concerns were
>>>>> referring to the backend catalogs, but it is all connected. My main
>>>>> personal concern is from the engine connector APIs point of view, but I
>>>>> share the concern about the catalogs.
>>>>>
>>>>> I think everyone's concern is not about the complexity* per* backend
>>>>> catalog/engine catalog API (in which case adding new metadata to
>>>>> existing objects could require less code), but rather about the
>>>>> *number* of APIs and implementations that need to change (in
>>>>> which case both new metadata to existing objects and new objects 
>>>>> altogether
>>>>> introduce equal complexity).
>>>>>
>>>>> Thanks,
>>>>> Walaa.
>>>>>
>>>>>
>>>>> On Fri, May 10, 2024 at 10:31 AM Szehon Ho <szehon.apa...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Walaa
>>>>>>
>>>>>> OK thanks for confirming.  I am still not 100% in agreement, my
>>>>>> understanding of the rationale for separate Table/View objects in the
>>>>>> comment that you linked:
>>>>>>
>>>>>> I think the biggest problem with this is that we would need to modify
>>>>>>> every catalog to support this combination and that would be really
>>>>>>> difficult.
>>>>>>
>>>>>>
>>>>>> is about JavaCatalogs /REST Catalog needing to to support creating ,
>>>>>> persisting, and loading a MaterializedView object, which is much more
>>>>>> complex.  See HiveView PR for example :
>>>>>> https://github.com/apache/iceberg/pull/9852   We would have to do
>>>>>> the same exercise for persisting MV.
>>>>>>
>>>>>> In our case though, there's not much complexity regardless of
>>>>>> approach ('properties' or new metadata fields), in terms of Java
>>>>>> Catalog/REST Catalog.  It's mostly pass-through to storage.  Looks like 
>>>>>> you
>>>>>> are referring to Spark's View model in terms of complexity, which may be 
>>>>>> a
>>>>>> different story, but not sure if it is a good rationale to make Iceberg 
>>>>>> to
>>>>>> use 'properties' .
>>>>>>
>>>>>> 'properties'  is for read/write configurations, not to save
>>>>>> metadatas.  To me, its also brittle to save important metadata, as it's 
>>>>>> not
>>>>>> in the defined schema.
>>>>>>
>>>>>> A string to string map of table properties. This is used to control
>>>>>>> settings that affect reading and writing and is not intended to be used 
>>>>>>> for
>>>>>>> arbitrary metadata.  For example, commit.retry.num-retries is used
>>>>>>> to control the number of commit retries.
>>>>>>
>>>>>>
>>>>>> On the other hand, the Draft Spec suggests to save `lineage` as a
>>>>>> modeled field on the Storage Table's snapshot metadata.  This allows you 
>>>>>> to
>>>>>> 'time travel', 'branch', and have this metadata life cycle integrated via
>>>>>> normal snapshots lifecycle operations.
>>>>>>
>>>>>> So that's my rationale.  Not sure if we can come to an agreement over
>>>>>> email though, and may need others to chime in as well.
>>>>>>
>>>>>> Thanks
>>>>>> Szehon
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, May 9, 2024 at 11:58 PM Walaa Eldin Moustafa <
>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Szehon,
>>>>>>>
>>>>>>> Yes, you are reading the PR correctly, and interpreting the meaning
>>>>>>> of properties correctly. I think the reply you pasted from Ryan refers 
>>>>>>> to
>>>>>>> the same concept as well.
>>>>>>>
>>>>>>> For the initial Google doc and the issue (by the way it is an issue,
>>>>>>> not a PR), yes both are proposing new metadata fields.
>>>>>>>
>>>>>>> The references I made to the modeling doc [1, 2] are reasons why new
>>>>>>> APIs are not desired. The cons/concerns applicable to new MV metadata 
>>>>>>> apply
>>>>>>> by extension to new table and view metadata fields.
>>>>>>>
>>>>>>> The reason why new metadata adds complexity is that this new
>>>>>>> metadata needs to be propagated to the engine API. For example, here is 
>>>>>>> the
>>>>>>> ViewInfo [3] class in the Spark catalog, which is used in view methods 
>>>>>>> like
>>>>>>> createView. Its fields correspond with the Iceberg metadata. Adding new
>>>>>>> Iceberg fields should be accompanied with new fields in the engine
>>>>>>> catalog/connector APIs, which was a major reason for rejecting the 
>>>>>>> combined
>>>>>>> MV object model as well.
>>>>>>>
>>>>>>> [1]
>>>>>>> https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1&disco=AAABK7e3QB4
>>>>>>> [2]
>>>>>>> https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1&disco=AAABIonvCGE
>>>>>>> [3]
>>>>>>> https://github.com/apache/spark/blob/2df494fd4e4e64b9357307fb0c5e8fc1b7491ac3/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/ViewInfo.java#L45
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Walaa.
>>>>>>>
>>>>>>> On Thu, May 9, 2024 at 11:30 PM Szehon Ho <szehon.apa...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Walaa
>>>>>>>>
>>>>>>>> As there may be confusion in the word 'properties', I want to
>>>>>>>> double check if we are talking about the same thing here.
>>>>>>>>
>>>>>>>> I am reading your PR as adding lineage metadata as new key/value
>>>>>>>> pair under the storage Table's 'properties' field:
>>>>>>>> https://github.com/apache/iceberg/blob/main/format/spec.md?plain=1#L677
>>>>>>>>
>>>>>>>> *optional* *optional* *properties* A string to string map of table
>>>>>>>> properties. This is used to control settings that affect reading and
>>>>>>>> writing and is not intended to be used for arbitrary metadata. For 
>>>>>>>> example,
>>>>>>>> commit.retry.num-retries is used to control the number of commit
>>>>>>>> retries.
>>>>>>>> and adding Storage Table pointer as key/value pair in the View's
>>>>>>>> 'properties' field:
>>>>>>>> https://github.com/apache/iceberg/blob/main/format/view-spec.md?plain=1#L65
>>>>>>>>
>>>>>>>> *optional* properties A string to string map of view properties [2]
>>>>>>>> Is that correct?
>>>>>>>>
>>>>>>>> On the other hand, I was talking about adding this metadata as
>>>>>>>> actual fields, as is described in the Draft Spec of the Design Doc
>>>>>>>> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A
>>>>>>>>  and
>>>>>>>> first PR https://github.com/apache/iceberg/issues/6420 .
>>>>>>>>
>>>>>>>> Do you mean, the vote means we cannot model new fields like
>>>>>>>> 'materialization' and 'lineage' as was proposed there ?    If that is 
>>>>>>>> the
>>>>>>>> interpretation, I am not sure I agree.  I dont fully see how new fields
>>>>>>>> adds more catalog implementation complexity over new key/value 
>>>>>>>> properties?
>>>>>>>> To me, the vote seemed to just rule out using a combined catalog object
>>>>>>>> (MaterializedView) in favor of re-using the Table and View metadata 
>>>>>>>> models,
>>>>>>>> not to prevent change to the Table and View model.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Szehon
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, May 9, 2024 at 10:05 PM Walaa Eldin Moustafa <
>>>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Szehon,
>>>>>>>>>
>>>>>>>>> I think choosing separate view + table objects precludes us from
>>>>>>>>> adding new metadata to table and view metadata. Here is one relevant
>>>>>>>>> comment [1] from Ryan on the modeling doc, where his point is that we 
>>>>>>>>> want
>>>>>>>>> to avoid introducing new APIs since it requires updating every 
>>>>>>>>> catalog, and
>>>>>>>>> (quoting) even now, we have few implementations that support views 
>>>>>>>>> because
>>>>>>>>> of the problems updating back ends. Therefore, one of the major 
>>>>>>>>> reasons to
>>>>>>>>> avoid a new model with new metadata is to avoid adding new metadata, 
>>>>>>>>> which
>>>>>>>>> introduces this complexity. Here is another similar comment from 
>>>>>>>>> Renjie [2]
>>>>>>>>> on the cons listed for the combined object approach.
>>>>>>>>>
>>>>>>>>> Even Ryan's point on the MV issue that you referenced reads to me
>>>>>>>>> as he is supportive of the property model. Here are some quotes:
>>>>>>>>>
>>>>>>>>> > We would still want some MV metadata in table *properties*.
>>>>>>>>>
>>>>>>>>> > I recommend instead reusing the existing snapshot metadata
>>>>>>>>> structure to store what you need as snapshot *properties*.
>>>>>>>>>
>>>>>>>>> > First, I think we want to avoid keeping much state information
>>>>>>>>> in complex table *properties*.
>>>>>>>>>
>>>>>>>>> Again, here, he is supportive of table properties, but wants to
>>>>>>>>> make sure that the information is simple.
>>>>>>>>>
>>>>>>>>> > We may want additional metadata as well, like a UUID to ensure
>>>>>>>>> we have the right view. I don't think we have a UUID in the view spec 
>>>>>>>>> yet,
>>>>>>>>> but we could add one.
>>>>>>>>>
>>>>>>>>> Here, he is very specific when it comes to new metadata fields,
>>>>>>>>> and explicitly calls it out. That is the only new metadata field in 
>>>>>>>>> that
>>>>>>>>> reply and by now it is already supported. It is also not MV-specific.
>>>>>>>>>
>>>>>>>>> Hope this addresses your question on the property vs new metadata
>>>>>>>>> model.
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1&disco=AAABK7e3QB4
>>>>>>>>> [2]
>>>>>>>>> https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1&disco=AAABIonvCGE
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Walaa.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, May 9, 2024 at 5:49 PM Szehon Ho <szehon.apa...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Walaa,
>>>>>>>>>>
>>>>>>>>>> I agree, I definitely do not want yet another pr/doc where
>>>>>>>>>> discussion happens. as its already quite spread out :)  But did not 
>>>>>>>>>> want to
>>>>>>>>>> clarify some points before we get started on the discussion on your 
>>>>>>>>>> PR.
>>>>>>>>>>
>>>>>>>>>> With reusing the table and view objects, we are not changing the
>>>>>>>>>>> existing metadata of either table or view spec but rather introduce 
>>>>>>>>>>> new
>>>>>>>>>>> properties and formalize them to express materialized views
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On this point, I am not 100% sure that choosing to represent a
>>>>>>>>>> MaterializedView as a separate View + Table object precludes us from 
>>>>>>>>>> adding
>>>>>>>>>> to metadata of Table or View as the Draft Spec suggested, though.  I 
>>>>>>>>>> think
>>>>>>>>>> this point was discussed in Jan's initial PR with a good point from 
>>>>>>>>>> Ryan:
>>>>>>>>>> https://github.com/apache/iceberg/issues/6420#issuecomment-1369280546
>>>>>>>>>>  that
>>>>>>>>>> using Table Properties to track lineage is fairly brittle, and 
>>>>>>>>>> having it
>>>>>>>>>> formalized in the Iceberg metadata is cleaner, and that was thus the
>>>>>>>>>> direction of the Draft Spec in the design doc.  What do people think?
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Szehon
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, May 9, 2024 at 5:35 PM Walaa Eldin Moustafa <
>>>>>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks Szehon.
>>>>>>>>>>>
>>>>>>>>>>> The reason for the difference is that the proposal in the Google
>>>>>>>>>>> doc is based on a new MV model, hence, new metadata fields and a new
>>>>>>>>>>> metadata model were being introduced (with types, optionality, 
>>>>>>>>>>> etc). With
>>>>>>>>>>> reusing the table and view objects, we are not changing the existing
>>>>>>>>>>> metadata of either table or view spec but rather introduce new 
>>>>>>>>>>> properties
>>>>>>>>>>> and formalize them to express materialized views. This would be the 
>>>>>>>>>>> answer
>>>>>>>>>>> to most of the questions you posted on the PR (besides some naming
>>>>>>>>>>> questions, which I think should be straightforward).
>>>>>>>>>>>
>>>>>>>>>>> With that fundamental difference, we cannot lift and shift what
>>>>>>>>>>> is in the doc to any PR. Further, having consensus on separate 
>>>>>>>>>>> table and
>>>>>>>>>>> view objects contradicts with the point being made on having 
>>>>>>>>>>> consensus on
>>>>>>>>>>> the doc. We might have had agreements on some elements, but 
>>>>>>>>>>> definitely not
>>>>>>>>>>> on the whole doc, proven by the follow ups (also as a community, not
>>>>>>>>>>> individuals).
>>>>>>>>>>>
>>>>>>>>>>> Therefore: we need a new space to discuss the separate table and
>>>>>>>>>>> view properties.
>>>>>>>>>>>
>>>>>>>>>>> Is the question whether to:
>>>>>>>>>>> 1- Create a new doc
>>>>>>>>>>> 2- Create a new PR?
>>>>>>>>>>>
>>>>>>>>>>> I feel a PR is the most effective way, especially given the fact
>>>>>>>>>>> that we discussed the topic a lot by now. If we agree, we can 
>>>>>>>>>>> continue the
>>>>>>>>>>> discussion on the PR, else, we can create a doc.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Walaa.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, May 9, 2024 at 4:39 PM Szehon Ho <
>>>>>>>>>>> szehon.apa...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks Walaa for driving it forward, looking forward to
>>>>>>>>>>>> thinking about implementation of Materialized Views.
>>>>>>>>>>>>
>>>>>>>>>>>> I see Jan's point, the PR spec change is similar but does not
>>>>>>>>>>>> seem to be completely aligned with the Draft Spec in the design 
>>>>>>>>>>>> doc:
>>>>>>>>>>>> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/
>>>>>>>>>>>> .  I left my comments on PR of those sections with the links to the
>>>>>>>>>>>> difference.  I think most of those Draft Spec proposal is still 
>>>>>>>>>>>> applicable
>>>>>>>>>>>> after the decision to have separate Table and View objects  It 
>>>>>>>>>>>> will be
>>>>>>>>>>>> interesting to at least see drill a bit further why we did not 
>>>>>>>>>>>> choose the
>>>>>>>>>>>> approach in the Draft Spec and chose another way.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Szehon
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, May 8, 2024 at 4:48 AM Jan Kaul
>>>>>>>>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Well, everybody that actively contributed to the discussion on
>>>>>>>>>>>>> the original google doc was in consensus. That's why I brought up 
>>>>>>>>>>>>> the topic
>>>>>>>>>>>>> at the Community Sync on the 2024-02-14 (
>>>>>>>>>>>>> https://youtu.be/uAQVGd5zV4I?t=890) to raise the awareness of
>>>>>>>>>>>>> the broader community. After which the discussion about the 
>>>>>>>>>>>>> storage model
>>>>>>>>>>>>> started. I don't think that the discussion about a single aspect 
>>>>>>>>>>>>> of a
>>>>>>>>>>>>> proposal should invalidate all other aspects of the proposal.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regardless, the state of the proposal from the original google
>>>>>>>>>>>>> doc contains a lot of valuable contributions from Micah, Szehon, 
>>>>>>>>>>>>> Jack, Dan,
>>>>>>>>>>>>> yourself and others and it should at least provide the basis for 
>>>>>>>>>>>>> any
>>>>>>>>>>>>> further discussion. I don't think it's effective to start with a 
>>>>>>>>>>>>> completely
>>>>>>>>>>>>> different design because we are bound to have the same 
>>>>>>>>>>>>> discussions all over
>>>>>>>>>>>>> again.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks, Jan
>>>>>>>>>>>>> On 08.05.24 12:11, Walaa Eldin Moustafa wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> The only consensus the community had was on the object model
>>>>>>>>>>>>> through the most recent voting thread [1]. This kind of consensus 
>>>>>>>>>>>>> was not
>>>>>>>>>>>>> present during the doc discussions, and this should be evident 
>>>>>>>>>>>>> from the
>>>>>>>>>>>>> fact the last doc state listed 5 alternatives with no particular
>>>>>>>>>>>>> conclusion. I am not quite sure what type of consensus we are 
>>>>>>>>>>>>> referring to
>>>>>>>>>>>>> here given all the follow up discussions, alternatives, etc.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Due to the separate object model, the PR is fundamentally
>>>>>>>>>>>>> different from the doc in the sense it does not propose a new 
>>>>>>>>>>>>> metadata
>>>>>>>>>>>>> model but rather formalizes some new table and view properties 
>>>>>>>>>>>>> related to
>>>>>>>>>>>>> MVs. That is also one reason there are no repeated discussions. 
>>>>>>>>>>>>> That said,
>>>>>>>>>>>>> if you feel there is a repeated discussion (which I do not see so 
>>>>>>>>>>>>> far), it
>>>>>>>>>>>>> would be best to link the relevant discussion from the doc in a 
>>>>>>>>>>>>> comment.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Happy to move the discussion elsewhere if there is
>>>>>>>>>>>>> sufficient support for this idea, but as things stand, I do not 
>>>>>>>>>>>>> see this as
>>>>>>>>>>>>> an efficient way to make progress. It sounds we have been 
>>>>>>>>>>>>> re-emphasizing
>>>>>>>>>>>>> the same points in the last two replies, so I will let others 
>>>>>>>>>>>>> chime in at
>>>>>>>>>>>>> this point.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1]
>>>>>>>>>>>>> https://lists.apache.org/thread/rotmqzmwk5jrcsyxhzjhrvcjs5v3yjcc
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Walaa.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, May 8, 2024 at 2:31 AM Jan Kaul
>>>>>>>>>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> The original google doc
>>>>>>>>>>>>>> <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?usp=sharing>
>>>>>>>>>>>>>> discussed multiple aspects of the Materialized View spec. One 
>>>>>>>>>>>>>> was the
>>>>>>>>>>>>>> storage model while others were related to the metadata. After 
>>>>>>>>>>>>>> we (Micah,
>>>>>>>>>>>>>> Szehon, you, me) reached consensus in the google doc, Jack 
>>>>>>>>>>>>>> raised his
>>>>>>>>>>>>>> concern about the storage model and the long discussion about 
>>>>>>>>>>>>>> the storage
>>>>>>>>>>>>>> model started. Now we truly reached consensus about the storage 
>>>>>>>>>>>>>> model,
>>>>>>>>>>>>>> which is now also reflected in the google doc. All other aspects 
>>>>>>>>>>>>>> from the
>>>>>>>>>>>>>> google doc about the metadata weren't questioned and still 
>>>>>>>>>>>>>> represent the
>>>>>>>>>>>>>> consensus.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would like to *avoid repeating the discussions* in your PR
>>>>>>>>>>>>>> that we already had in the google doc. Especially since we 
>>>>>>>>>>>>>> reached
>>>>>>>>>>>>>> consensus which took a considerable amount of time.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks, Jan
>>>>>>>>>>>>>> On 08.05.24 10:21, Walaa Eldin Moustafa wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks Jan. I think we moved on to more alignment steps
>>>>>>>>>>>>>> beyond that doc a while ago. After that doc, we have discussed 
>>>>>>>>>>>>>> the topic
>>>>>>>>>>>>>> further in 2 dev list threads and one more doc
>>>>>>>>>>>>>> <https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1>
>>>>>>>>>>>>>> (with strictly two options for the storage model to consider). 
>>>>>>>>>>>>>> Moreover,
>>>>>>>>>>>>>> the original doc grew to 14 pages long with one section 
>>>>>>>>>>>>>> comparing 5 design
>>>>>>>>>>>>>> alternatives, which made things harder to reach consensus. The 
>>>>>>>>>>>>>> lack of
>>>>>>>>>>>>>> consensus is what partly led up to the subsequent discussions 
>>>>>>>>>>>>>> and call for
>>>>>>>>>>>>>> a more focused approach to reach consensus. If we already have a 
>>>>>>>>>>>>>> consensus
>>>>>>>>>>>>>> on the storage model (separate tables and views), I think we 
>>>>>>>>>>>>>> should take
>>>>>>>>>>>>>> things further and have continued focused discussions on the 
>>>>>>>>>>>>>> specific
>>>>>>>>>>>>>> metadata in the form of a PR. I have included all previous 
>>>>>>>>>>>>>> discussions
>>>>>>>>>>>>>> including the original doc and issue as references in the PR 
>>>>>>>>>>>>>> description.
>>>>>>>>>>>>>> Please let me know if this works. Happy to hear others' thoughts 
>>>>>>>>>>>>>> on the
>>>>>>>>>>>>>> best way to move forward.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Walaa.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, May 8, 2024 at 12:56 AM Jan Kaul
>>>>>>>>>>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks Walaa for trying to move things along. However I
>>>>>>>>>>>>>>> don't think it's a good idea to start a separate discussion 
>>>>>>>>>>>>>>> about the
>>>>>>>>>>>>>>> metadata for materialized views because we already had this 
>>>>>>>>>>>>>>> discussion and
>>>>>>>>>>>>>>> reached consensus in this google doc:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?usp=sharing
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Once the draft is finalized we can adopt the PR to reflect
>>>>>>>>>>>>>>> the consensus from the google doc.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best wishes,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Jan
>>>>>>>>>>>>>>> On 07.05.24 19:11, Walaa Eldin Moustafa wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks Steven. I feel it is needed so the MV spec is not
>>>>>>>>>>>>>>> scattered across the table and view spec pages. We may add a 
>>>>>>>>>>>>>>> reference in
>>>>>>>>>>>>>>> each respective properties section.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, May 7, 2024 at 10:04 AM Steven Wu <
>>>>>>>>>>>>>>> stevenz...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Walaa, thanks for initiating the next step.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> With the agreed model of separate view and storage table, I
>>>>>>>>>>>>>>>> am wondering if a separate materialized view spec page is 
>>>>>>>>>>>>>>>> needed. E.g., the
>>>>>>>>>>>>>>>> new view metadata (view-materialized and view-storage-table) 
>>>>>>>>>>>>>>>> is probably
>>>>>>>>>>>>>>>> good to be added to the view page directly to avoid 
>>>>>>>>>>>>>>>> information scattering.
>>>>>>>>>>>>>>>> The same can be said about the storage table metadata.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We may keep the separate materialized view page to document
>>>>>>>>>>>>>>>> motivation, freshness semantics, etc..
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, May 6, 2024 at 10:58 PM Walaa Eldin Moustafa <
>>>>>>>>>>>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Everyone,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks again for participating in the modeling discussion
>>>>>>>>>>>>>>>>> [1]. Since the outcome of this discussion was to model 
>>>>>>>>>>>>>>>>> materialized views
>>>>>>>>>>>>>>>>> as separate objects, an Iceberg view and a table, I think the 
>>>>>>>>>>>>>>>>> next step
>>>>>>>>>>>>>>>>> should be discussing the metadata details for each object. I 
>>>>>>>>>>>>>>>>> have created a
>>>>>>>>>>>>>>>>> PR https://github.com/apache/iceberg/pull/10280 with an
>>>>>>>>>>>>>>>>> initial spec improvement. Please feel free to review it and 
>>>>>>>>>>>>>>>>> leave feedback
>>>>>>>>>>>>>>>>> there.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>> https://lists.apache.org/thread/rotmqzmwk5jrcsyxhzjhrvcjs5v3yjcc
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Walaa.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>
>>>> --
>>>> John Zhuge
>>>>
>>>

Re: Materialized Views: Next Steps

Reply via email to