Hi Szheon,

While I fully share your concern of abusing table properties, we took the
approach of option 1 and run it in production for several years:

   - the feature was still evolving
   - quick and simple implementation
   - table properties are simple enough and not confusing
   - haven't seen an urgent need to convert the properties to metadata
   fields and add API
   - do not wish on-disk changes (requiring lengthy tedious migration)


That said, I am open to codifying the mv metadata into api and spec, with
the following considerations

   - mv metadata has reached maturity and consensus (could be just a core
   portion)
   - when mv metadata becomes too complex
   - wish to support use cases that are quicker to adopt API changes (than
   engines), e.g., using Iceberg library to manipulate MVs, or parsing
   metadata files directly
   - Spark view catalog API can evolve separately from Iceberg API and spec
   changes


Thanks all for the great discussion!

On Fri, May 10, 2024 at 10:48 PM Walaa Eldin Moustafa <wa.moust...@gmail.com>
wrote:

> Hi Szheon,
>
> Thanks for the follow-up. It is possible some of the concerns were
> referring to the backend catalogs, but it is all connected. My main
> personal concern is from the engine connector APIs point of view, but I
> share the concern about the catalogs.
>
> I think everyone's concern is not about the complexity* per* backend
> catalog/engine catalog API (in which case adding new metadata to
> existing objects could require less code), but rather about the *number* of
> APIs and implementations that need to change (in which case both new
> metadata to existing objects and new objects altogether introduce equal
> complexity).
>
> Thanks,
> Walaa.
>
>
> On Fri, May 10, 2024 at 10:31 AM Szehon Ho <szehon.apa...@gmail.com>
> wrote:
>
>> Hi Walaa
>>
>> OK thanks for confirming.  I am still not 100% in agreement, my
>> understanding of the rationale for separate Table/View objects in the
>> comment that you linked:
>>
>> I think the biggest problem with this is that we would need to modify
>>> every catalog to support this combination and that would be really
>>> difficult.
>>
>>
>> is about JavaCatalogs /REST Catalog needing to to support creating ,
>> persisting, and loading a MaterializedView object, which is much more
>> complex.  See HiveView PR for example :
>> https://github.com/apache/iceberg/pull/9852   We would have to do the
>> same exercise for persisting MV.
>>
>> In our case though, there's not much complexity regardless of approach
>> ('properties' or new metadata fields), in terms of Java Catalog/REST
>> Catalog.  It's mostly pass-through to storage.  Looks like you are
>> referring to Spark's View model in terms of complexity, which may be a
>> different story, but not sure if it is a good rationale to make Iceberg to
>> use 'properties' .
>>
>> 'properties'  is for read/write configurations, not to save metadatas.
>> To me, its also brittle to save important metadata, as it's not in the
>> defined schema.
>>
>> A string to string map of table properties. This is used to control
>>> settings that affect reading and writing and is not intended to be used for
>>> arbitrary metadata.  For example, commit.retry.num-retries is used to
>>> control the number of commit retries.
>>
>>
>> On the other hand, the Draft Spec suggests to save `lineage` as a modeled
>> field on the Storage Table's snapshot metadata.  This allows you to 'time
>> travel', 'branch', and have this metadata life cycle integrated via normal
>> snapshots lifecycle operations.
>>
>> So that's my rationale.  Not sure if we can come to an agreement over
>> email though, and may need others to chime in as well.
>>
>> Thanks
>> Szehon
>>
>>
>>
>>
>> On Thu, May 9, 2024 at 11:58 PM Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>>
>>> Hi Szehon,
>>>
>>> Yes, you are reading the PR correctly, and interpreting the meaning of
>>> properties correctly. I think the reply you pasted from Ryan refers to the
>>> same concept as well.
>>>
>>> For the initial Google doc and the issue (by the way it is an issue, not
>>> a PR), yes both are proposing new metadata fields.
>>>
>>> The references I made to the modeling doc [1, 2] are reasons why new
>>> APIs are not desired. The cons/concerns applicable to new MV metadata apply
>>> by extension to new table and view metadata fields.
>>>
>>> The reason why new metadata adds complexity is that this new metadata
>>> needs to be propagated to the engine API. For example, here is the ViewInfo
>>> [3] class in the Spark catalog, which is used in view methods like
>>> createView. Its fields correspond with the Iceberg metadata. Adding new
>>> Iceberg fields should be accompanied with new fields in the engine
>>> catalog/connector APIs, which was a major reason for rejecting the combined
>>> MV object model as well.
>>>
>>> [1]
>>> https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1&disco=AAABK7e3QB4
>>> [2]
>>> https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1&disco=AAABIonvCGE
>>> [3]
>>> https://github.com/apache/spark/blob/2df494fd4e4e64b9357307fb0c5e8fc1b7491ac3/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/ViewInfo.java#L45
>>>
>>> Thanks,
>>> Walaa.
>>>
>>> On Thu, May 9, 2024 at 11:30 PM Szehon Ho <szehon.apa...@gmail.com>
>>> wrote:
>>>
>>>> Hi Walaa
>>>>
>>>> As there may be confusion in the word 'properties', I want to double
>>>> check if we are talking about the same thing here.
>>>>
>>>> I am reading your PR as adding lineage metadata as new key/value pair
>>>> under the storage Table's 'properties' field:
>>>> https://github.com/apache/iceberg/blob/main/format/spec.md?plain=1#L677
>>>>
>>>> *optional* *optional* *properties* A string to string map of table
>>>> properties. This is used to control settings that affect reading and
>>>> writing and is not intended to be used for arbitrary metadata. For example,
>>>> commit.retry.num-retries is used to control the number of commit
>>>> retries.
>>>> and adding Storage Table pointer as key/value pair in the View's
>>>> 'properties' field:
>>>> https://github.com/apache/iceberg/blob/main/format/view-spec.md?plain=1#L65
>>>>
>>>> *optional* properties A string to string map of view properties [2]
>>>> Is that correct?
>>>>
>>>> On the other hand, I was talking about adding this metadata as actual
>>>> fields, as is described in the Draft Spec of the Design Doc
>>>> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A
>>>>  and
>>>> first PR https://github.com/apache/iceberg/issues/6420 .
>>>>
>>>> Do you mean, the vote means we cannot model new fields like
>>>> 'materialization' and 'lineage' as was proposed there ?    If that is the
>>>> interpretation, I am not sure I agree.  I dont fully see how new fields
>>>> adds more catalog implementation complexity over new key/value properties?
>>>> To me, the vote seemed to just rule out using a combined catalog object
>>>> (MaterializedView) in favor of re-using the Table and View metadata models,
>>>> not to prevent change to the Table and View model.
>>>>
>>>> Thanks
>>>> Szehon
>>>>
>>>>
>>>> On Thu, May 9, 2024 at 10:05 PM Walaa Eldin Moustafa <
>>>> wa.moust...@gmail.com> wrote:
>>>>
>>>>> Hi Szehon,
>>>>>
>>>>> I think choosing separate view + table objects precludes us from
>>>>> adding new metadata to table and view metadata. Here is one relevant
>>>>> comment [1] from Ryan on the modeling doc, where his point is that we want
>>>>> to avoid introducing new APIs since it requires updating every catalog, 
>>>>> and
>>>>> (quoting) even now, we have few implementations that support views because
>>>>> of the problems updating back ends. Therefore, one of the major reasons to
>>>>> avoid a new model with new metadata is to avoid adding new metadata, which
>>>>> introduces this complexity. Here is another similar comment from Renjie 
>>>>> [2]
>>>>> on the cons listed for the combined object approach.
>>>>>
>>>>> Even Ryan's point on the MV issue that you referenced reads to me as
>>>>> he is supportive of the property model. Here are some quotes:
>>>>>
>>>>> > We would still want some MV metadata in table *properties*.
>>>>>
>>>>> > I recommend instead reusing the existing snapshot metadata structure
>>>>> to store what you need as snapshot *properties*.
>>>>>
>>>>> > First, I think we want to avoid keeping much state information in
>>>>> complex table *properties*.
>>>>>
>>>>> Again, here, he is supportive of table properties, but wants to make
>>>>> sure that the information is simple.
>>>>>
>>>>> > We may want additional metadata as well, like a UUID to ensure we
>>>>> have the right view. I don't think we have a UUID in the view spec yet, 
>>>>> but
>>>>> we could add one.
>>>>>
>>>>> Here, he is very specific when it comes to new metadata fields, and
>>>>> explicitly calls it out. That is the only new metadata field in that reply
>>>>> and by now it is already supported. It is also not MV-specific.
>>>>>
>>>>> Hope this addresses your question on the property vs new metadata
>>>>> model.
>>>>>
>>>>> [1]
>>>>> https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1&disco=AAABK7e3QB4
>>>>> [2]
>>>>> https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1&disco=AAABIonvCGE
>>>>>
>>>>> Thanks,
>>>>> Walaa.
>>>>>
>>>>>
>>>>> On Thu, May 9, 2024 at 5:49 PM Szehon Ho <szehon.apa...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Walaa,
>>>>>>
>>>>>> I agree, I definitely do not want yet another pr/doc where discussion
>>>>>> happens. as its already quite spread out :)  But did not want to clarify
>>>>>> some points before we get started on the discussion on your PR.
>>>>>>
>>>>>> With reusing the table and view objects, we are not changing the
>>>>>>> existing metadata of either table or view spec but rather introduce new
>>>>>>> properties and formalize them to express materialized views
>>>>>>>
>>>>>>
>>>>>> On this point, I am not 100% sure that choosing to represent a
>>>>>> MaterializedView as a separate View + Table object precludes us from 
>>>>>> adding
>>>>>> to metadata of Table or View as the Draft Spec suggested, though.  I 
>>>>>> think
>>>>>> this point was discussed in Jan's initial PR with a good point from Ryan:
>>>>>> https://github.com/apache/iceberg/issues/6420#issuecomment-1369280546 
>>>>>> that
>>>>>> using Table Properties to track lineage is fairly brittle, and having it
>>>>>> formalized in the Iceberg metadata is cleaner, and that was thus the
>>>>>> direction of the Draft Spec in the design doc.  What do people think?
>>>>>>
>>>>>> Thanks
>>>>>> Szehon
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, May 9, 2024 at 5:35 PM Walaa Eldin Moustafa <
>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks Szehon.
>>>>>>>
>>>>>>> The reason for the difference is that the proposal in the Google doc
>>>>>>> is based on a new MV model, hence, new metadata fields and a new 
>>>>>>> metadata
>>>>>>> model were being introduced (with types, optionality, etc). With reusing
>>>>>>> the table and view objects, we are not changing the existing metadata of
>>>>>>> either table or view spec but rather introduce new properties and 
>>>>>>> formalize
>>>>>>> them to express materialized views. This would be the answer to most of 
>>>>>>> the
>>>>>>> questions you posted on the PR (besides some naming questions, which I
>>>>>>> think should be straightforward).
>>>>>>>
>>>>>>> With that fundamental difference, we cannot lift and shift what is
>>>>>>> in the doc to any PR. Further, having consensus on separate table and 
>>>>>>> view
>>>>>>> objects contradicts with the point being made on having consensus on the
>>>>>>> doc. We might have had agreements on some elements, but definitely not 
>>>>>>> on
>>>>>>> the whole doc, proven by the follow ups (also as a community, not
>>>>>>> individuals).
>>>>>>>
>>>>>>> Therefore: we need a new space to discuss the separate table and
>>>>>>> view properties.
>>>>>>>
>>>>>>> Is the question whether to:
>>>>>>> 1- Create a new doc
>>>>>>> 2- Create a new PR?
>>>>>>>
>>>>>>> I feel a PR is the most effective way, especially given the fact
>>>>>>> that we discussed the topic a lot by now. If we agree, we can continue 
>>>>>>> the
>>>>>>> discussion on the PR, else, we can create a doc.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Walaa.
>>>>>>>
>>>>>>>
>>>>>>> On Thu, May 9, 2024 at 4:39 PM Szehon Ho <szehon.apa...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks Walaa for driving it forward, looking forward to thinking
>>>>>>>> about implementation of Materialized Views.
>>>>>>>>
>>>>>>>> I see Jan's point, the PR spec change is similar but does not seem
>>>>>>>> to be completely aligned with the Draft Spec in the design doc:
>>>>>>>> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/
>>>>>>>> .  I left my comments on PR of those sections with the links to the
>>>>>>>> difference.  I think most of those Draft Spec proposal is still 
>>>>>>>> applicable
>>>>>>>> after the decision to have separate Table and View objects  It will be
>>>>>>>> interesting to at least see drill a bit further why we did not choose 
>>>>>>>> the
>>>>>>>> approach in the Draft Spec and chose another way.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Szehon
>>>>>>>>
>>>>>>>> On Wed, May 8, 2024 at 4:48 AM Jan Kaul <jank...@mailbox.org.invalid>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Well, everybody that actively contributed to the discussion on the
>>>>>>>>> original google doc was in consensus. That's why I brought up the 
>>>>>>>>> topic at
>>>>>>>>> the Community Sync on the 2024-02-14 (
>>>>>>>>> https://youtu.be/uAQVGd5zV4I?t=890) to raise the awareness of the
>>>>>>>>> broader community. After which the discussion about the storage model
>>>>>>>>> started. I don't think that the discussion about a single aspect of a
>>>>>>>>> proposal should invalidate all other aspects of the proposal.
>>>>>>>>>
>>>>>>>>> Regardless, the state of the proposal from the original google doc
>>>>>>>>> contains a lot of valuable contributions from Micah, Szehon, Jack, 
>>>>>>>>> Dan,
>>>>>>>>> yourself and others and it should at least provide the basis for any
>>>>>>>>> further discussion. I don't think it's effective to start with a 
>>>>>>>>> completely
>>>>>>>>> different design because we are bound to have the same discussions 
>>>>>>>>> all over
>>>>>>>>> again.
>>>>>>>>>
>>>>>>>>> Thanks, Jan
>>>>>>>>> On 08.05.24 12:11, Walaa Eldin Moustafa wrote:
>>>>>>>>>
>>>>>>>>> The only consensus the community had was on the object model
>>>>>>>>> through the most recent voting thread [1]. This kind of consensus was 
>>>>>>>>> not
>>>>>>>>> present during the doc discussions, and this should be evident from 
>>>>>>>>> the
>>>>>>>>> fact the last doc state listed 5 alternatives with no particular
>>>>>>>>> conclusion. I am not quite sure what type of consensus we are 
>>>>>>>>> referring to
>>>>>>>>> here given all the follow up discussions, alternatives, etc.
>>>>>>>>>
>>>>>>>>> Due to the separate object model, the PR is fundamentally
>>>>>>>>> different from the doc in the sense it does not propose a new metadata
>>>>>>>>> model but rather formalizes some new table and view properties 
>>>>>>>>> related to
>>>>>>>>> MVs. That is also one reason there are no repeated discussions. That 
>>>>>>>>> said,
>>>>>>>>> if you feel there is a repeated discussion (which I do not see so 
>>>>>>>>> far), it
>>>>>>>>> would be best to link the relevant discussion from the doc in a 
>>>>>>>>> comment.
>>>>>>>>>
>>>>>>>>> Happy to move the discussion elsewhere if there is
>>>>>>>>> sufficient support for this idea, but as things stand, I do not see 
>>>>>>>>> this as
>>>>>>>>> an efficient way to make progress. It sounds we have been 
>>>>>>>>> re-emphasizing
>>>>>>>>> the same points in the last two replies, so I will let others chime 
>>>>>>>>> in at
>>>>>>>>> this point.
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://lists.apache.org/thread/rotmqzmwk5jrcsyxhzjhrvcjs5v3yjcc
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Walaa.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, May 8, 2024 at 2:31 AM Jan Kaul
>>>>>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> The original google doc
>>>>>>>>>> <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?usp=sharing>
>>>>>>>>>> discussed multiple aspects of the Materialized View spec. One was the
>>>>>>>>>> storage model while others were related to the metadata. After we 
>>>>>>>>>> (Micah,
>>>>>>>>>> Szehon, you, me) reached consensus in the google doc, Jack raised his
>>>>>>>>>> concern about the storage model and the long discussion about the 
>>>>>>>>>> storage
>>>>>>>>>> model started. Now we truly reached consensus about the storage 
>>>>>>>>>> model,
>>>>>>>>>> which is now also reflected in the google doc. All other aspects 
>>>>>>>>>> from the
>>>>>>>>>> google doc about the metadata weren't questioned and still represent 
>>>>>>>>>> the
>>>>>>>>>> consensus.
>>>>>>>>>>
>>>>>>>>>> I would like to *avoid repeating the discussions* in your PR
>>>>>>>>>> that we already had in the google doc. Especially since we reached
>>>>>>>>>> consensus which took a considerable amount of time.
>>>>>>>>>>
>>>>>>>>>> Thanks, Jan
>>>>>>>>>> On 08.05.24 10:21, Walaa Eldin Moustafa wrote:
>>>>>>>>>>
>>>>>>>>>> Thanks Jan. I think we moved on to more alignment steps beyond
>>>>>>>>>> that doc a while ago. After that doc, we have discussed the topic 
>>>>>>>>>> further
>>>>>>>>>> in 2 dev list threads and one more doc
>>>>>>>>>> <https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1>
>>>>>>>>>> (with strictly two options for the storage model to consider). 
>>>>>>>>>> Moreover,
>>>>>>>>>> the original doc grew to 14 pages long with one section comparing 5 
>>>>>>>>>> design
>>>>>>>>>> alternatives, which made things harder to reach consensus. The lack 
>>>>>>>>>> of
>>>>>>>>>> consensus is what partly led up to the subsequent discussions and 
>>>>>>>>>> call for
>>>>>>>>>> a more focused approach to reach consensus. If we already have a 
>>>>>>>>>> consensus
>>>>>>>>>> on the storage model (separate tables and views), I think we should 
>>>>>>>>>> take
>>>>>>>>>> things further and have continued focused discussions on the specific
>>>>>>>>>> metadata in the form of a PR. I have included all previous 
>>>>>>>>>> discussions
>>>>>>>>>> including the original doc and issue as references in the PR 
>>>>>>>>>> description.
>>>>>>>>>> Please let me know if this works. Happy to hear others' thoughts on 
>>>>>>>>>> the
>>>>>>>>>> best way to move forward.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Walaa.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, May 8, 2024 at 12:56 AM Jan Kaul
>>>>>>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks Walaa for trying to move things along. However I don't
>>>>>>>>>>> think it's a good idea to start a separate discussion about the 
>>>>>>>>>>> metadata
>>>>>>>>>>> for materialized views because we already had this discussion and 
>>>>>>>>>>> reached
>>>>>>>>>>> consensus in this google doc:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?usp=sharing
>>>>>>>>>>>
>>>>>>>>>>> Once the draft is finalized we can adopt the PR to reflect the
>>>>>>>>>>> consensus from the google doc.
>>>>>>>>>>>
>>>>>>>>>>> Best wishes,
>>>>>>>>>>>
>>>>>>>>>>> Jan
>>>>>>>>>>> On 07.05.24 19:11, Walaa Eldin Moustafa wrote:
>>>>>>>>>>>
>>>>>>>>>>> Thanks Steven. I feel it is needed so the MV spec is not
>>>>>>>>>>> scattered across the table and view spec pages. We may add a 
>>>>>>>>>>> reference in
>>>>>>>>>>> each respective properties section.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, May 7, 2024 at 10:04 AM Steven Wu <stevenz...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Walaa, thanks for initiating the next step.
>>>>>>>>>>>>
>>>>>>>>>>>> With the agreed model of separate view and storage table, I am
>>>>>>>>>>>> wondering if a separate materialized view spec page is needed. 
>>>>>>>>>>>> E.g., the
>>>>>>>>>>>> new view metadata (view-materialized and view-storage-table) is 
>>>>>>>>>>>> probably
>>>>>>>>>>>> good to be added to the view page directly to avoid information 
>>>>>>>>>>>> scattering.
>>>>>>>>>>>> The same can be said about the storage table metadata.
>>>>>>>>>>>>
>>>>>>>>>>>> We may keep the separate materialized view page to document
>>>>>>>>>>>> motivation, freshness semantics, etc..
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, May 6, 2024 at 10:58 PM Walaa Eldin Moustafa <
>>>>>>>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Everyone,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks again for participating in the modeling discussion [1].
>>>>>>>>>>>>> Since the outcome of this discussion was to model materialized 
>>>>>>>>>>>>> views as
>>>>>>>>>>>>> separate objects, an Iceberg view and a table, I think the next 
>>>>>>>>>>>>> step
>>>>>>>>>>>>> should be discussing the metadata details for each object. I have 
>>>>>>>>>>>>> created a
>>>>>>>>>>>>> PR https://github.com/apache/iceberg/pull/10280 with an
>>>>>>>>>>>>> initial spec improvement. Please feel free to review it and leave 
>>>>>>>>>>>>> feedback
>>>>>>>>>>>>> there.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1]
>>>>>>>>>>>>> https://lists.apache.org/thread/rotmqzmwk5jrcsyxhzjhrvcjs5v3yjcc
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Walaa.
>>>>>>>>>>>>>
>>>>>>>>>>>>>

-- 
John Zhuge

Reply via email to