Re: Materialized Views: Next Steps

Benny Chow Thu, 16 May 2024 15:22:48 -0700

Hi Walaa

I left comments in your spec PR:
https://github.com/apache/iceberg/pull/10280#pullrequestreview-2061922169
My last question about use cases was really about incremental refresh with
aggregates.  But I think this might be too complicated to try to
model/discuss now and so I agree with Micah's comment about doing it in a
future iteration.


Hi Jan,

Regarding storing the identifiers, I like the idea too.  Dremio's query
engine supports MVs on sources besides Iceberg tables.  Here's everything
that's in a single lineage entry:
https://github.com/dremio/dremio-oss/blob/master/services/accelerator/src/main/protobuf/reflection.proto#L80
The lineage is stored as a graph and not a list of entries.  I think for
what we are trying to achieve, it's more practical to limit the lineage to
Iceberg sources.
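
To illustrate what a lineage limited to Iceberg sources might look like, here
is a minimal sketch. It is not Dremio's, Trino's, or the proposed spec's
actual format: the SourceTableState type, the is_fresh helper, and the table
names are all hypothetical. It assumes each lineage entry records a source
table identifier plus the snapshot ID the materialization was built from, and
that a client can look up each source table's current snapshot ID without
parsing the MV's SQL.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class SourceTableState:
    """Hypothetical lineage entry: one Iceberg source table of the MV."""
    identifier: str   # assumed fully qualified table name, e.g. "db.orders"
    snapshot_id: int  # snapshot of the source table at the last refresh


def is_fresh(lineage: Iterable[SourceTableState],
             current_snapshot_id: Callable[[str], int]) -> bool:
    """The MV is fresh only if every source table is still at the recorded snapshot."""
    return all(current_snapshot_id(entry.identifier) == entry.snapshot_id
               for entry in lineage)


# Hypothetical state: the MV was refreshed when db.customers was at snapshot 7,
# but the table has since moved to snapshot 8.
lineage = [SourceTableState("db.orders", 101), SourceTableState("db.customers", 7)]
current = {"db.orders": 101, "db.customers": 8}

print(is_fresh(lineage, current.__getitem__))  # False
```

The point of the sketch is only that the whole lineage travels as one
structured value, so the staleness check is a single pass over it.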

Thanks
Benny



On Wed, May 15, 2024 at 12:06 AM Jan Kaul <jank...@mailbox.org.invalid>
wrote:

> I agree with Szehon and Benny that storing the lineage information as
> multiple table properties is too brittle, especially for many source tables
> (base tables). I would prefer to have the whole lineage information in one
> entry as it is more concise. This is also how Trino has been doing it, as
> you can see here
> <https://github.com/trinodb/trino/blob/212455d3e1d393f58cbc395d2b9da47ed8f23dd8/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java#L2915>
> .
>
> As we've discussed in the google doc
> <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit#heading=h.60qmzug7bzxc>:
> it is helpful to also store the table identifiers of the source tables so
> that clients that don't understand the SQL dialect of the MV definition,
> like other query engines, BI tools and Dataframe libraries, can determine
> the freshness of the MV. This is also how Trino is doing it. That's why we
> chose the design in the google doc.
>
> Storing the storage table identifier as a property might work.
>
> Thanks, Jan
> On 15.05.24 02:38, Walaa Eldin Moustafa wrote:
>
> Thanks Benny. My specific thoughts about the spec and the properties are
> captured in the spec PR https://github.com/apache/iceberg/pull/10280. The
> spec is also implemented in the Spark implementation PR
> https://github.com/apache/iceberg/pull/9830, and I believe this follows
> the same nature of how the information was captured in Netflix's
> implementation with Spark, and Trino implementation (prior to formalizing
> through that spec), both of which have been used reliably for years. I
> think it also aligns with Ryan's feedback here
> https://github.com/apache/iceberg/issues/6420#issuecomment-1369280546 which
> indicated the usage of properties.
>
> The reasons for choosing properties:
> * Not every table is a storage table and not every view is a materialized
> view. I feel exposing the info as top level metadata is an overkill for the
> original object type.
> * The properties are simple. Each contains either a single snapshot ID,
> a single view version, or the storage table identifier.
> Engines can use them without issues (also as shown in the implementation).
> * To be meaningful, the metadata fields should be captured in the engine
> API as well, which is an effort that has to be pursued outside the Iceberg
> community. Until engine APIs evolve, we would have to define a mapping
> between Iceberg metadata fields and engine properties (only current place
> in engine side to capture this info). This requires an additional spec on
> its own, and it will introduce complexities. Hence it is always cleaner to
> map Iceberg properties to engine properties and Iceberg metadata to
> designated engine APIs. Mixing and matching (e.g., Iceberg metadata fields
> as engine properties) just for the lack of other cleaner options does not
> sound like a good idea in both short and long term.
>
> Let me know your thoughts.
>
> Thanks,
> Walaa.
>
>
>
> On Tue, May 14, 2024 at 5:12 PM Benny Chow <btc...@gmail.com> wrote:
>
>> I agree with Szehon here.  I think storing the materialization lineage as
>> a bunch of properties is brittle.  This lineage information is needed by
>> engines to validate the staleness of a materialization and also to perform
>> full or incremental refreshes.  There’s a lot to capture here.
>>
>> Maybe we should drill down into the use cases first - such as incremental
>> refresh with aggregates?  (Pick a harder one first 😀)
>>
>> I left a few comments about this in the doc.  I wonder what are your
>> thoughts here Walaa?
>>
>> Thanks
>>
>> On May 14, 2024, at 4:20 PM, Walaa Eldin Moustafa <wa.moust...@gmail.com>
>> wrote:
>>
>> 
>> Thanks John. The current metadata does not sound complex. We need to
>> track the underlying table snapshot IDs as well as the view version ID. I
>> agree that, as long as it is simple and until this feature fully matures, we
>> should track it in properties.
>>
>> One important factor for me (apart from the API effort, especially on the
>> engine side), is that not every table is an MV storage table. Surfacing
>> MV-specific storage table properties as first class table metadata seems
>> to impose this metadata on every table, when it is not required for normal
>> table operation (yes, they can be optional, but it does not sound like the
>> use case warrants exposing them as metadata fields yet).
>>
>> Similarly, since not every view is a materialized view, it sounds
>> reasonable to keep MV-specific data in properties.
>>
>> UUID (for views), on the other hand, is common to all views, hence it
>> made sense to add it as a top level field.
>>
>> Thanks,
>> Walaa.
>>
>>
>> On Tue, May 14, 2024 at 1:01 PM John Zhuge <jzh...@apache.org> wrote:
>>
>>> Hi Szehon,
>>>
>>> While I fully share your concern of abusing table properties, we took
>>> the approach of option 1 and have run it in production for several years:
>>>
>>>    - the feature was still evolving
>>>    - quick and simple implementation
>>>    - table properties are simple enough and not confusing
>>>    - haven't seen an urgent need to convert the properties to metadata
>>>    fields and add API
>>>    - do not wish on-disk changes (requiring lengthy tedious migration)
>>>
>>>
>>> That said, I am open to codifying the mv metadata into api and spec,
>>> with the following considerations
>>>
>>>    - mv metadata has reached maturity and consensus (could be just a
>>>    core portion)
>>>    - when mv metadata becomes too complex
>>>    - wish to support use cases that are quicker to adopt API changes
>>>    (than engines), e.g., using Iceberg library to manipulate MVs, or parsing
>>>    metadata files directly
>>>    - Spark view catalog API can evolve separately from Iceberg API and
>>>    spec changes
>>>
>>>
>>> Thanks all for the great discussion!
>>>
>>> On Fri, May 10, 2024 at 10:48 PM Walaa Eldin Moustafa <
>>> wa.moust...@gmail.com> wrote:
>>>
>>>> Hi Szehon,
>>>>
>>>> Thanks for the follow-up. It is possible some of the concerns were
>>>> referring to the backend catalogs, but it is all connected. My main
>>>> personal concern is from the engine connector APIs point of view, but I
>>>> share the concern about the catalogs.
>>>>
>>>> I think everyone's concern is not about the complexity *per* backend
>>>> catalog/engine catalog API (in which case adding new metadata to
>>>> existing objects could require less code), but rather about the
>>>> *number* of APIs and implementations that need to change (in
>>>> which case both new metadata to existing objects and new objects altogether
>>>> introduce equal complexity).
>>>>
>>>> Thanks,
>>>> Walaa.
>>>>
>>>>
>>>> On Fri, May 10, 2024 at 10:31 AM Szehon Ho <szehon.apa...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Walaa
>>>>>
>>>>> OK thanks for confirming.  I am still not 100% in agreement, my
>>>>> understanding of the rationale for separate Table/View objects in the
>>>>> comment that you linked:
>>>>>
>>>>> I think the biggest problem with this is that we would need to modify
>>>>>> every catalog to support this combination and that would be really
>>>>>> difficult.
>>>>>
>>>>>
>>>>> is about Java Catalogs/REST Catalog needing to support creating,
>>>>> persisting, and loading a MaterializedView object, which is much more
>>>>> complex.  See the HiveView PR for example:
>>>>> https://github.com/apache/iceberg/pull/9852   We would have to do the
>>>>> same exercise for persisting MV.
>>>>>
>>>>> In our case though, there's not much complexity regardless of approach
>>>>> ('properties' or new metadata fields), in terms of Java Catalog/REST
>>>>> Catalog.  It's mostly pass-through to storage.  Looks like you are
>>>>> referring to Spark's View model in terms of complexity, which may be a
>>>>> different story, but not sure if it is a good rationale to make Iceberg
>>>>> use 'properties'.
>>>>>
>>>>> 'properties' is for read/write configurations, not for saving
>>>>> metadata.  To me, it's also brittle for saving important metadata, as it's
>>>>> not in the defined schema.
>>>>>
>>>>> A string to string map of table properties. This is used to control
>>>>>> settings that affect reading and writing and is not intended to be used
>>>>>> for arbitrary metadata.  For example, commit.retry.num-retries is used
>>>>>> to control the number of commit retries.
>>>>>
>>>>>
>>>>> On the other hand, the Draft Spec suggests saving `lineage` as a
>>>>> modeled field on the Storage Table's snapshot metadata.  This allows you
>>>>> to 'time travel', 'branch', and have this metadata life cycle integrated
>>>>> via normal snapshot lifecycle operations.
>>>>>
>>>>> So that's my rationale.  Not sure if we can come to an agreement over
>>>>> email though, and may need others to chime in as well.
>>>>>
>>>>> Thanks
>>>>> Szehon
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, May 9, 2024 at 11:58 PM Walaa Eldin Moustafa <
>>>>> wa.moust...@gmail.com> wrote:
>>>>>
>>>>>> Hi Szehon,
>>>>>>
>>>>>> Yes, you are reading the PR correctly, and interpreting the meaning
>>>>>> of properties correctly. I think the reply you pasted from Ryan refers to
>>>>>> the same concept as well.
>>>>>>
>>>>>> For the initial Google doc and the issue (by the way it is an issue,
>>>>>> not a PR), yes both are proposing new metadata fields.
>>>>>>
>>>>>> The references I made to the modeling doc [1, 2] are reasons why new
>>>>>> APIs are not desired. The cons/concerns applicable to new MV metadata
>>>>>> apply by extension to new table and view metadata fields.
>>>>>>
>>>>>> The reason why new metadata adds complexity is that this new metadata
>>>>>> needs to be propagated to the engine API. For example, here is the
>>>>>> ViewInfo [3] class in the Spark catalog, which is used in view methods
>>>>>> like createView. Its fields correspond with the Iceberg metadata. Adding
>>>>>> new Iceberg fields should be accompanied with new fields in the engine
>>>>>> catalog/connector APIs, which was a major reason for rejecting the
>>>>>> combined MV object model as well.
>>>>>>
>>>>>> [1]
>>>>>> https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1&disco=AAABK7e3QB4
>>>>>> [2]
>>>>>> https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1&disco=AAABIonvCGE
>>>>>> [3]
>>>>>> https://github.com/apache/spark/blob/2df494fd4e4e64b9357307fb0c5e8fc1b7491ac3/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/ViewInfo.java#L45
>>>>>>
>>>>>> Thanks,
>>>>>> Walaa.
>>>>>>
>>>>>> On Thu, May 9, 2024 at 11:30 PM Szehon Ho <szehon.apa...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Walaa
>>>>>>>
>>>>>>> As there may be confusion in the word 'properties', I want to double
>>>>>>> check if we are talking about the same thing here.
>>>>>>>
>>>>>>> I am reading your PR as adding lineage metadata as new key/value
>>>>>>> pair under the storage Table's 'properties' field:
>>>>>>> https://github.com/apache/iceberg/blob/main/format/spec.md?plain=1#L677
>>>>>>>
>>>>>>> *optional* *optional* *properties* A string to string map of table
>>>>>>> properties. This is used to control settings that affect reading and
>>>>>>> writing and is not intended to be used for arbitrary metadata. For
>>>>>>> example, commit.retry.num-retries is used to control the number of
>>>>>>> commit retries.
>>>>>>>
>>>>>>> and adding Storage Table pointer as key/value pair in the View's
>>>>>>> 'properties' field:
>>>>>>> https://github.com/apache/iceberg/blob/main/format/view-spec.md?plain=1#L65
>>>>>>>
>>>>>>> *optional* properties A string to string map of view properties [2]
>>>>>>>
>>>>>>> Is that correct?
>>>>>>>
>>>>>>> On the other hand, I was talking about adding this metadata as
>>>>>>> actual fields, as is described in the Draft Spec of the Design Doc
>>>>>>> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A
>>>>>>>  and
>>>>>>> first PR https://github.com/apache/iceberg/issues/6420 .
>>>>>>>
>>>>>>> Do you mean that the vote means we cannot model new fields like
>>>>>>> 'materialization' and 'lineage' as was proposed there?  If that is the
>>>>>>> interpretation, I am not sure I agree.  I don't fully see how new fields
>>>>>>> add more catalog implementation complexity than new key/value
>>>>>>> properties.  To me, the vote seemed to just rule out using a combined
>>>>>>> catalog object (MaterializedView) in favor of re-using the Table and View
>>>>>>> metadata models, not to prevent changes to the Table and View model.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Szehon
>>>>>>>
>>>>>>>
>>>>>>> On Thu, May 9, 2024 at 10:05 PM Walaa Eldin Moustafa <
>>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Szehon,
>>>>>>>>
>>>>>>>> I think choosing separate view + table objects precludes us from
>>>>>>>> adding new metadata to table and view metadata. Here is one relevant
>>>>>>>> comment [1] from Ryan on the modeling doc, where his point is that we
>>>>>>>> want to avoid introducing new APIs since it requires updating every
>>>>>>>> catalog, and (quoting) even now, we have few implementations that
>>>>>>>> support views because of the problems updating back ends. Therefore, one
>>>>>>>> of the major reasons to avoid a new model with new metadata is to avoid
>>>>>>>> adding new metadata, which introduces this complexity. Here is another
>>>>>>>> similar comment from Renjie [2] on the cons listed for the combined
>>>>>>>> object approach.
>>>>>>>>
>>>>>>>> Even Ryan's point on the MV issue that you referenced reads to me
>>>>>>>> as he is supportive of the property model. Here are some quotes:
>>>>>>>>
>>>>>>>> > We would still want some MV metadata in table *properties*.
>>>>>>>>
>>>>>>>> > I recommend instead reusing the existing snapshot metadata
>>>>>>>> structure to store what you need as snapshot *properties*.
>>>>>>>>
>>>>>>>> > First, I think we want to avoid keeping much state information in
>>>>>>>> complex table *properties*.
>>>>>>>>
>>>>>>>> Again, here, he is supportive of table properties, but wants to
>>>>>>>> make sure that the information is simple.
>>>>>>>>
>>>>>>>> > We may want additional metadata as well, like a UUID to ensure we
>>>>>>>> have the right view. I don't think we have a UUID in the view spec
>>>>>>>> yet, but we could add one.
>>>>>>>>
>>>>>>>> Here, he is very specific when it comes to new metadata fields, and
>>>>>>>> explicitly calls it out. That is the only new metadata field in that
>>>>>>>> reply and by now it is already supported. It is also not MV-specific.
>>>>>>>>
>>>>>>>> Hope this addresses your question on the property vs new metadata
>>>>>>>> model.
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1&disco=AAABK7e3QB4
>>>>>>>> [2]
>>>>>>>> https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1&disco=AAABIonvCGE
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Walaa.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, May 9, 2024 at 5:49 PM Szehon Ho <szehon.apa...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Walaa,
>>>>>>>>>
>>>>>>>>> I agree, I definitely do not want yet another PR/doc where
>>>>>>>>> discussion happens, as it's already quite spread out :)  But I did
>>>>>>>>> want to clarify some points before we get started on the discussion on
>>>>>>>>> your PR.
>>>>>>>>>
>>>>>>>>> With reusing the table and view objects, we are not changing the
>>>>>>>>>> existing metadata of either table or view spec but rather introduce
>>>>>>>>>> new properties and formalize them to express materialized views
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On this point, I am not 100% sure that choosing to represent a
>>>>>>>>> MaterializedView as a separate View + Table object precludes us from
>>>>>>>>> adding to metadata of Table or View as the Draft Spec suggested, though.
>>>>>>>>> I think this point was discussed in Jan's initial PR with a good point
>>>>>>>>> from Ryan:
>>>>>>>>> https://github.com/apache/iceberg/issues/6420#issuecomment-1369280546
>>>>>>>>> that using Table Properties to track lineage is fairly brittle, and
>>>>>>>>> having it formalized in the Iceberg metadata is cleaner, and that was
>>>>>>>>> thus the direction of the Draft Spec in the design doc.  What do people
>>>>>>>>> think?
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Szehon
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, May 9, 2024 at 5:35 PM Walaa Eldin Moustafa <
>>>>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Szehon.
>>>>>>>>>>
>>>>>>>>>> The reason for the difference is that the proposal in the Google
>>>>>>>>>> doc is based on a new MV model, hence, new metadata fields and a new
>>>>>>>>>> metadata model were being introduced (with types, optionality, etc).
>>>>>>>>>> With reusing the table and view objects, we are not changing the
>>>>>>>>>> existing metadata of either table or view spec but rather introduce new
>>>>>>>>>> properties and formalize them to express materialized views. This would
>>>>>>>>>> be the answer to most of the questions you posted on the PR (besides
>>>>>>>>>> some naming questions, which I think should be straightforward).
>>>>>>>>>>
>>>>>>>>>> With that fundamental difference, we cannot lift and shift what
>>>>>>>>>> is in the doc to any PR. Further, having consensus on separate table
>>>>>>>>>> and view objects contradicts the point being made on having
>>>>>>>>>> consensus on the doc. We might have had agreements on some elements,
>>>>>>>>>> but definitely not on the whole doc, proven by the follow ups (also as
>>>>>>>>>> a community, not individuals).
>>>>>>>>>>
>>>>>>>>>> Therefore: we need a new space to discuss the separate table and
>>>>>>>>>> view properties.
>>>>>>>>>>
>>>>>>>>>> Is the question whether to:
>>>>>>>>>> 1- Create a new doc
>>>>>>>>>> 2- Create a new PR?
>>>>>>>>>>
>>>>>>>>>> I feel a PR is the most effective way, especially given the fact
>>>>>>>>>> that we discussed the topic a lot by now. If we agree, we can
>>>>>>>>>> continue the discussion on the PR, else, we can create a doc.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Walaa.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, May 9, 2024 at 4:39 PM Szehon Ho <szehon.apa...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks Walaa for driving it forward, looking forward to thinking
>>>>>>>>>>> about implementation of Materialized Views.
>>>>>>>>>>>
>>>>>>>>>>> I see Jan's point, the PR spec change is similar but does not
>>>>>>>>>>> seem to be completely aligned with the Draft Spec in the design doc:
>>>>>>>>>>> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/
>>>>>>>>>>> .  I left my comments on the PR on those sections with links to the
>>>>>>>>>>> differences.  I think most of the Draft Spec proposal is still
>>>>>>>>>>> applicable after the decision to have separate Table and View objects.
>>>>>>>>>>> It will be interesting to at least drill a bit further into why we
>>>>>>>>>>> did not choose the approach in the Draft Spec and chose another way.
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> Szehon
>>>>>>>>>>>
>>>>>>>>>>> On Wed, May 8, 2024 at 4:48 AM Jan Kaul
>>>>>>>>>>> <jank...@mailbox.org.invalid>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Well, everybody that actively contributed to the discussion on
>>>>>>>>>>>> the original google doc was in consensus. That's why I brought up the
>>>>>>>>>>>> topic at the Community Sync on the 2024-02-14 (
>>>>>>>>>>>> https://youtu.be/uAQVGd5zV4I?t=890) to raise the awareness of
>>>>>>>>>>>> the broader community. After that, the discussion about the storage
>>>>>>>>>>>> model started. I don't think that the discussion about a single aspect
>>>>>>>>>>>> of a proposal should invalidate all other aspects of the proposal.
>>>>>>>>>>>>
>>>>>>>>>>>> Regardless, the state of the proposal from the original google
>>>>>>>>>>>> doc contains a lot of valuable contributions from Micah, Szehon, Jack,
>>>>>>>>>>>> Dan, yourself and others and it should at least provide the basis for
>>>>>>>>>>>> any further discussion. I don't think it's effective to start with a
>>>>>>>>>>>> completely different design because we are bound to have the same
>>>>>>>>>>>> discussions all over again.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks, Jan
>>>>>>>>>>>> On 08.05.24 12:11, Walaa Eldin Moustafa wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> The only consensus the community had was on the object model
>>>>>>>>>>>> through the most recent voting thread [1]. This kind of consensus was
>>>>>>>>>>>> not present during the doc discussions, and this should be evident from
>>>>>>>>>>>> the fact the last doc state listed 5 alternatives with no particular
>>>>>>>>>>>> conclusion. I am not quite sure what type of consensus we are referring
>>>>>>>>>>>> to here given all the follow up discussions, alternatives, etc.
>>>>>>>>>>>>
>>>>>>>>>>>> Due to the separate object model, the PR is fundamentally
>>>>>>>>>>>> different from the doc in the sense it does not propose a new metadata
>>>>>>>>>>>> model but rather formalizes some new table and view properties related
>>>>>>>>>>>> to MVs. That is also one reason there are no repeated discussions. That
>>>>>>>>>>>> said, if you feel there is a repeated discussion (which I do not see so
>>>>>>>>>>>> far), it would be best to link the relevant discussion from the doc in
>>>>>>>>>>>> a comment.
>>>>>>>>>>>>
>>>>>>>>>>>> Happy to move the discussion elsewhere if there is
>>>>>>>>>>>> sufficient support for this idea, but as things stand, I do not see
>>>>>>>>>>>> this as an efficient way to make progress. It sounds like we have been
>>>>>>>>>>>> re-emphasizing the same points in the last two replies, so I will let
>>>>>>>>>>>> others chime in at this point.
>>>>>>>>>>>>
>>>>>>>>>>>> [1]
>>>>>>>>>>>> https://lists.apache.org/thread/rotmqzmwk5jrcsyxhzjhrvcjs5v3yjcc
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Walaa.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, May 8, 2024 at 2:31 AM Jan Kaul
>>>>>>>>>>>> <jank...@mailbox.org.invalid>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> The original google doc
>>>>>>>>>>>>> <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?usp=sharing>
>>>>>>>>>>>>> discussed multiple aspects of the Materialized View spec. One was the
>>>>>>>>>>>>> storage model while others were related to the metadata. After we
>>>>>>>>>>>>> (Micah, Szehon, you, me) reached consensus in the google doc, Jack
>>>>>>>>>>>>> raised his concern about the storage model and the long discussion
>>>>>>>>>>>>> about the storage model started. Now we truly reached consensus about
>>>>>>>>>>>>> the storage model, which is now also reflected in the google doc. All
>>>>>>>>>>>>> other aspects from the google doc about the metadata weren't questioned
>>>>>>>>>>>>> and still represent the consensus.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would like to *avoid repeating the discussions* in your PR
>>>>>>>>>>>>> that we already had in the google doc. Especially since we reached
>>>>>>>>>>>>> consensus which took a considerable amount of time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks, Jan
>>>>>>>>>>>>> On 08.05.24 10:21, Walaa Eldin Moustafa wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks Jan. I think we moved on to more alignment steps beyond
>>>>>>>>>>>>> that doc a while ago. After that doc, we have discussed the topic
>>>>>>>>>>>>> further in 2 dev list threads and one more doc
>>>>>>>>>>>>> <https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1>
>>>>>>>>>>>>> (with strictly two options for the storage model to consider).
>>>>>>>>>>>>> Moreover, the original doc grew to 14 pages long with one section
>>>>>>>>>>>>> comparing 5 design alternatives, which made it harder to reach
>>>>>>>>>>>>> consensus. The lack of consensus is what partly led up to the
>>>>>>>>>>>>> subsequent discussions and call for a more focused approach to reach
>>>>>>>>>>>>> consensus. If we already have a consensus on the storage model
>>>>>>>>>>>>> (separate tables and views), I think we should take things further and
>>>>>>>>>>>>> have continued focused discussions on the specific metadata in the form
>>>>>>>>>>>>> of a PR. I have included all previous discussions including the
>>>>>>>>>>>>> original doc and issue as references in the PR description. Please let
>>>>>>>>>>>>> me know if this works. Happy to hear others' thoughts on the best way
>>>>>>>>>>>>> to move forward.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Walaa.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, May 8, 2024 at 12:56 AM Jan Kaul
>>>>>>>>>>>>> <jank...@mailbox.org.invalid>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks Walaa for trying to move things along. However I don't
>>>>>>>>>>>>>> think it's a good idea to start a separate discussion about the
>>>>>>>>>>>>>> metadata for materialized views because we already had this discussion
>>>>>>>>>>>>>> and reached consensus in this google doc:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?usp=sharing
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Once the draft is finalized we can adapt the PR to reflect
>>>>>>>>>>>>>> the consensus from the google doc.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best wishes,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Jan
>>>>>>>>>>>>>> On 07.05.24 19:11, Walaa Eldin Moustafa wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks Steven. I feel it is needed so the MV spec is not
>>>>>>>>>>>>>> scattered across the table and view spec pages. We may add a reference
>>>>>>>>>>>>>> in each respective properties section.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, May 7, 2024 at 10:04 AM Steven Wu <
>>>>>>>>>>>>>> stevenz...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Walaa, thanks for initiating the next step.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> With the agreed model of separate view and storage table, I
>>>>>>>>>>>>>>> am wondering if a separate materialized view spec page is needed. E.g.,
>>>>>>>>>>>>>>> the new view metadata (view-materialized and view-storage-table) is
>>>>>>>>>>>>>>> probably good to be added to the view page directly to avoid
>>>>>>>>>>>>>>> information scattering. The same can be said about the storage table
>>>>>>>>>>>>>>> metadata.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We may keep the separate materialized view page to document
>>>>>>>>>>>>>>> motivation, freshness semantics, etc.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, May 6, 2024 at 10:58 PM Walaa Eldin Moustafa <
>>>>>>>>>>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Everyone,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks again for participating in the modeling discussion
>>>>>>>>>>>>>>>> [1]. Since the outcome of this discussion was to model materialized
>>>>>>>>>>>>>>>> views as separate objects, an Iceberg view and a table, I think the
>>>>>>>>>>>>>>>> next step should be discussing the metadata details for each object. I
>>>>>>>>>>>>>>>> have created a PR https://github.com/apache/iceberg/pull/10280 with an
>>>>>>>>>>>>>>>> initial spec improvement. Please feel free to review it and leave
>>>>>>>>>>>>>>>> feedback there.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>> https://lists.apache.org/thread/rotmqzmwk5jrcsyxhzjhrvcjs5v3yjcc
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Walaa.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>
>>> --
>>> John Zhuge
>>>
>>
