Re: Materialized Views: Next Steps

Walaa Eldin Moustafa Fri, 17 May 2024 15:16:48 -0700

I am not in favor of expanding the spec for use cases that do not directly
serve materialized views. Identifying general lineage is a separate problem
that is also applicable to non-materialized views so maybe that’s worth
discussing in a separate spec. If there is a use case for timestamp or
snapshot level properties for materialize views, we can discuss them but so
far I feel they are redundant. What do you think?


Thanks,
Walaa.

On Fri, May 17, 2024 at 3:05 PM Benny Chow <btc...@gmail.com> wrote:

> I think it’s still worthwhile to include the snapshot and timestamp refs
> for completeness sake.
>
> Also, Jan brought up interesting use case with BI tool using the MV
> without SQL representation.  The BI tool can get all table and view
> dependencies if the lineage is complete.
>
> Thanks
>
>
> On May 17, 2024, at 1:35 PM, Walaa Eldin Moustafa <wa.moust...@gmail.com>
> wrote:
>
> 
>
> Sounds good. I am assuming we agree it is not required for either snapshot
> or timestamp?
>
> Thanks,
> Walaa.
>
>
> On Fri, May 17, 2024 at 1:17 PM Benny Chow <btc...@gmail.com> wrote:
>
>> I like Jack's suggestions to capture the ref type and value!  When the
>> ref type is branch, the snapshot id is dynamic and so the engine using the
>> MV can validate that the latest snapshot on a branch matches the branch
>> snapshot at the time of materialization.
>>
>> I think if we do this then we don't need to precisely identify the same
>> table (at different snapshots) in the MV's query tree.  So, we don't need
>> to capture any additional properties like alias, parent view, path to root,
>> sequence number, etc.
>>
>> Thanks
>> Benny
>>
>> On Fri, May 17, 2024 at 11:20 AM Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>>
>>> Thanks Jack, and welcome back!
>>>
>>> Taking a step back, I understand the initial concern was that a table
>>> name (e.g., t1 in your example) would be referenced multiple times in the
>>> view definition and each reference is associated with a different snapshot
>>> ID, hence UUID is not sufficient to capture each occurrence/reference. I
>>> proposed:
>>>
>>> * The solution to track unique occurrences is to use something along the
>>> lines of the SQL alias (e.g., "t1" for the first occurrence and "t2" from
>>> "as t2" in your example) to uniquely identify each occurrence -- we can
>>> tweak the representation and explore how to handle this in case of nested
>>> queries, etc, but alias is the main concept to track uniqueness.
>>> * However, since this leads to a series of open ended problems, I have
>>> also suggested avoiding this complexity altogether and not supporting time
>>> travel in MVs for now.
>>>
>>> However, thinking again, are not time travel queries in MVs
>>> self-containing the exact snapshot ID that we are trying to track in the
>>> properties? Looks like this information is already encoded in the query and
>>> there is no need to capture it externally.
>>>
>>> For example, if the MV definition consists of table references where all
>>> of the references are bound to specific snapshot IDs or timestamps, then
>>> the storage table is always fresh no matter if the underlying tables
>>> change. Tracking snapshot IDs in the storage table is only required for
>>> table references that are not pinned to a specific snapshot ID/timestamp in
>>> the view definition, for which UUID is sufficient.
>>>
>>> Thanks,
>>> Walaa.
>>>
>>>
>>> On Fri, May 17, 2024 at 9:51 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>>> Hi everyone, just want to say I am back from the leave, and currently
>>>> catching up with the threads, I will make more comments later after knowing
>>>> more details of what has been going on. Looks like we've made great
>>>> progress!
>>>>
>>>> Just my 2 cents on the current properties vs metadata field discussion.
>>>> The proposed properties are:
>>>> - in view:
>>>>   1. a boolean flag to indicate a view is a MV
>>>>   2. a pointer to the storage table
>>>> - in storage table:
>>>>   3. view version that is materialized
>>>>   4. a prefix-based map to describe the snapshot version of the base
>>>> tables that are materialized
>>>>   5. a prefix-based map to describe the version of child views that are
>>>> materialized
>>>>
>>>> For 1, 2, and 3, these are all pretty simple and can be just
>>>> properties. I guess 4 and 5 are the main ones that seem complex and can be
>>>> more formalized as metadata fields. I think the time travel cases Bunny
>>>> brought up might be good ones to look into more details:
>>>>
>>>> For direct version travel, I think the base table version serves as the
>>>> default. If you have a MV query like
>>>>
>>>> SELECT * FROM
>>>>   t1,
>>>>   t1 FOR SYSTEM_VERSION AS OF 987654 as t2
>>>>   WHERE t1.c1 = t2.c1
>>>>
>>>> and in the storage table it says t1 maps to snapshot id 123456, then
>>>> the query is still not ambiguous, it should be interpreted as
>>>>
>>>> SELECT * FROM
>>>>   t1 FOR SYSTEM_VERSION AS OF 123456,
>>>>   t1 FOR SYSTEM_VERSION AS OF 987654 as t2
>>>>   WHERE t1.c1 = t2.c1
>>>>
>>>> For ref travel, the specific ref version needs to be fixed at MV
>>>> creation time:
>>>>
>>>> SELECT * FROM
>>>>   t1,
>>>>   t1 FOR SYSTEM_VERSION AS OF '2024-Q1' as t2
>>>>   WHERE t1.c1 = t2.c1
>>>>
>>>> Just storing table UUID is not sufficient. In a property-based
>>>> approach, we need something like
>>>> base.table.<table>.ref.<ref-name>=<snapshot-id>.
>>>>
>>>> Time travel is similar to ref travel:
>>>>
>>>> SELECT * FROM
>>>>   t1,
>>>>   t1 FOR SYSTEM_TIME AS OF timestamp '2024-01-01' as t2
>>>>   WHERE t1.c1 = t2.c1
>>>>
>>>> In a property-based approach, we need something like
>>>> base.table.<table>.time.<timestamp>=<snapshot-id>.
>>>>
>>>> Technically this is indeed getting increasingly complex, so I can get
>>>> why many of us say this property-based approach is quite brittle. However,
>>>> it seems like it can still work as we extend the property structure.
>>>> Personally speaking I am leaning more towards the property-based approach
>>>> just for its simplicity, but I need to think more about other use cases as
>>>> well.
>>>>
>>>> Best,
>>>> Jack Ye
>>>>
>>>>
>>>> On Thu, May 16, 2024 at 10:21 PM Walaa Eldin Moustafa <
>>>> wa.moust...@gmail.com> wrote:
>>>>
>>>>> I think this is orthogonal to the property vs metadata field since
>>>>> instead of representing the property as `base.table.[UUID]` it would be
>>>>> something like `base.table.[alias]` where `alias` is the specific
>>>>> occurrence of the table in the query according to its alias (and SELECT
>>>>> scope possibly, which kind of opens the door to further complexities, but
>>>>> for the sake of argument -- there is a mapping to properties too).
>>>>>
>>>>> Another question: assuming we go with the top level metadata model,
>>>>> will we still expose this metadata on the engine side as properties? What
>>>>> would the property names be?
>>>>>
>>>>> Thanks,
>>>>> Walaa.
>>>>>
>>>>>
>>>>> On Thu, May 16, 2024 at 9:55 PM Benny Chow <btc...@gmail.com> wrote:
>>>>>
>>>>>> Sounds good.
>>>>>>
>>>>>> Another benefit of the struct model is that it's more extensible in
>>>>>> the future when we need to disambiguate the same table that appears
>>>>>> multiple times in the MV query tree.
>>>>>> This could happen with time travel queries or branching.  We may end
>>>>>> up adding additional properties like a sequence number, parent view or 
>>>>>> path
>>>>>> to root.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> On Thu, May 16, 2024 at 3:57 PM Walaa Eldin Moustafa <
>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Benny, I have responded to the comment.
>>>>>>>
>>>>>>> I would suggest that we use this thread to evaluate properties model
>>>>>>> vs top level metadata model (to avoid discussion drift).
>>>>>>>
>>>>>>> If we have feedback on the actual properties used in the properties
>>>>>>> model as defined in the PR, we can have the discussion there.
>>>>>>>
>>>>>>> THanks,
>>>>>>> Walaa.
>>>>>>>
>>>>>>>
>>>>>>> On Thu, May 16, 2024 at 3:22 PM Benny Chow <btc...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Walaa
>>>>>>>>
>>>>>>>> I left comments in your spec PR:
>>>>>>>> https://github.com/apache/iceberg/pull/10280#pullrequestreview-2061922169
>>>>>>>>  My last question about use cases was really about incremental refresh 
>>>>>>>> with
>>>>>>>> aggregates.  But I think this might be too complicated to try to
>>>>>>>> model/discuss now and so I agree with Micah's comment about doing it 
>>>>>>>> in a
>>>>>>>> future iteration.
>>>>>>>>
>>>>>>>> Hi Jan,
>>>>>>>>
>>>>>>>> Regarding storing the identifiers, I like the idea too.  Dremio's
>>>>>>>> query engine supports MVs on sources besides Iceberg tables.  Here's
>>>>>>>> everything that's in a single lineage entry:
>>>>>>>> https://github.com/dremio/dremio-oss/blob/master/services/accelerator/src/main/protobuf/reflection.proto#L80
>>>>>>>> The lineage is stored as a graph and not a list of entries.  I think 
>>>>>>>> for
>>>>>>>> what we are trying to achieve, it's more practical to limit the 
>>>>>>>> lineage to
>>>>>>>> Iceberg sources.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Benny
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, May 15, 2024 at 12:06 AM Jan Kaul
>>>>>>>> <jank...@mailbox.org.invalid> wrote:
>>>>>>>>
>>>>>>>>> I agree with Szehon and Benny that storing the lineage information
>>>>>>>>> as multiple table properties is too brittle, especially for many 
>>>>>>>>> source
>>>>>>>>> tables (base tables). I would prefer to have the whole lineage 
>>>>>>>>> information
>>>>>>>>> in one entry as it is more concise. This is also how Trino has been 
>>>>>>>>> doing
>>>>>>>>> it, as you can see here
>>>>>>>>> <https://github.com/trinodb/trino/blob/212455d3e1d393f58cbc395d2b9da47ed8f23dd8/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java#L2915>
>>>>>>>>> .
>>>>>>>>>
>>>>>>>>> As we've discussed in the google doc
>>>>>>>>> <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit#heading=h.60qmzug7bzxc>:
>>>>>>>>> it is helpful to also store the table identifiers of the source 
>>>>>>>>> tables to
>>>>>>>>> enable clients to determine the freshness of the MV that don't 
>>>>>>>>> understand
>>>>>>>>> the SQL dialect of the MV definition, like other query engines, BI 
>>>>>>>>> tools
>>>>>>>>> and Dataframe libraries. This is also how Trino is doing it. That's 
>>>>>>>>> why we
>>>>>>>>> chose the design in the google doc.
>>>>>>>>>
>>>>>>>>> Storing the storage table identifier as a property might work.
>>>>>>>>>
>>>>>>>>> Thanks, Jan
>>>>>>>>> On 15.05.24 02:38, Walaa Eldin Moustafa wrote:
>>>>>>>>>
>>>>>>>>> Thanks Benny. My specific thoughts about the spec and the
>>>>>>>>> properties are captured in the spec PR
>>>>>>>>> https://github.com/apache/iceberg/pull/10280. The spec is also
>>>>>>>>> implemented in the Spark implementation PR
>>>>>>>>> https://github.com/apache/iceberg/pull/9830, and I believe this
>>>>>>>>> follows the same nature of how the information was captured in 
>>>>>>>>> Netflix's
>>>>>>>>> implementation with Spark, and Trino implementation (prior to 
>>>>>>>>> formalizing
>>>>>>>>> through that spec), both of which have been used reliably for years. I
>>>>>>>>> think it also aligns with Ryan's feedback here
>>>>>>>>> https://github.com/apache/iceberg/issues/6420#issuecomment-1369280546 
>>>>>>>>> which
>>>>>>>>> indicated the usage of properties.
>>>>>>>>>
>>>>>>>>> The reasons for choosing properties:
>>>>>>>>> * Not every table is a storage table and not every view is a
>>>>>>>>> materialized view. I feel exposing the info as top level metadata is 
>>>>>>>>> an
>>>>>>>>> overkill for the original object type.
>>>>>>>>> * The properties are simple. They contain either single snapshot
>>>>>>>>> ID each, or single view version each, or lastly, the storage table
>>>>>>>>> identifier. Engines can use them without issues (also as shown in the
>>>>>>>>> implementation).
>>>>>>>>> * To be meaningful, the metadata fields should be captured in the
>>>>>>>>> engine API as well, which is an effort that has to be pursued outside 
>>>>>>>>> the
>>>>>>>>> Iceberg community. Until engine APIs evolve, we would have to define a
>>>>>>>>> mapping between Iceberg metadata fields and engine properties (only 
>>>>>>>>> current
>>>>>>>>> place in engine side to capture this info). This requires an 
>>>>>>>>> additional
>>>>>>>>> spec on its own, and it will introduce complexities. Hence it is 
>>>>>>>>> always
>>>>>>>>> cleaner to map Iceberg properties to engine properties and Iceberg 
>>>>>>>>> metadata
>>>>>>>>> to designated engine APIs. Mixing and matching (e.g., Iceberg metadata
>>>>>>>>> fields as engine properties) just for the lack of other cleaner 
>>>>>>>>> options
>>>>>>>>> does not sound like a good idea in both short and long term.
>>>>>>>>>
>>>>>>>>> Let me know your thoughts.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Walaa.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, May 14, 2024 at 5:12 PM Benny Chow <btc...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I agree with Szheon here.  I think storing the materialization
>>>>>>>>>> lineage as a bunch of properties is brittle.  This lineage 
>>>>>>>>>> information is
>>>>>>>>>> needed by engines to validate the staleness of a materialization and 
>>>>>>>>>> also
>>>>>>>>>> to perform full or incremental refreshes.  There’s a lot to capture 
>>>>>>>>>> here.
>>>>>>>>>>
>>>>>>>>>> Maybe we should drill down into the use cases first - such as
>>>>>>>>>> incremental refresh with aggregates?  (Pick a harder one first 😀)
>>>>>>>>>>
>>>>>>>>>> I left a few comments about this in the doc.  I wonder what are
>>>>>>>>>> your thoughts here Walaa?
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>> On May 14, 2024, at 4:20 PM, Walaa Eldin Moustafa <
>>>>>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> 
>>>>>>>>>> Thanks John. The current metadata does not sound complex. We need
>>>>>>>>>> to track the underlying table snapshot IDs as well as the view 
>>>>>>>>>> version ID.
>>>>>>>>>> I agree as long as it is simple and before this feature fully 
>>>>>>>>>> matures, we
>>>>>>>>>> should track it in properties.
>>>>>>>>>>
>>>>>>>>>> One important factor for me (apart from the API effort,
>>>>>>>>>> especially on the engine side), is that not each table is an MV 
>>>>>>>>>> storage
>>>>>>>>>> table. Surfacing MV-specific storage table properties as first class 
>>>>>>>>>> table
>>>>>>>>>> metadata sounds to impose this metadata on every table, when it is 
>>>>>>>>>> not
>>>>>>>>>> required for normal table operation (yes, they can be optional, but 
>>>>>>>>>> it does
>>>>>>>>>> not sound like the use case warrants exposing them as metadata 
>>>>>>>>>> fields yet).
>>>>>>>>>>
>>>>>>>>>> Similarly, since not every view is a materialized view, it sounds
>>>>>>>>>> reasonable to keep MV-specific data in properties.
>>>>>>>>>>
>>>>>>>>>> UUID (for views), on the other hand, is common to all views,
>>>>>>>>>> hence it made sense to add it as a top level field.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Walaa.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, May 14, 2024 at 1:01 PM John Zhuge <jzh...@apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Szheon,
>>>>>>>>>>>
>>>>>>>>>>> While I fully share your concern of abusing table properties, we
>>>>>>>>>>> took the approach of option 1 and run it in production for several 
>>>>>>>>>>> years:
>>>>>>>>>>>
>>>>>>>>>>>    - the feature was still evolving
>>>>>>>>>>>    - quick and simple implementation
>>>>>>>>>>>    - table properties are simple enough and not confusing
>>>>>>>>>>>    - haven't seen an urgent need to convert the properties to
>>>>>>>>>>>    metadata fields and add API
>>>>>>>>>>>    - do not wish on-disk changes (requiring lengthy
>>>>>>>>>>>    tedious migration)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> That said, I am open to codifying the mv metadata into api and
>>>>>>>>>>> spec, with the following considerations
>>>>>>>>>>>
>>>>>>>>>>>    - mv metadata has reached maturity and consensus (could be
>>>>>>>>>>>    just a core portion)
>>>>>>>>>>>    - when mv metadata becomes too complex
>>>>>>>>>>>    - wish to support use cases that are quicker to adopt API
>>>>>>>>>>>    changes (than engines), e.g., using Iceberg library to 
>>>>>>>>>>> manipulate MVs, or
>>>>>>>>>>>    parsing metadata files directly
>>>>>>>>>>>    - Spark view catalog API can evolve separately from Iceberg
>>>>>>>>>>>    API and spec changes
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks all for the great discussion!
>>>>>>>>>>>
>>>>>>>>>>> On Fri, May 10, 2024 at 10:48 PM Walaa Eldin Moustafa <
>>>>>>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Szheon,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the follow-up. It is possible some of the concerns
>>>>>>>>>>>> were referring to the backend catalogs, but it is all connected. 
>>>>>>>>>>>> My main
>>>>>>>>>>>> personal concern is from the engine connector APIs point of view, 
>>>>>>>>>>>> but I
>>>>>>>>>>>> share the concern about the catalogs.
>>>>>>>>>>>>
>>>>>>>>>>>> I think everyone's concern is not about the complexity* per*
>>>>>>>>>>>> backend catalog/engine catalog API (in which case adding new 
>>>>>>>>>>>> metadata to
>>>>>>>>>>>> existing objects could require less code), but rather about the
>>>>>>>>>>>> *number* of APIs and implementations that need to change (in
>>>>>>>>>>>> which case both new metadata to existing objects and new objects 
>>>>>>>>>>>> altogether
>>>>>>>>>>>> introduce equal complexity).
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Walaa.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, May 10, 2024 at 10:31 AM Szehon Ho <
>>>>>>>>>>>> szehon.apa...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Walaa
>>>>>>>>>>>>>
>>>>>>>>>>>>> OK thanks for confirming.  I am still not 100% in agreement,
>>>>>>>>>>>>> my understanding of the rationale for separate Table/View objects 
>>>>>>>>>>>>> in the
>>>>>>>>>>>>> comment that you linked:
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think the biggest problem with this is that we would need to
>>>>>>>>>>>>>> modify every catalog to support this combination and that would 
>>>>>>>>>>>>>> be really
>>>>>>>>>>>>>> difficult.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> is about JavaCatalogs /REST Catalog needing to to support
>>>>>>>>>>>>> creating , persisting, and loading a MaterializedView object, 
>>>>>>>>>>>>> which is much
>>>>>>>>>>>>> more complex.  See HiveView PR for example :
>>>>>>>>>>>>> https://github.com/apache/iceberg/pull/9852   We would have
>>>>>>>>>>>>> to do the same exercise for persisting MV.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In our case though, there's not much complexity regardless of
>>>>>>>>>>>>> approach ('properties' or new metadata fields), in terms of Java
>>>>>>>>>>>>> Catalog/REST Catalog.  It's mostly pass-through to storage.  
>>>>>>>>>>>>> Looks like you
>>>>>>>>>>>>> are referring to Spark's View model in terms of complexity, which 
>>>>>>>>>>>>> may be a
>>>>>>>>>>>>> different story, but not sure if it is a good rationale to make 
>>>>>>>>>>>>> Iceberg to
>>>>>>>>>>>>> use 'properties' .
>>>>>>>>>>>>>
>>>>>>>>>>>>> 'properties'  is for read/write configurations, not to save
>>>>>>>>>>>>> metadatas.  To me, its also brittle to save important metadata, 
>>>>>>>>>>>>> as it's not
>>>>>>>>>>>>> in the defined schema.
>>>>>>>>>>>>>
>>>>>>>>>>>>> A string to string map of table properties. This is used to
>>>>>>>>>>>>>> control settings that affect reading and writing and is not 
>>>>>>>>>>>>>> intended to be
>>>>>>>>>>>>>> used for arbitrary metadata.  For example,
>>>>>>>>>>>>>> commit.retry.num-retries is used to control the number of
>>>>>>>>>>>>>> commit retries.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On the other hand, the Draft Spec suggests to save `lineage`
>>>>>>>>>>>>> as a modeled field on the Storage Table's snapshot metadata.  
>>>>>>>>>>>>> This allows
>>>>>>>>>>>>> you to 'time travel', 'branch', and have this metadata life cycle
>>>>>>>>>>>>> integrated via normal snapshots lifecycle operations.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So that's my rationale.  Not sure if we can come to an
>>>>>>>>>>>>> agreement over email though, and may need others to chime in as 
>>>>>>>>>>>>> well.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> Szehon
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, May 9, 2024 at 11:58 PM Walaa Eldin Moustafa <
>>>>>>>>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Szehon,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes, you are reading the PR correctly, and interpreting the
>>>>>>>>>>>>>> meaning of properties correctly. I think the reply you pasted 
>>>>>>>>>>>>>> from Ryan
>>>>>>>>>>>>>> refers to the same concept as well.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For the initial Google doc and the issue (by the way it is an
>>>>>>>>>>>>>> issue, not a PR), yes both are proposing new metadata fields.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The references I made to the modeling doc [1, 2] are reasons
>>>>>>>>>>>>>> why new APIs are not desired. The cons/concerns applicable to 
>>>>>>>>>>>>>> new MV
>>>>>>>>>>>>>> metadata apply by extension to new table and view metadata 
>>>>>>>>>>>>>> fields.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The reason why new metadata adds complexity is that this new
>>>>>>>>>>>>>> metadata needs to be propagated to the engine API. For example, 
>>>>>>>>>>>>>> here is the
>>>>>>>>>>>>>> ViewInfo [3] class in the Spark catalog, which is used in view 
>>>>>>>>>>>>>> methods like
>>>>>>>>>>>>>> createView. Its fields correspond with the Iceberg metadata. 
>>>>>>>>>>>>>> Adding new
>>>>>>>>>>>>>> Iceberg fields should be accompanied with new fields in the 
>>>>>>>>>>>>>> engine
>>>>>>>>>>>>>> catalog/connector APIs, which was a major reason for rejecting 
>>>>>>>>>>>>>> the combined
>>>>>>>>>>>>>> MV object model as well.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>> https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1&disco=AAABK7e3QB4
>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>> https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1&disco=AAABIonvCGE
>>>>>>>>>>>>>> [3]
>>>>>>>>>>>>>> https://github.com/apache/spark/blob/2df494fd4e4e64b9357307fb0c5e8fc1b7491ac3/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/ViewInfo.java#L45
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Walaa.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, May 9, 2024 at 11:30 PM Szehon Ho <
>>>>>>>>>>>>>> szehon.apa...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Walaa
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> As there may be confusion in the word 'properties', I want
>>>>>>>>>>>>>>> to double check if we are talking about the same thing here.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am reading your PR as adding lineage metadata as new
>>>>>>>>>>>>>>> key/value pair under the storage Table's 'properties' field:
>>>>>>>>>>>>>>> https://github.com/apache/iceberg/blob/main/format/spec.md?plain=1#L677
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> *optional* *optional* *properties* A string to string map
>>>>>>>>>>>>>>> of table properties. This is used to control settings that 
>>>>>>>>>>>>>>> affect reading
>>>>>>>>>>>>>>> and writing and is not intended to be used for arbitrary 
>>>>>>>>>>>>>>> metadata. For
>>>>>>>>>>>>>>> example, commit.retry.num-retries is used to control the
>>>>>>>>>>>>>>> number of commit retries.
>>>>>>>>>>>>>>> and adding Storage Table pointer as key/value pair in the
>>>>>>>>>>>>>>> View's 'properties' field:
>>>>>>>>>>>>>>> https://github.com/apache/iceberg/blob/main/format/view-spec.md?plain=1#L65
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> *optional* properties A string to string map of view
>>>>>>>>>>>>>>> properties [2]
>>>>>>>>>>>>>>> Is that correct?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On the other hand, I was talking about adding this metadata
>>>>>>>>>>>>>>> as actual fields, as is described in the Draft Spec of the 
>>>>>>>>>>>>>>> Design Doc
>>>>>>>>>>>>>>> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A
>>>>>>>>>>>>>>>  and
>>>>>>>>>>>>>>> first PR https://github.com/apache/iceberg/issues/6420 .
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Do you mean, the vote means we cannot model new fields like
>>>>>>>>>>>>>>> 'materialization' and 'lineage' as was proposed there ?    If 
>>>>>>>>>>>>>>> that is the
>>>>>>>>>>>>>>> interpretation, I am not sure I agree.  I dont fully see how 
>>>>>>>>>>>>>>> new fields
>>>>>>>>>>>>>>> adds more catalog implementation complexity over new key/value 
>>>>>>>>>>>>>>> properties?
>>>>>>>>>>>>>>> To me, the vote seemed to just rule out using a combined 
>>>>>>>>>>>>>>> catalog object
>>>>>>>>>>>>>>> (MaterializedView) in favor of re-using the Table and View 
>>>>>>>>>>>>>>> metadata models,
>>>>>>>>>>>>>>> not to prevent change to the Table and View model.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>> Szehon
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, May 9, 2024 at 10:05 PM Walaa Eldin Moustafa <
>>>>>>>>>>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Szehon,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I think choosing separate view + table objects precludes us
>>>>>>>>>>>>>>>> from adding new metadata to table and view metadata. Here is 
>>>>>>>>>>>>>>>> one relevant
>>>>>>>>>>>>>>>> comment [1] from Ryan on the modeling doc, where his point is 
>>>>>>>>>>>>>>>> that we want
>>>>>>>>>>>>>>>> to avoid introducing new APIs since it requires updating every 
>>>>>>>>>>>>>>>> catalog, and
>>>>>>>>>>>>>>>> (quoting) even now, we have few implementations that support 
>>>>>>>>>>>>>>>> views because
>>>>>>>>>>>>>>>> of the problems updating back ends. Therefore, one of the 
>>>>>>>>>>>>>>>> major reasons to
>>>>>>>>>>>>>>>> avoid a new model with new metadata is to avoid adding new 
>>>>>>>>>>>>>>>> metadata, which
>>>>>>>>>>>>>>>> introduces this complexity. Here is another similar comment 
>>>>>>>>>>>>>>>> from Renjie [2]
>>>>>>>>>>>>>>>> on the cons listed for the combined object approach.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Even Ryan's point on the MV issue that you referenced reads
>>>>>>>>>>>>>>>> to me as he is supportive of the property model. Here are some 
>>>>>>>>>>>>>>>> quotes:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> > We would still want some MV metadata in table
>>>>>>>>>>>>>>>> *properties*.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> > I recommend instead reusing the existing snapshot
>>>>>>>>>>>>>>>> metadata structure to store what you need as snapshot 
>>>>>>>>>>>>>>>> *properties*.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> > First, I think we want to avoid keeping much state
>>>>>>>>>>>>>>>> information in complex table *properties*.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Again, here, he is supportive of table properties, but
>>>>>>>>>>>>>>>> wants to make sure that the information is simple.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> > We may want additional metadata as well, like a UUID to
>>>>>>>>>>>>>>>> ensure we have the right view. I don't think we have a UUID in 
>>>>>>>>>>>>>>>> the view
>>>>>>>>>>>>>>>> spec yet, but we could add one.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Here, he is very specific when it comes to new metadata
>>>>>>>>>>>>>>>> fields, and explicitly calls it out. That is the only new 
>>>>>>>>>>>>>>>> metadata field in
>>>>>>>>>>>>>>>> that reply and by now it is already supported. It is also not 
>>>>>>>>>>>>>>>> MV-specific.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hope this addresses your question on the property vs new
>>>>>>>>>>>>>>>> metadata model.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1&disco=AAABK7e3QB4
>>>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1&disco=AAABIonvCGE
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Walaa.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, May 9, 2024 at 5:49 PM Szehon Ho <
>>>>>>>>>>>>>>>> szehon.apa...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Walaa,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I agree, I definitely do not want yet another pr/doc where
>>>>>>>>>>>>>>>>> discussion happens. as its already quite spread out :)  But 
>>>>>>>>>>>>>>>>> did not want to
>>>>>>>>>>>>>>>>> clarify some points before we get started on the discussion 
>>>>>>>>>>>>>>>>> on your PR.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> With reusing the table and view objects, we are not
>>>>>>>>>>>>>>>>>> changing the existing metadata of either table or view spec 
>>>>>>>>>>>>>>>>>> but rather
>>>>>>>>>>>>>>>>>> introduce new properties and formalize them to express 
>>>>>>>>>>>>>>>>>> materialized views
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On this point, I am not 100% sure that choosing to
>>>>>>>>>>>>>>>>> represent a MaterializedView as a separate View + Table 
>>>>>>>>>>>>>>>>> object precludes us
>>>>>>>>>>>>>>>>> from adding to metadata of Table or View as the Draft Spec 
>>>>>>>>>>>>>>>>> suggested,
>>>>>>>>>>>>>>>>> though.  I think this point was discussed in Jan's initial PR 
>>>>>>>>>>>>>>>>> with a good
>>>>>>>>>>>>>>>>> point from Ryan:
>>>>>>>>>>>>>>>>> https://github.com/apache/iceberg/issues/6420#issuecomment-1369280546
>>>>>>>>>>>>>>>>>  that
>>>>>>>>>>>>>>>>> using Table Properties to track lineage is fairly brittle, 
>>>>>>>>>>>>>>>>> and having it
>>>>>>>>>>>>>>>>> formalized in the Iceberg metadata is cleaner, and that was 
>>>>>>>>>>>>>>>>> thus the
>>>>>>>>>>>>>>>>> direction of the Draft Spec in the design doc.  What do 
>>>>>>>>>>>>>>>>> people think?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>> Szehon
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, May 9, 2024 at 5:35 PM Walaa Eldin Moustafa <
>>>>>>>>>>>>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks Szehon.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The reason for the difference is that the proposal in the
>>>>>>>>>>>>>>>>>> Google doc is based on a new MV model, hence, new metadata 
>>>>>>>>>>>>>>>>>> fields and a new
>>>>>>>>>>>>>>>>>> metadata model were being introduced (with types, 
>>>>>>>>>>>>>>>>>> optionality, etc). With
>>>>>>>>>>>>>>>>>> reusing the table and view objects, we are not changing the 
>>>>>>>>>>>>>>>>>> existing
>>>>>>>>>>>>>>>>>> metadata of either table or view spec but rather introduce 
>>>>>>>>>>>>>>>>>> new properties
>>>>>>>>>>>>>>>>>> and formalize them to express materialized views. This would 
>>>>>>>>>>>>>>>>>> be the answer
>>>>>>>>>>>>>>>>>> to most of the questions you posted on the PR (besides some 
>>>>>>>>>>>>>>>>>> naming
>>>>>>>>>>>>>>>>>> questions, which I think should be straightforward).
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> With that fundamental difference, we cannot lift and
>>>>>>>>>>>>>>>>>> shift what is in the doc to any PR. Further, having 
>>>>>>>>>>>>>>>>>> consensus on separate
>>>>>>>>>>>>>>>>>> table and view objects contradicts with the point being made 
>>>>>>>>>>>>>>>>>> on having
>>>>>>>>>>>>>>>>>> consensus on the doc. We might have had agreements on some 
>>>>>>>>>>>>>>>>>> elements, but
>>>>>>>>>>>>>>>>>> definitely not on the whole doc, proven by the follow ups 
>>>>>>>>>>>>>>>>>> (also as a
>>>>>>>>>>>>>>>>>> community, not individuals).
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Therefore: we need a new space to discuss the separate
>>>>>>>>>>>>>>>>>> table and view properties.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Is the question whether to:
>>>>>>>>>>>>>>>>>> 1- Create a new doc
>>>>>>>>>>>>>>>>>> 2- Create a new PR?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I feel a PR is the most effective way, especially given
>>>>>>>>>>>>>>>>>> the fact that we discussed the topic a lot by now. If we 
>>>>>>>>>>>>>>>>>> agree, we can
>>>>>>>>>>>>>>>>>> continue the discussion on the PR, else, we can create a doc.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Walaa.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thu, May 9, 2024 at 4:39 PM Szehon Ho <
>>>>>>>>>>>>>>>>>> szehon.apa...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks Walaa for driving it forward, looking forward to
>>>>>>>>>>>>>>>>>>> thinking about implementation of Materialized Views.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I see Jan's point, the PR spec change is similar but
>>>>>>>>>>>>>>>>>>> does not seem to be completely aligned with the Draft Spec 
>>>>>>>>>>>>>>>>>>> in the design
>>>>>>>>>>>>>>>>>>> doc:
>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/
>>>>>>>>>>>>>>>>>>> .  I left my comments on PR of those sections with the 
>>>>>>>>>>>>>>>>>>> links to the
>>>>>>>>>>>>>>>>>>> difference.  I think most of those Draft Spec proposal is 
>>>>>>>>>>>>>>>>>>> still applicable
>>>>>>>>>>>>>>>>>>> after the decision to have separate Table and View objects  
>>>>>>>>>>>>>>>>>>> It will be
>>>>>>>>>>>>>>>>>>> interesting to at least see drill a bit further why we did 
>>>>>>>>>>>>>>>>>>> not choose the
>>>>>>>>>>>>>>>>>>> approach in the Draft Spec and chose another way.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>> Szehon
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, May 8, 2024 at 4:48 AM Jan Kaul
>>>>>>>>>>>>>>>>>>> <jank...@mailbox.org.invalid>
>>>>>>>>>>>>>>>>>>> <jank...@mailbox.org.invalid> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Well, everybody that actively contributed to the
>>>>>>>>>>>>>>>>>>>> discussion on the original google doc was in consensus. 
>>>>>>>>>>>>>>>>>>>> That's why I
>>>>>>>>>>>>>>>>>>>> brought up the topic at the Community Sync on the 
>>>>>>>>>>>>>>>>>>>> 2024-02-14 (
>>>>>>>>>>>>>>>>>>>> https://youtu.be/uAQVGd5zV4I?t=890) to raise the
>>>>>>>>>>>>>>>>>>>> awareness of the broader community. After which the 
>>>>>>>>>>>>>>>>>>>> discussion about the
>>>>>>>>>>>>>>>>>>>> storage model started. I don't think that the discussion 
>>>>>>>>>>>>>>>>>>>> about a single
>>>>>>>>>>>>>>>>>>>> aspect of a proposal should invalidate all other aspects 
>>>>>>>>>>>>>>>>>>>> of the proposal.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Regardless, the state of the proposal from the original
>>>>>>>>>>>>>>>>>>>> google doc contains a lot of valuable contributions from 
>>>>>>>>>>>>>>>>>>>> Micah, Szehon,
>>>>>>>>>>>>>>>>>>>> Jack, Dan, yourself and others and it should at least 
>>>>>>>>>>>>>>>>>>>> provide the basis for
>>>>>>>>>>>>>>>>>>>> any further discussion. I don't think it's effective to 
>>>>>>>>>>>>>>>>>>>> start with a
>>>>>>>>>>>>>>>>>>>> completely different design because we are bound to have 
>>>>>>>>>>>>>>>>>>>> the same
>>>>>>>>>>>>>>>>>>>> discussions all over again.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks, Jan
>>>>>>>>>>>>>>>>>>>> On 08.05.24 12:11, Walaa Eldin Moustafa wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The only consensus the community had was on the object
>>>>>>>>>>>>>>>>>>>> model through the most recent voting thread [1]. This kind 
>>>>>>>>>>>>>>>>>>>> of consensus was
>>>>>>>>>>>>>>>>>>>> not present during the doc discussions, and this should be 
>>>>>>>>>>>>>>>>>>>> evident from the
>>>>>>>>>>>>>>>>>>>> fact the last doc state listed 5 alternatives with no 
>>>>>>>>>>>>>>>>>>>> particular
>>>>>>>>>>>>>>>>>>>> conclusion. I am not quite sure what type of consensus we 
>>>>>>>>>>>>>>>>>>>> are referring to
>>>>>>>>>>>>>>>>>>>> here given all the follow up discussions, alternatives, 
>>>>>>>>>>>>>>>>>>>> etc.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Due to the separate object model, the PR is
>>>>>>>>>>>>>>>>>>>> fundamentally different from the doc in the sense it does 
>>>>>>>>>>>>>>>>>>>> not propose a new
>>>>>>>>>>>>>>>>>>>> metadata model but rather formalizes some new table and 
>>>>>>>>>>>>>>>>>>>> view properties
>>>>>>>>>>>>>>>>>>>> related to MVs. That is also one reason there are no 
>>>>>>>>>>>>>>>>>>>> repeated discussions.
>>>>>>>>>>>>>>>>>>>> That said, if you feel there is a repeated discussion 
>>>>>>>>>>>>>>>>>>>> (which I do not see
>>>>>>>>>>>>>>>>>>>> so far), it would be best to link the relevant discussion 
>>>>>>>>>>>>>>>>>>>> from the doc in a
>>>>>>>>>>>>>>>>>>>> comment.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Happy to move the discussion elsewhere if there is
>>>>>>>>>>>>>>>>>>>> sufficient support for this idea, but as things stand, I 
>>>>>>>>>>>>>>>>>>>> do not see this as
>>>>>>>>>>>>>>>>>>>> an efficient way to make progress. It sounds we have been 
>>>>>>>>>>>>>>>>>>>> re-emphasizing
>>>>>>>>>>>>>>>>>>>> the same points in the last two replies, so I will let 
>>>>>>>>>>>>>>>>>>>> others chime in at
>>>>>>>>>>>>>>>>>>>> this point.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>> https://lists.apache.org/thread/rotmqzmwk5jrcsyxhzjhrvcjs5v3yjcc
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>> Walaa.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, May 8, 2024 at 2:31 AM Jan Kaul
>>>>>>>>>>>>>>>>>>>> <jank...@mailbox.org.invalid>
>>>>>>>>>>>>>>>>>>>> <jank...@mailbox.org.invalid> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> The original google doc
>>>>>>>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?usp=sharing>
>>>>>>>>>>>>>>>>>>>>> discussed multiple
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>

Re: Materialized Views: Next Steps

Reply via email to