I am not in favor of expanding the spec for use cases that do not directly serve materialized views. Identifying general lineage is a separate problem that is also applicable to non-materialized views so maybe that’s worth discussing in a separate spec. If there is a use case for timestamp or snapshot level properties for materialize views, we can discuss them but so far I feel they are redundant. What do you think?
Thanks, Walaa. On Fri, May 17, 2024 at 3:05 PM Benny Chow <btc...@gmail.com> wrote: > I think it’s still worthwhile to include the snapshot and timestamp refs > for completeness sake. > > Also, Jan brought up interesting use case with BI tool using the MV > without SQL representation. The BI tool can get all table and view > dependencies if the lineage is complete. > > Thanks > > > On May 17, 2024, at 1:35 PM, Walaa Eldin Moustafa <wa.moust...@gmail.com> > wrote: > > > > Sounds good. I am assuming we agree it is not required for either snapshot > or timestamp? > > Thanks, > Walaa. > > > On Fri, May 17, 2024 at 1:17 PM Benny Chow <btc...@gmail.com> wrote: > >> I like Jack's suggestions to capture the ref type and value! When the >> ref type is branch, the snapshot id is dynamic and so the engine using the >> MV can validate that the latest snapshot on a branch matches the branch >> snapshot at the time of materialization. >> >> I think if we do this then we don't need to precisely identify the same >> table (at different snapshots) in the MV's query tree. So, we don't need >> to capture any additional properties like alias, parent view, path to root, >> sequence number, etc. >> >> Thanks >> Benny >> >> On Fri, May 17, 2024 at 11:20 AM Walaa Eldin Moustafa < >> wa.moust...@gmail.com> wrote: >> >>> Thanks Jack, and welcome back! >>> >>> Taking a step back, I understand the initial concern was that a table >>> name (e.g., t1 in your example) would be referenced multiple times in the >>> view definition and each reference is associated with a different snapshot >>> ID, hence UUID is not sufficient to capture each occurrence/reference. I >>> proposed: >>> >>> * The solution to track unique occurrences is to use something along the >>> lines of the SQL alias (e.g., "t1" for the first occurrence and "t2" from >>> "as t2" in your example) to uniquely identify each occurrence -- we can >>> tweak the representation and explore how to handle this in case of nested >>> queries, etc, but alias is the main concept to track uniqueness. >>> * However, since this leads to a series of open ended problems, I have >>> also suggested avoiding this complexity altogether and not supporting time >>> travel in MVs for now. >>> >>> However, thinking again, are not time travel queries in MVs >>> self-containing the exact snapshot ID that we are trying to track in the >>> properties? Looks like this information is already encoded in the query and >>> there is no need to capture it externally. >>> >>> For example, if the MV definition consists of table references where all >>> of the references are bound to specific snapshot IDs or timestamps, then >>> the storage table is always fresh no matter if the underlying tables >>> change. Tracking snapshot IDs in the storage table is only required for >>> table references that are not pinned to a specific snapshot ID/timestamp in >>> the view definition, for which UUID is sufficient. >>> >>> Thanks, >>> Walaa. >>> >>> >>> On Fri, May 17, 2024 at 9:51 AM Jack Ye <yezhao...@gmail.com> wrote: >>> >>>> Hi everyone, just want to say I am back from the leave, and currently >>>> catching up with the threads, I will make more comments later after knowing >>>> more details of what has been going on. Looks like we've made great >>>> progress! >>>> >>>> Just my 2 cents on the current properties vs metadata field discussion. >>>> The proposed properties are: >>>> - in view: >>>> 1. a boolean flag to indicate a view is a MV >>>> 2. a pointer to the storage table >>>> - in storage table: >>>> 3. view version that is materialized >>>> 4. a prefix-based map to describe the snapshot version of the base >>>> tables that are materialized >>>> 5. a prefix-based map to describe the version of child views that are >>>> materialized >>>> >>>> For 1, 2, and 3, these are all pretty simple and can be just >>>> properties. I guess 4 and 5 are the main ones that seem complex and can be >>>> more formalized as metadata fields. I think the time travel cases Bunny >>>> brought up might be good ones to look into more details: >>>> >>>> For direct version travel, I think the base table version serves as the >>>> default. If you have a MV query like >>>> >>>> SELECT * FROM >>>> t1, >>>> t1 FOR SYSTEM_VERSION AS OF 987654 as t2 >>>> WHERE t1.c1 = t2.c1 >>>> >>>> and in the storage table it says t1 maps to snapshot id 123456, then >>>> the query is still not ambiguous, it should be interpreted as >>>> >>>> SELECT * FROM >>>> t1 FOR SYSTEM_VERSION AS OF 123456, >>>> t1 FOR SYSTEM_VERSION AS OF 987654 as t2 >>>> WHERE t1.c1 = t2.c1 >>>> >>>> For ref travel, the specific ref version needs to be fixed at MV >>>> creation time: >>>> >>>> SELECT * FROM >>>> t1, >>>> t1 FOR SYSTEM_VERSION AS OF '2024-Q1' as t2 >>>> WHERE t1.c1 = t2.c1 >>>> >>>> Just storing table UUID is not sufficient. In a property-based >>>> approach, we need something like >>>> base.table.<table>.ref.<ref-name>=<snapshot-id>. >>>> >>>> Time travel is similar to ref travel: >>>> >>>> SELECT * FROM >>>> t1, >>>> t1 FOR SYSTEM_TIME AS OF timestamp '2024-01-01' as t2 >>>> WHERE t1.c1 = t2.c1 >>>> >>>> In a property-based approach, we need something like >>>> base.table.<table>.time.<timestamp>=<snapshot-id>. >>>> >>>> Technically this is indeed getting increasingly complex, so I can get >>>> why many of us say this property-based approach is quite brittle. However, >>>> it seems like it can still work as we extend the property structure. >>>> Personally speaking I am leaning more towards the property-based approach >>>> just for its simplicity, but I need to think more about other use cases as >>>> well. >>>> >>>> Best, >>>> Jack Ye >>>> >>>> >>>> On Thu, May 16, 2024 at 10:21 PM Walaa Eldin Moustafa < >>>> wa.moust...@gmail.com> wrote: >>>> >>>>> I think this is orthogonal to the property vs metadata field since >>>>> instead of representing the property as `base.table.[UUID]` it would be >>>>> something like `base.table.[alias]` where `alias` is the specific >>>>> occurrence of the table in the query according to its alias (and SELECT >>>>> scope possibly, which kind of opens the door to further complexities, but >>>>> for the sake of argument -- there is a mapping to properties too). >>>>> >>>>> Another question: assuming we go with the top level metadata model, >>>>> will we still expose this metadata on the engine side as properties? What >>>>> would the property names be? >>>>> >>>>> Thanks, >>>>> Walaa. >>>>> >>>>> >>>>> On Thu, May 16, 2024 at 9:55 PM Benny Chow <btc...@gmail.com> wrote: >>>>> >>>>>> Sounds good. >>>>>> >>>>>> Another benefit of the struct model is that it's more extensible in >>>>>> the future when we need to disambiguate the same table that appears >>>>>> multiple times in the MV query tree. >>>>>> This could happen with time travel queries or branching. We may end >>>>>> up adding additional properties like a sequence number, parent view or >>>>>> path >>>>>> to root. >>>>>> >>>>>> Thanks >>>>>> >>>>>> On Thu, May 16, 2024 at 3:57 PM Walaa Eldin Moustafa < >>>>>> wa.moust...@gmail.com> wrote: >>>>>> >>>>>>> Hi Benny, I have responded to the comment. >>>>>>> >>>>>>> I would suggest that we use this thread to evaluate properties model >>>>>>> vs top level metadata model (to avoid discussion drift). >>>>>>> >>>>>>> If we have feedback on the actual properties used in the properties >>>>>>> model as defined in the PR, we can have the discussion there. >>>>>>> >>>>>>> THanks, >>>>>>> Walaa. >>>>>>> >>>>>>> >>>>>>> On Thu, May 16, 2024 at 3:22 PM Benny Chow <btc...@gmail.com> wrote: >>>>>>> >>>>>>>> Hi Walaa >>>>>>>> >>>>>>>> I left comments in your spec PR: >>>>>>>> https://github.com/apache/iceberg/pull/10280#pullrequestreview-2061922169 >>>>>>>> My last question about use cases was really about incremental refresh >>>>>>>> with >>>>>>>> aggregates. But I think this might be too complicated to try to >>>>>>>> model/discuss now and so I agree with Micah's comment about doing it >>>>>>>> in a >>>>>>>> future iteration. >>>>>>>> >>>>>>>> Hi Jan, >>>>>>>> >>>>>>>> Regarding storing the identifiers, I like the idea too. Dremio's >>>>>>>> query engine supports MVs on sources besides Iceberg tables. Here's >>>>>>>> everything that's in a single lineage entry: >>>>>>>> https://github.com/dremio/dremio-oss/blob/master/services/accelerator/src/main/protobuf/reflection.proto#L80 >>>>>>>> The lineage is stored as a graph and not a list of entries. I think >>>>>>>> for >>>>>>>> what we are trying to achieve, it's more practical to limit the >>>>>>>> lineage to >>>>>>>> Iceberg sources. >>>>>>>> >>>>>>>> Thanks >>>>>>>> Benny >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Wed, May 15, 2024 at 12:06 AM Jan Kaul >>>>>>>> <jank...@mailbox.org.invalid> wrote: >>>>>>>> >>>>>>>>> I agree with Szehon and Benny that storing the lineage information >>>>>>>>> as multiple table properties is too brittle, especially for many >>>>>>>>> source >>>>>>>>> tables (base tables). I would prefer to have the whole lineage >>>>>>>>> information >>>>>>>>> in one entry as it is more concise. This is also how Trino has been >>>>>>>>> doing >>>>>>>>> it, as you can see here >>>>>>>>> <https://github.com/trinodb/trino/blob/212455d3e1d393f58cbc395d2b9da47ed8f23dd8/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java#L2915> >>>>>>>>> . >>>>>>>>> >>>>>>>>> As we've discussed in the google doc >>>>>>>>> <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit#heading=h.60qmzug7bzxc>: >>>>>>>>> it is helpful to also store the table identifiers of the source >>>>>>>>> tables to >>>>>>>>> enable clients to determine the freshness of the MV that don't >>>>>>>>> understand >>>>>>>>> the SQL dialect of the MV definition, like other query engines, BI >>>>>>>>> tools >>>>>>>>> and Dataframe libraries. This is also how Trino is doing it. That's >>>>>>>>> why we >>>>>>>>> chose the design in the google doc. >>>>>>>>> >>>>>>>>> Storing the storage table identifier as a property might work. >>>>>>>>> >>>>>>>>> Thanks, Jan >>>>>>>>> On 15.05.24 02:38, Walaa Eldin Moustafa wrote: >>>>>>>>> >>>>>>>>> Thanks Benny. My specific thoughts about the spec and the >>>>>>>>> properties are captured in the spec PR >>>>>>>>> https://github.com/apache/iceberg/pull/10280. The spec is also >>>>>>>>> implemented in the Spark implementation PR >>>>>>>>> https://github.com/apache/iceberg/pull/9830, and I believe this >>>>>>>>> follows the same nature of how the information was captured in >>>>>>>>> Netflix's >>>>>>>>> implementation with Spark, and Trino implementation (prior to >>>>>>>>> formalizing >>>>>>>>> through that spec), both of which have been used reliably for years. I >>>>>>>>> think it also aligns with Ryan's feedback here >>>>>>>>> https://github.com/apache/iceberg/issues/6420#issuecomment-1369280546 >>>>>>>>> which >>>>>>>>> indicated the usage of properties. >>>>>>>>> >>>>>>>>> The reasons for choosing properties: >>>>>>>>> * Not every table is a storage table and not every view is a >>>>>>>>> materialized view. I feel exposing the info as top level metadata is >>>>>>>>> an >>>>>>>>> overkill for the original object type. >>>>>>>>> * The properties are simple. They contain either single snapshot >>>>>>>>> ID each, or single view version each, or lastly, the storage table >>>>>>>>> identifier. Engines can use them without issues (also as shown in the >>>>>>>>> implementation). >>>>>>>>> * To be meaningful, the metadata fields should be captured in the >>>>>>>>> engine API as well, which is an effort that has to be pursued outside >>>>>>>>> the >>>>>>>>> Iceberg community. Until engine APIs evolve, we would have to define a >>>>>>>>> mapping between Iceberg metadata fields and engine properties (only >>>>>>>>> current >>>>>>>>> place in engine side to capture this info). This requires an >>>>>>>>> additional >>>>>>>>> spec on its own, and it will introduce complexities. Hence it is >>>>>>>>> always >>>>>>>>> cleaner to map Iceberg properties to engine properties and Iceberg >>>>>>>>> metadata >>>>>>>>> to designated engine APIs. Mixing and matching (e.g., Iceberg metadata >>>>>>>>> fields as engine properties) just for the lack of other cleaner >>>>>>>>> options >>>>>>>>> does not sound like a good idea in both short and long term. >>>>>>>>> >>>>>>>>> Let me know your thoughts. >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> Walaa. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Tue, May 14, 2024 at 5:12 PM Benny Chow <btc...@gmail.com> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> I agree with Szheon here. I think storing the materialization >>>>>>>>>> lineage as a bunch of properties is brittle. This lineage >>>>>>>>>> information is >>>>>>>>>> needed by engines to validate the staleness of a materialization and >>>>>>>>>> also >>>>>>>>>> to perform full or incremental refreshes. There’s a lot to capture >>>>>>>>>> here. >>>>>>>>>> >>>>>>>>>> Maybe we should drill down into the use cases first - such as >>>>>>>>>> incremental refresh with aggregates? (Pick a harder one first 😀) >>>>>>>>>> >>>>>>>>>> I left a few comments about this in the doc. I wonder what are >>>>>>>>>> your thoughts here Walaa? >>>>>>>>>> >>>>>>>>>> Thanks >>>>>>>>>> >>>>>>>>>> On May 14, 2024, at 4:20 PM, Walaa Eldin Moustafa < >>>>>>>>>> wa.moust...@gmail.com> wrote: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Thanks John. The current metadata does not sound complex. We need >>>>>>>>>> to track the underlying table snapshot IDs as well as the view >>>>>>>>>> version ID. >>>>>>>>>> I agree as long as it is simple and before this feature fully >>>>>>>>>> matures, we >>>>>>>>>> should track it in properties. >>>>>>>>>> >>>>>>>>>> One important factor for me (apart from the API effort, >>>>>>>>>> especially on the engine side), is that not each table is an MV >>>>>>>>>> storage >>>>>>>>>> table. Surfacing MV-specific storage table properties as first class >>>>>>>>>> table >>>>>>>>>> metadata sounds to impose this metadata on every table, when it is >>>>>>>>>> not >>>>>>>>>> required for normal table operation (yes, they can be optional, but >>>>>>>>>> it does >>>>>>>>>> not sound like the use case warrants exposing them as metadata >>>>>>>>>> fields yet). >>>>>>>>>> >>>>>>>>>> Similarly, since not every view is a materialized view, it sounds >>>>>>>>>> reasonable to keep MV-specific data in properties. >>>>>>>>>> >>>>>>>>>> UUID (for views), on the other hand, is common to all views, >>>>>>>>>> hence it made sense to add it as a top level field. >>>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> Walaa. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Tue, May 14, 2024 at 1:01 PM John Zhuge <jzh...@apache.org> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Hi Szheon, >>>>>>>>>>> >>>>>>>>>>> While I fully share your concern of abusing table properties, we >>>>>>>>>>> took the approach of option 1 and run it in production for several >>>>>>>>>>> years: >>>>>>>>>>> >>>>>>>>>>> - the feature was still evolving >>>>>>>>>>> - quick and simple implementation >>>>>>>>>>> - table properties are simple enough and not confusing >>>>>>>>>>> - haven't seen an urgent need to convert the properties to >>>>>>>>>>> metadata fields and add API >>>>>>>>>>> - do not wish on-disk changes (requiring lengthy >>>>>>>>>>> tedious migration) >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> That said, I am open to codifying the mv metadata into api and >>>>>>>>>>> spec, with the following considerations >>>>>>>>>>> >>>>>>>>>>> - mv metadata has reached maturity and consensus (could be >>>>>>>>>>> just a core portion) >>>>>>>>>>> - when mv metadata becomes too complex >>>>>>>>>>> - wish to support use cases that are quicker to adopt API >>>>>>>>>>> changes (than engines), e.g., using Iceberg library to >>>>>>>>>>> manipulate MVs, or >>>>>>>>>>> parsing metadata files directly >>>>>>>>>>> - Spark view catalog API can evolve separately from Iceberg >>>>>>>>>>> API and spec changes >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Thanks all for the great discussion! >>>>>>>>>>> >>>>>>>>>>> On Fri, May 10, 2024 at 10:48 PM Walaa Eldin Moustafa < >>>>>>>>>>> wa.moust...@gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hi Szheon, >>>>>>>>>>>> >>>>>>>>>>>> Thanks for the follow-up. It is possible some of the concerns >>>>>>>>>>>> were referring to the backend catalogs, but it is all connected. >>>>>>>>>>>> My main >>>>>>>>>>>> personal concern is from the engine connector APIs point of view, >>>>>>>>>>>> but I >>>>>>>>>>>> share the concern about the catalogs. >>>>>>>>>>>> >>>>>>>>>>>> I think everyone's concern is not about the complexity* per* >>>>>>>>>>>> backend catalog/engine catalog API (in which case adding new >>>>>>>>>>>> metadata to >>>>>>>>>>>> existing objects could require less code), but rather about the >>>>>>>>>>>> *number* of APIs and implementations that need to change (in >>>>>>>>>>>> which case both new metadata to existing objects and new objects >>>>>>>>>>>> altogether >>>>>>>>>>>> introduce equal complexity). >>>>>>>>>>>> >>>>>>>>>>>> Thanks, >>>>>>>>>>>> Walaa. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Fri, May 10, 2024 at 10:31 AM Szehon Ho < >>>>>>>>>>>> szehon.apa...@gmail.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hi Walaa >>>>>>>>>>>>> >>>>>>>>>>>>> OK thanks for confirming. I am still not 100% in agreement, >>>>>>>>>>>>> my understanding of the rationale for separate Table/View objects >>>>>>>>>>>>> in the >>>>>>>>>>>>> comment that you linked: >>>>>>>>>>>>> >>>>>>>>>>>>> I think the biggest problem with this is that we would need to >>>>>>>>>>>>>> modify every catalog to support this combination and that would >>>>>>>>>>>>>> be really >>>>>>>>>>>>>> difficult. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> is about JavaCatalogs /REST Catalog needing to to support >>>>>>>>>>>>> creating , persisting, and loading a MaterializedView object, >>>>>>>>>>>>> which is much >>>>>>>>>>>>> more complex. See HiveView PR for example : >>>>>>>>>>>>> https://github.com/apache/iceberg/pull/9852 We would have >>>>>>>>>>>>> to do the same exercise for persisting MV. >>>>>>>>>>>>> >>>>>>>>>>>>> In our case though, there's not much complexity regardless of >>>>>>>>>>>>> approach ('properties' or new metadata fields), in terms of Java >>>>>>>>>>>>> Catalog/REST Catalog. It's mostly pass-through to storage. >>>>>>>>>>>>> Looks like you >>>>>>>>>>>>> are referring to Spark's View model in terms of complexity, which >>>>>>>>>>>>> may be a >>>>>>>>>>>>> different story, but not sure if it is a good rationale to make >>>>>>>>>>>>> Iceberg to >>>>>>>>>>>>> use 'properties' . >>>>>>>>>>>>> >>>>>>>>>>>>> 'properties' is for read/write configurations, not to save >>>>>>>>>>>>> metadatas. To me, its also brittle to save important metadata, >>>>>>>>>>>>> as it's not >>>>>>>>>>>>> in the defined schema. >>>>>>>>>>>>> >>>>>>>>>>>>> A string to string map of table properties. This is used to >>>>>>>>>>>>>> control settings that affect reading and writing and is not >>>>>>>>>>>>>> intended to be >>>>>>>>>>>>>> used for arbitrary metadata. For example, >>>>>>>>>>>>>> commit.retry.num-retries is used to control the number of >>>>>>>>>>>>>> commit retries. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On the other hand, the Draft Spec suggests to save `lineage` >>>>>>>>>>>>> as a modeled field on the Storage Table's snapshot metadata. >>>>>>>>>>>>> This allows >>>>>>>>>>>>> you to 'time travel', 'branch', and have this metadata life cycle >>>>>>>>>>>>> integrated via normal snapshots lifecycle operations. >>>>>>>>>>>>> >>>>>>>>>>>>> So that's my rationale. Not sure if we can come to an >>>>>>>>>>>>> agreement over email though, and may need others to chime in as >>>>>>>>>>>>> well. >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks >>>>>>>>>>>>> Szehon >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Thu, May 9, 2024 at 11:58 PM Walaa Eldin Moustafa < >>>>>>>>>>>>> wa.moust...@gmail.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Hi Szehon, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Yes, you are reading the PR correctly, and interpreting the >>>>>>>>>>>>>> meaning of properties correctly. I think the reply you pasted >>>>>>>>>>>>>> from Ryan >>>>>>>>>>>>>> refers to the same concept as well. >>>>>>>>>>>>>> >>>>>>>>>>>>>> For the initial Google doc and the issue (by the way it is an >>>>>>>>>>>>>> issue, not a PR), yes both are proposing new metadata fields. >>>>>>>>>>>>>> >>>>>>>>>>>>>> The references I made to the modeling doc [1, 2] are reasons >>>>>>>>>>>>>> why new APIs are not desired. The cons/concerns applicable to >>>>>>>>>>>>>> new MV >>>>>>>>>>>>>> metadata apply by extension to new table and view metadata >>>>>>>>>>>>>> fields. >>>>>>>>>>>>>> >>>>>>>>>>>>>> The reason why new metadata adds complexity is that this new >>>>>>>>>>>>>> metadata needs to be propagated to the engine API. For example, >>>>>>>>>>>>>> here is the >>>>>>>>>>>>>> ViewInfo [3] class in the Spark catalog, which is used in view >>>>>>>>>>>>>> methods like >>>>>>>>>>>>>> createView. Its fields correspond with the Iceberg metadata. >>>>>>>>>>>>>> Adding new >>>>>>>>>>>>>> Iceberg fields should be accompanied with new fields in the >>>>>>>>>>>>>> engine >>>>>>>>>>>>>> catalog/connector APIs, which was a major reason for rejecting >>>>>>>>>>>>>> the combined >>>>>>>>>>>>>> MV object model as well. >>>>>>>>>>>>>> >>>>>>>>>>>>>> [1] >>>>>>>>>>>>>> https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1&disco=AAABK7e3QB4 >>>>>>>>>>>>>> [2] >>>>>>>>>>>>>> https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1&disco=AAABIonvCGE >>>>>>>>>>>>>> [3] >>>>>>>>>>>>>> https://github.com/apache/spark/blob/2df494fd4e4e64b9357307fb0c5e8fc1b7491ac3/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/ViewInfo.java#L45 >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>> Walaa. >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Thu, May 9, 2024 at 11:30 PM Szehon Ho < >>>>>>>>>>>>>> szehon.apa...@gmail.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi Walaa >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> As there may be confusion in the word 'properties', I want >>>>>>>>>>>>>>> to double check if we are talking about the same thing here. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I am reading your PR as adding lineage metadata as new >>>>>>>>>>>>>>> key/value pair under the storage Table's 'properties' field: >>>>>>>>>>>>>>> https://github.com/apache/iceberg/blob/main/format/spec.md?plain=1#L677 >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> *optional* *optional* *properties* A string to string map >>>>>>>>>>>>>>> of table properties. This is used to control settings that >>>>>>>>>>>>>>> affect reading >>>>>>>>>>>>>>> and writing and is not intended to be used for arbitrary >>>>>>>>>>>>>>> metadata. For >>>>>>>>>>>>>>> example, commit.retry.num-retries is used to control the >>>>>>>>>>>>>>> number of commit retries. >>>>>>>>>>>>>>> and adding Storage Table pointer as key/value pair in the >>>>>>>>>>>>>>> View's 'properties' field: >>>>>>>>>>>>>>> https://github.com/apache/iceberg/blob/main/format/view-spec.md?plain=1#L65 >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> *optional* properties A string to string map of view >>>>>>>>>>>>>>> properties [2] >>>>>>>>>>>>>>> Is that correct? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On the other hand, I was talking about adding this metadata >>>>>>>>>>>>>>> as actual fields, as is described in the Draft Spec of the >>>>>>>>>>>>>>> Design Doc >>>>>>>>>>>>>>> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A >>>>>>>>>>>>>>> and >>>>>>>>>>>>>>> first PR https://github.com/apache/iceberg/issues/6420 . >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Do you mean, the vote means we cannot model new fields like >>>>>>>>>>>>>>> 'materialization' and 'lineage' as was proposed there ? If >>>>>>>>>>>>>>> that is the >>>>>>>>>>>>>>> interpretation, I am not sure I agree. I dont fully see how >>>>>>>>>>>>>>> new fields >>>>>>>>>>>>>>> adds more catalog implementation complexity over new key/value >>>>>>>>>>>>>>> properties? >>>>>>>>>>>>>>> To me, the vote seemed to just rule out using a combined >>>>>>>>>>>>>>> catalog object >>>>>>>>>>>>>>> (MaterializedView) in favor of re-using the Table and View >>>>>>>>>>>>>>> metadata models, >>>>>>>>>>>>>>> not to prevent change to the Table and View model. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>> Szehon >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Thu, May 9, 2024 at 10:05 PM Walaa Eldin Moustafa < >>>>>>>>>>>>>>> wa.moust...@gmail.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hi Szehon, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I think choosing separate view + table objects precludes us >>>>>>>>>>>>>>>> from adding new metadata to table and view metadata. Here is >>>>>>>>>>>>>>>> one relevant >>>>>>>>>>>>>>>> comment [1] from Ryan on the modeling doc, where his point is >>>>>>>>>>>>>>>> that we want >>>>>>>>>>>>>>>> to avoid introducing new APIs since it requires updating every >>>>>>>>>>>>>>>> catalog, and >>>>>>>>>>>>>>>> (quoting) even now, we have few implementations that support >>>>>>>>>>>>>>>> views because >>>>>>>>>>>>>>>> of the problems updating back ends. Therefore, one of the >>>>>>>>>>>>>>>> major reasons to >>>>>>>>>>>>>>>> avoid a new model with new metadata is to avoid adding new >>>>>>>>>>>>>>>> metadata, which >>>>>>>>>>>>>>>> introduces this complexity. Here is another similar comment >>>>>>>>>>>>>>>> from Renjie [2] >>>>>>>>>>>>>>>> on the cons listed for the combined object approach. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Even Ryan's point on the MV issue that you referenced reads >>>>>>>>>>>>>>>> to me as he is supportive of the property model. Here are some >>>>>>>>>>>>>>>> quotes: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> > We would still want some MV metadata in table >>>>>>>>>>>>>>>> *properties*. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> > I recommend instead reusing the existing snapshot >>>>>>>>>>>>>>>> metadata structure to store what you need as snapshot >>>>>>>>>>>>>>>> *properties*. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> > First, I think we want to avoid keeping much state >>>>>>>>>>>>>>>> information in complex table *properties*. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Again, here, he is supportive of table properties, but >>>>>>>>>>>>>>>> wants to make sure that the information is simple. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> > We may want additional metadata as well, like a UUID to >>>>>>>>>>>>>>>> ensure we have the right view. I don't think we have a UUID in >>>>>>>>>>>>>>>> the view >>>>>>>>>>>>>>>> spec yet, but we could add one. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Here, he is very specific when it comes to new metadata >>>>>>>>>>>>>>>> fields, and explicitly calls it out. That is the only new >>>>>>>>>>>>>>>> metadata field in >>>>>>>>>>>>>>>> that reply and by now it is already supported. It is also not >>>>>>>>>>>>>>>> MV-specific. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hope this addresses your question on the property vs new >>>>>>>>>>>>>>>> metadata model. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> [1] >>>>>>>>>>>>>>>> https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1&disco=AAABK7e3QB4 >>>>>>>>>>>>>>>> [2] >>>>>>>>>>>>>>>> https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1&disco=AAABIonvCGE >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>> Walaa. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Thu, May 9, 2024 at 5:49 PM Szehon Ho < >>>>>>>>>>>>>>>> szehon.apa...@gmail.com> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Hi Walaa, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I agree, I definitely do not want yet another pr/doc where >>>>>>>>>>>>>>>>> discussion happens. as its already quite spread out :) But >>>>>>>>>>>>>>>>> did not want to >>>>>>>>>>>>>>>>> clarify some points before we get started on the discussion >>>>>>>>>>>>>>>>> on your PR. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> With reusing the table and view objects, we are not >>>>>>>>>>>>>>>>>> changing the existing metadata of either table or view spec >>>>>>>>>>>>>>>>>> but rather >>>>>>>>>>>>>>>>>> introduce new properties and formalize them to express >>>>>>>>>>>>>>>>>> materialized views >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On this point, I am not 100% sure that choosing to >>>>>>>>>>>>>>>>> represent a MaterializedView as a separate View + Table >>>>>>>>>>>>>>>>> object precludes us >>>>>>>>>>>>>>>>> from adding to metadata of Table or View as the Draft Spec >>>>>>>>>>>>>>>>> suggested, >>>>>>>>>>>>>>>>> though. I think this point was discussed in Jan's initial PR >>>>>>>>>>>>>>>>> with a good >>>>>>>>>>>>>>>>> point from Ryan: >>>>>>>>>>>>>>>>> https://github.com/apache/iceberg/issues/6420#issuecomment-1369280546 >>>>>>>>>>>>>>>>> that >>>>>>>>>>>>>>>>> using Table Properties to track lineage is fairly brittle, >>>>>>>>>>>>>>>>> and having it >>>>>>>>>>>>>>>>> formalized in the Iceberg metadata is cleaner, and that was >>>>>>>>>>>>>>>>> thus the >>>>>>>>>>>>>>>>> direction of the Draft Spec in the design doc. What do >>>>>>>>>>>>>>>>> people think? >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>>>> Szehon >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Thu, May 9, 2024 at 5:35 PM Walaa Eldin Moustafa < >>>>>>>>>>>>>>>>> wa.moust...@gmail.com> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Thanks Szehon. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> The reason for the difference is that the proposal in the >>>>>>>>>>>>>>>>>> Google doc is based on a new MV model, hence, new metadata >>>>>>>>>>>>>>>>>> fields and a new >>>>>>>>>>>>>>>>>> metadata model were being introduced (with types, >>>>>>>>>>>>>>>>>> optionality, etc). With >>>>>>>>>>>>>>>>>> reusing the table and view objects, we are not changing the >>>>>>>>>>>>>>>>>> existing >>>>>>>>>>>>>>>>>> metadata of either table or view spec but rather introduce >>>>>>>>>>>>>>>>>> new properties >>>>>>>>>>>>>>>>>> and formalize them to express materialized views. This would >>>>>>>>>>>>>>>>>> be the answer >>>>>>>>>>>>>>>>>> to most of the questions you posted on the PR (besides some >>>>>>>>>>>>>>>>>> naming >>>>>>>>>>>>>>>>>> questions, which I think should be straightforward). >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> With that fundamental difference, we cannot lift and >>>>>>>>>>>>>>>>>> shift what is in the doc to any PR. Further, having >>>>>>>>>>>>>>>>>> consensus on separate >>>>>>>>>>>>>>>>>> table and view objects contradicts with the point being made >>>>>>>>>>>>>>>>>> on having >>>>>>>>>>>>>>>>>> consensus on the doc. We might have had agreements on some >>>>>>>>>>>>>>>>>> elements, but >>>>>>>>>>>>>>>>>> definitely not on the whole doc, proven by the follow ups >>>>>>>>>>>>>>>>>> (also as a >>>>>>>>>>>>>>>>>> community, not individuals). >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Therefore: we need a new space to discuss the separate >>>>>>>>>>>>>>>>>> table and view properties. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Is the question whether to: >>>>>>>>>>>>>>>>>> 1- Create a new doc >>>>>>>>>>>>>>>>>> 2- Create a new PR? >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I feel a PR is the most effective way, especially given >>>>>>>>>>>>>>>>>> the fact that we discussed the topic a lot by now. If we >>>>>>>>>>>>>>>>>> agree, we can >>>>>>>>>>>>>>>>>> continue the discussion on the PR, else, we can create a doc. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>> Walaa. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Thu, May 9, 2024 at 4:39 PM Szehon Ho < >>>>>>>>>>>>>>>>>> szehon.apa...@gmail.com> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Thanks Walaa for driving it forward, looking forward to >>>>>>>>>>>>>>>>>>> thinking about implementation of Materialized Views. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> I see Jan's point, the PR spec change is similar but >>>>>>>>>>>>>>>>>>> does not seem to be completely aligned with the Draft Spec >>>>>>>>>>>>>>>>>>> in the design >>>>>>>>>>>>>>>>>>> doc: >>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/ >>>>>>>>>>>>>>>>>>> . I left my comments on PR of those sections with the >>>>>>>>>>>>>>>>>>> links to the >>>>>>>>>>>>>>>>>>> difference. I think most of those Draft Spec proposal is >>>>>>>>>>>>>>>>>>> still applicable >>>>>>>>>>>>>>>>>>> after the decision to have separate Table and View objects >>>>>>>>>>>>>>>>>>> It will be >>>>>>>>>>>>>>>>>>> interesting to at least see drill a bit further why we did >>>>>>>>>>>>>>>>>>> not choose the >>>>>>>>>>>>>>>>>>> approach in the Draft Spec and chose another way. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>>>>>> Szehon >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On Wed, May 8, 2024 at 4:48 AM Jan Kaul >>>>>>>>>>>>>>>>>>> <jank...@mailbox.org.invalid> >>>>>>>>>>>>>>>>>>> <jank...@mailbox.org.invalid> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Well, everybody that actively contributed to the >>>>>>>>>>>>>>>>>>>> discussion on the original google doc was in consensus. >>>>>>>>>>>>>>>>>>>> That's why I >>>>>>>>>>>>>>>>>>>> brought up the topic at the Community Sync on the >>>>>>>>>>>>>>>>>>>> 2024-02-14 ( >>>>>>>>>>>>>>>>>>>> https://youtu.be/uAQVGd5zV4I?t=890) to raise the >>>>>>>>>>>>>>>>>>>> awareness of the broader community. After which the >>>>>>>>>>>>>>>>>>>> discussion about the >>>>>>>>>>>>>>>>>>>> storage model started. I don't think that the discussion >>>>>>>>>>>>>>>>>>>> about a single >>>>>>>>>>>>>>>>>>>> aspect of a proposal should invalidate all other aspects >>>>>>>>>>>>>>>>>>>> of the proposal. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Regardless, the state of the proposal from the original >>>>>>>>>>>>>>>>>>>> google doc contains a lot of valuable contributions from >>>>>>>>>>>>>>>>>>>> Micah, Szehon, >>>>>>>>>>>>>>>>>>>> Jack, Dan, yourself and others and it should at least >>>>>>>>>>>>>>>>>>>> provide the basis for >>>>>>>>>>>>>>>>>>>> any further discussion. I don't think it's effective to >>>>>>>>>>>>>>>>>>>> start with a >>>>>>>>>>>>>>>>>>>> completely different design because we are bound to have >>>>>>>>>>>>>>>>>>>> the same >>>>>>>>>>>>>>>>>>>> discussions all over again. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Thanks, Jan >>>>>>>>>>>>>>>>>>>> On 08.05.24 12:11, Walaa Eldin Moustafa wrote: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> The only consensus the community had was on the object >>>>>>>>>>>>>>>>>>>> model through the most recent voting thread [1]. This kind >>>>>>>>>>>>>>>>>>>> of consensus was >>>>>>>>>>>>>>>>>>>> not present during the doc discussions, and this should be >>>>>>>>>>>>>>>>>>>> evident from the >>>>>>>>>>>>>>>>>>>> fact the last doc state listed 5 alternatives with no >>>>>>>>>>>>>>>>>>>> particular >>>>>>>>>>>>>>>>>>>> conclusion. I am not quite sure what type of consensus we >>>>>>>>>>>>>>>>>>>> are referring to >>>>>>>>>>>>>>>>>>>> here given all the follow up discussions, alternatives, >>>>>>>>>>>>>>>>>>>> etc. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Due to the separate object model, the PR is >>>>>>>>>>>>>>>>>>>> fundamentally different from the doc in the sense it does >>>>>>>>>>>>>>>>>>>> not propose a new >>>>>>>>>>>>>>>>>>>> metadata model but rather formalizes some new table and >>>>>>>>>>>>>>>>>>>> view properties >>>>>>>>>>>>>>>>>>>> related to MVs. That is also one reason there are no >>>>>>>>>>>>>>>>>>>> repeated discussions. >>>>>>>>>>>>>>>>>>>> That said, if you feel there is a repeated discussion >>>>>>>>>>>>>>>>>>>> (which I do not see >>>>>>>>>>>>>>>>>>>> so far), it would be best to link the relevant discussion >>>>>>>>>>>>>>>>>>>> from the doc in a >>>>>>>>>>>>>>>>>>>> comment. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Happy to move the discussion elsewhere if there is >>>>>>>>>>>>>>>>>>>> sufficient support for this idea, but as things stand, I >>>>>>>>>>>>>>>>>>>> do not see this as >>>>>>>>>>>>>>>>>>>> an efficient way to make progress. It sounds we have been >>>>>>>>>>>>>>>>>>>> re-emphasizing >>>>>>>>>>>>>>>>>>>> the same points in the last two replies, so I will let >>>>>>>>>>>>>>>>>>>> others chime in at >>>>>>>>>>>>>>>>>>>> this point. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> [1] >>>>>>>>>>>>>>>>>>>> https://lists.apache.org/thread/rotmqzmwk5jrcsyxhzjhrvcjs5v3yjcc >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>>> Walaa. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> On Wed, May 8, 2024 at 2:31 AM Jan Kaul >>>>>>>>>>>>>>>>>>>> <jank...@mailbox.org.invalid> >>>>>>>>>>>>>>>>>>>> <jank...@mailbox.org.invalid> wrote: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> The original google doc >>>>>>>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?usp=sharing> >>>>>>>>>>>>>>>>>>>>> discussed multiple >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>