Let me recap and see if we are on the same page. 1) we have some consensus on the refresh-state on the storage table. it would contain these fields: UUID, snapshot-id (for table) or version-id (for view), namespace, table. 2) there is no consensus if lineage info is needed in the view definition. we want to defer/postpone that decision.
I have a little concern on postponing the decision on the lineage info. It is not a blocker to move forward. But if later we decided to add it to the spec, it poses new requirements for the writer/reader. That would be a breaking change. On Tue, Oct 1, 2024 at 8:03 PM Benny Chow <btc...@gmail.com> wrote: > Hi Jan > > Both sound good to me. (No lineage in views and assumption about UUIDs > being unique across catalogs). I hope we get to voting soon on your PR.. > > Thanks > Benny > > On Sat, Sep 28, 2024 at 10:52 AM Jan Kaul <jank...@mailbox.org.invalid> > wrote: > >> Hi Benny, >> >> thanks for bringing up the UUID issue. It is my understanding that UUIDs >> are designed to be used in distributed systems without the need for a >> central coordination process. This is in line with the description from the >> RFC4122 >> document <https://datatracker.ietf.org/doc/html/rfc4122> where it says: >> "One of the main reasons for using UUIDs is that no centralized authority >> is required to administer them". >> >> Wikipedia <https://en.wikipedia.org/wiki/Universally_unique_identifier> >> says: >> >> "While the probability that a UUID will be duplicated is not zero, it is >> generally considered close enough to zero to be negligible" >> >> Consequently, I would argue that it is a reasonable assumption that UUIDs >> are unique across catalogs. >> >> >> Regarding the identifier/catalog-alias problem: As we can fallback to SQL >> parsing and don't require the lineage, I would propose to move ahead >> without the lineage. Especially as this seems to be a problem with the View >> Spec that we can't solve now. If there is a demand to add the lineage in >> the future, once the catalog-alias problem has been solved, we can still >> add it then. >> >> Let me know your thoughts. >> >> Best wishes, >> >> Jan >> On 28.09.24 07:45, Benny Chow wrote: >> >> >> storing the lineage is an optimization that can avoid >> recomputation/re-parsing. >> I don't think having the lineage is optimizing much over re-parsing the >> SQL. The most expensive part of SQL parsing is catalog access which has to >> happen with lineage anyway. Once the planner has the query tree, it can >> validate the freshness. It's not like the planner needs to complete >> logical and physical planning. >> >> >> We could also have the catalog name/alias problem for the same engine. >> Yes, this is a general problem with the Iceberg View spec. I guess if >> two different Spark clusters wanted to share the same view, they best not >> reference the catalog name in their SQLs. Even then, cross catalog joins >> are not going to work. Again, these are problems with the Iceberg View >> spec. I think for the MV spec, as long as we don't propose something that >> involves SQL identifiers, then the MV spec isn't making this different >> engine problem worse. >> >> There's another issue I'd like to bring up about using UUIDs which is >> that these UUIDs are client generated and there's no validation that they >> are indeed globally unique identifiers. The catalog just persists whatever >> it is given without validating that the UUIDs are indeed UUIDs and unique >> across the catalog. (I know Nessie is not doing this validation). We are >> assuming this UUID is not only unique within a catalog but is also unique >> across catalogs. Thoughts on this? >> >> Thanks >> Benny >> >> >> >> On Wed, Sep 25, 2024 at 8:01 PM Steven Wu <stevenz...@gmail.com> wrote: >> >>> I agree that it is reasonable to assume/restrict view definition and >>> storage table in the same catalog. Hence the storage table reference in the >>> view metadata can include only namespace and table (excluding the engine >>> dependent catalog name/alias). >>> >>> Regarding the question of having lineage metadata in view definition vs >>> re-parsing SQL, I guess storing the lineage is an optimization that can >>> avoid recomputation/re-parsing. would be good to have more input. >>> >>> Thinking about catalog name/alias again. For the same engine (like >>> Spark), different applications/jobs may configure the catalog name >>> differently. E.g. Spark catalogs are configured using properties under >>> spark.sql.catalog.(catalog_name). We could also have the catalog >>> name/alias problem for the same engine. >>> >>> >>> >>> >>> On Fri, Sep 20, 2024 at 12:16 AM Jan Kaul <jank...@mailbox.org.invalid> >>> <jank...@mailbox.org.invalid> wrote: >>> >>>> Hi Walaa, >>>> >>>> It appears that you would like to maintain the lineage structure and >>>> not revert to parsing the SQL to obtain identifiers. >>>> >>>> Initially, one of the reasons for avoiding SQL parsing was to enable >>>> consumers who don't understand the SQL dialect of any representation to >>>> determine the freshness of the Materialized View (MV). However, with the >>>> "catalog alias" issue, having an identifier for some representation is >>>> insufficient, as the *catalog_name* is unlikely to work for the >>>> consumer. Therefore, supporting consumers that don't use a query engine of >>>> any representation seems impossible. >>>> >>>> Given this, parsing the SQL definition becomes a less significant >>>> drawback, as the consumer must understand the dialect anyway. In fact, >>>> simply parsing the SQL definition seems like a more robust and >>>> straightforward solution than using a lineage for every representation. I >>>> believe this is why Benny suggested reverting to SQL parsing, and I agree >>>> with him. >>>> >>>> Regarding the Storage table identifier: Its design as a >>>> *PartialIdentifier* with only namespace and name fields was >>>> intentional, to avoid the *catalog_name* issue. >>>> >>>> Best regards, >>>> >>>> Jan >>>> On 19.09.24 23:16, Benny Chow wrote: >>>> >>>> If Spark added the storage table identifier to the MV, I'm not sure how >>>> it could also add a full identifier to the Dremio representation. >>>> Spark doesn't know what name Dremio used for the catalog. >>>> >>>> For the UX issue, I think Jan cleverly called it a "PartialIdentifier" >>>> and not a "FullIdentifier" to indicate that catalog name is not even a >>>> property of the identifier. >>>> >>>> Requirement 3 is for the view's SQL. I'm not sure there is a very >>>> strong use case to put the storage table into a different catalog than the >>>> view. If we had an engine agnostic solution for it, I'm all for it >>>> though... >>>> >>>> Thanks >>>> Benny >>>> >>>> >>>> On Thu, Sep 19, 2024 at 1:56 PM Walaa Eldin Moustafa < >>>> wa.moust...@gmail.com> wrote: >>>> >>>>> I think the solution for the storage identifier might be shared with >>>>> the end state solution for the lineage. One could imagine a "full >>>>> identifier" can be used for the storage table; however, it is >>>>> "representation"-dependent (i.e., it changes according to >>>>> which representation it is part of, or rather which engine uses it). >>>>> >>>>> Also, are we asking engines (or their Iceberg implementation) to throw >>>>> an exception if the full storage table identifier was provided as part of >>>>> the MV definition? Sounds like a not very ideal UX. Note that it also >>>>> conflicts with the spirit of requirement #3. >>>>> >>>>> Thanks, >>>>> Walaa. >>>>> >>>>> On Thu, Sep 19, 2024 at 10:02 AM Benny Chow <btc...@gmail.com> wrote: >>>>> >>>>>> Hi Jan >>>>>> >>>>>> "PartialIdentifier" without the catalog name sounds good to me. The >>>>>> storage table and MV have to be in the same catalog. That would be a >>>>>> good >>>>>> fifth requirement to add to the list. >>>>>> >>>>>> Thanks >>>>>> Benny >>>>>> >>>>>> On Thu, Sep 19, 2024 at 1:27 AM Jan Kaul >>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid> wrote: >>>>>> >>>>>>> Cool, I guess it's easier to resolve these kind of things when >>>>>>> talking in person. >>>>>>> >>>>>>> I agree with your requirements and the conclusion to use a map from >>>>>>> UUID to snapshot-id/version-id as the refresh-state, as well as dropping >>>>>>> the lineage in favor to just re-parsing the SQL query. This gets us >>>>>>> around >>>>>>> the "catalog alias" issue. >>>>>>> >>>>>>> And I'm also OK with every engine requiring their own representation >>>>>>> to use the MV. >>>>>>> >>>>>>> There is still the issue with the identifier of the storage table >>>>>>> and its catalog_name. Should we use an "PartialIdentifier" with a >>>>>>> namespace >>>>>>> and a name field, like so: >>>>>>> >>>>>>> { >>>>>>> >>>>>>> namespace: ["bronze"], >>>>>>> >>>>>>> name: "lineitem" >>>>>>> >>>>>>> } >>>>>>> >>>>>>> And require the storage table to be in the same catalog as the MV >>>>>>> itself? >>>>>>> >>>>>>> Thanks, >>>>>>> >>>>>>> Jan >>>>>>> On 19.09.24 00:50, Benny Chow wrote: >>>>>>> >>>>>>> Steven and I met up yesterday at the Seattle Iceberg meetup and we >>>>>>> got to talking about the "catalog alias" issue. He described it as an >>>>>>> annoying problem =p >>>>>>> >>>>>>> I think there are some key requirements we need to support: >>>>>>> >>>>>>> 1. Different engines can produce and consume shared MVs with >>>>>>> freshness validation. >>>>>>> 2. We cannot force different engines to standardize on the alias >>>>>>> they use for the catalog. >>>>>>> 3. We cannot force different SQL representations to exclude catalog >>>>>>> names from table identifiers or not use fully qualified table names. >>>>>>> 4. MV SQL can join tables and views from multiple catalogs -> >>>>>>> Inevitable with Nessie, Polaris, Unity, Tabular and others... >>>>>>> >>>>>>> The producing engine has to save refresh state information to let >>>>>>> consuming engine know that table X is at what snapshot at the time of >>>>>>> materialization. The only way to identify this table across different >>>>>>> catalog names is to use the cross catalog, globally unique UUID. I >>>>>>> think >>>>>>> our only option is to have the refresh state map UUID to snapshot ids >>>>>>> and >>>>>>> view version ids. >>>>>>> >>>>>>> Assuming the above is how we store the refresh state, how does the >>>>>>> consuming engine determine the current snapshot ids? The consuming >>>>>>> engine >>>>>>> will have to fully expand the query tree at which point it will have the >>>>>>> UUIDs as well as the latest snapshot ids/view versions. This can then >>>>>>> be >>>>>>> diffed against the materialization refresh state to determine freshness. >>>>>>> There isn't a need to store the view lineage information to map from >>>>>>> UUID >>>>>>> to the consumer specific identifier so that the consumer can then call >>>>>>> back >>>>>>> into the catalog with that identifier to get the latest state. The >>>>>>> consuming engine might as well just re-parse the SQL and expand the >>>>>>> query. >>>>>>> >>>>>>> Personally, I'm OK with requiring that an engine must have its own >>>>>>> SQL representation in order to use the MV. To me, being able to fulfill >>>>>>> the key requirements above is much more important. >>>>>>> >>>>>>> Thanks >>>>>>> Benny >>>>>>> >>>>>>> On Sat, Sep 14, 2024 at 2:01 AM Jan Kaul >>>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid> wrote: >>>>>>> >>>>>>>> How about we make the *catalog_name field* of the identifier >>>>>>>> optional? If the field is missing, it references a table/view in the >>>>>>>> same >>>>>>>> catalog. If it is present it has to be an engine agnostic catalog name. >>>>>>>> Shouldn't the catalog_names from the REST catalog spec be engine >>>>>>>> agnostic? >>>>>>>> >>>>>>>> I was wondering, is there no way to prescribe a catalog_name in >>>>>>>> Spark or Dremio? What do you do if you include two Nessie catalogs? >>>>>>>> They >>>>>>>> can't both be called LocalNessie. >>>>>>>> >>>>>>>> Thanks, >>>>>>>> >>>>>>>> Jan >>>>>>>> On 14.09.24 01:23, Benny Chow wrote: >>>>>>>> >>>>>>>> The main reason for putting the lineage into the view is so that >>>>>>>> "another" engine can enumerate out the tables in the view without >>>>>>>> needing >>>>>>>> to parse any SQL. But, if we put the lineage under the SQL >>>>>>>> representation >>>>>>>> with engine specific catalog names, the "other" engine is not going to >>>>>>>> be >>>>>>>> able to use those identifiers to look up the tables. The "other" >>>>>>>> engine >>>>>>>> can only lookup those identifiers using its engine specific catalog >>>>>>>> name. >>>>>>>> It may be possible to enumerate the tables at the view version level >>>>>>>> ONLY >>>>>>>> if those identifiers don't include the catalog name. However, if you >>>>>>>> have >>>>>>>> a view with a cross catalog join, then the tables coming from the other >>>>>>>> catalog have to be fully qualified. But then the problem is that each >>>>>>>> engine will also alias the other catalog differently too. >>>>>>>> >>>>>>>> So, I think to summarize *multi-engine* view interoperability: >>>>>>>> >>>>>>>> - default-catalog can't be specified >>>>>>>> - default-namespace can be specified >>>>>>>> - View SQL can only references tables/views from the same >>>>>>>> catalog >>>>>>>> >>>>>>>> I think these are reasonable constraints for multi-engine use >>>>>>>> cases. If reasonable, for MVs, then the storage table, refresh-state >>>>>>>> and >>>>>>>> lineage (at the view version level), could all be based on *engine >>>>>>>> agnostic* identifiers without the catalog name. The MV and >>>>>>>> storage table would have to be in the same catalog. >>>>>>>> >>>>>>>> Thanks >>>>>>>> Benny >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Fri, Sep 13, 2024 at 2:08 AM Jan Kaul >>>>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid> wrote: >>>>>>>> >>>>>>>>> Hi, >>>>>>>>> >>>>>>>>> regarding our recent discussion on table identifiers with respect >>>>>>>>> to different catalog_names with different query engines. We have the >>>>>>>>> same >>>>>>>>> problem when we want to reference the storage table from the common >>>>>>>>> view. >>>>>>>>> *If we include the catalog_name as part of the identifier, >>>>>>>>> different query engines might not be able to load the storage table. * >>>>>>>>> We could enforce that every storage table has to be part of the >>>>>>>>> same catalog as the main view. This way an identifier without the >>>>>>>>> catalog_name would be enough to point to the correct storage table. >>>>>>>>> >>>>>>>>> What are your thoughts on this? >>>>>>>>> >>>>>>>>> Best wishes, >>>>>>>>> >>>>>>>>> Jan >>>>>>>>> On 11.09.24 16:05, Walaa Eldin Moustafa wrote: >>>>>>>>> >>>>>>>>> I think this type of discussion is exactly what motivates a >>>>>>>>> clarification in the view spec so that we can resolve MV lineage. Will >>>>>>>>> create separate thread for view spec clarification. >>>>>>>>> >>>>>>>>> Following up on Jan’s point, yes I agree in order to support >>>>>>>>> catalog name, it should be at the representation level, but catalog >>>>>>>>> name >>>>>>>>> does not really depend on the “dialect” but rather on the “engine”; >>>>>>>>> hence >>>>>>>>> the discussion becomes a little more involved. >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> Walaa. >>>>>>>>> >>>>>>>>> On Wed, Sep 11, 2024 at 1:11 PM Jan Kaul >>>>>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid> wrote: >>>>>>>>> >>>>>>>>>> Hi Benny, >>>>>>>>>> >>>>>>>>>> I think that identifiers only being defined for a certain >>>>>>>>>> representation is exactly what we want. Each representation can >>>>>>>>>> define >>>>>>>>>> their own identifiers that then map to an UUID. This way the >>>>>>>>>> "catalog_name" >>>>>>>>>> of the identifier for a "Spark" dialect can be different then for a >>>>>>>>>> "Dremio" dialect. >>>>>>>>>> >>>>>>>>>> The important part is that we still have a list of identifiers >>>>>>>>>> for each representation that we can use with the catalog to obtain >>>>>>>>>> the >>>>>>>>>> state of the source tables. >>>>>>>>>> >>>>>>>>>> Best wishes, >>>>>>>>>> >>>>>>>>>> Jan >>>>>>>>>> On 11.09.24 01:33, Benny Chow wrote: >>>>>>>>>> >>>>>>>>>> Hi Walaa, I don't think the current view spec implicitly assumes >>>>>>>>>> a common catalog name between engines. I tested this by not >>>>>>>>>> specifying the >>>>>>>>>> default-catalog and both engines could look up the correct table >>>>>>>>>> under the >>>>>>>>>> shared default-namespace even when each engine uses a different >>>>>>>>>> catalog >>>>>>>>>> name. >>>>>>>>>> >>>>>>>>>> Hi Jan, I think the issue with putting the lineage as part of the >>>>>>>>>> representation is that that identifier only makes sense for that >>>>>>>>>> representation's engine. In your example, the catalog aliased as >>>>>>>>>> "iceberg" >>>>>>>>>> in spark is going to have a different name in Dremio or Trino. >>>>>>>>>> >>>>>>>>>> IMO, if we are to store a lineage for a view, it should consist >>>>>>>>>> of something engine agnostic like the table/view UUIDs. This would >>>>>>>>>> be >>>>>>>>>> stored at the view version level and not the representation level. >>>>>>>>>> I think >>>>>>>>>> as we get into more of these multi-engine, multi-catalog use cases >>>>>>>>>> for >>>>>>>>>> views, the Iceberg Catalog is going to need to do a better job at >>>>>>>>>> handling >>>>>>>>>> CRUD by UUID instead of engine specific identifiers. Another >>>>>>>>>> scenario we >>>>>>>>>> need to think through is a view that joins tables from two different >>>>>>>>>> catalogs. How would we represent the lineage for that in an engine >>>>>>>>>> agnostic way? >>>>>>>>>> >>>>>>>>>> Thanks >>>>>>>>>> Benny >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Tue, Sep 10, 2024 at 7:21 AM Jan Kaul >>>>>>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Thanks Walaa and Benny for clarifying the problem. I think I >>>>>>>>>>> have a better understanding now. Sorry for being a bit stubborn >>>>>>>>>>> before. >>>>>>>>>>> >>>>>>>>>>> Wouldn't it make sense then to store the lineage as part of the >>>>>>>>>>> representation: >>>>>>>>>>> >>>>>>>>>>> { >>>>>>>>>>> >>>>>>>>>>> "type": "sql", >>>>>>>>>>> >>>>>>>>>>> "sql": "SELECT\n COUNT(1), CAST(event_ts AS DATE)\nFROM >>>>>>>>>>> events\nGROUP BY 2", >>>>>>>>>>> >>>>>>>>>>> "dialect": "spark", >>>>>>>>>>> >>>>>>>>>>> "lineage": [{ >>>>>>>>>>> >>>>>>>>>>> "identifier": { "catalog": "iceberg", "namespace": >>>>>>>>>>> "public", "table": "events"}, >>>>>>>>>>> >>>>>>>>>>> "uuid": "fa6506c3-7681-40c8-86dc-e36561f83385" >>>>>>>>>>> >>>>>>>>>>> }] >>>>>>>>>>> >>>>>>>>>>> } >>>>>>>>>>> >>>>>>>>>>> Best wishes, >>>>>>>>>>> >>>>>>>>>>> Jan >>>>>>>>>>> On 09.09.24 11:59, Walaa Eldin Moustafa wrote: >>>>>>>>>>> >>>>>>>>>>> Benny, thank you so much for performing the experiment. Glad >>>>>>>>>>> that using UUIDs as keys in the state map makes more sense now. >>>>>>>>>>> >>>>>>>>>>> For the issue with the view spec being restrictive, I agree and >>>>>>>>>>> I have raised the concern on the view spec PR last year [1]. I >>>>>>>>>>> think there >>>>>>>>>>> is some area of improvement here. At the least, if it is >>>>>>>>>>> restrictive, it >>>>>>>>>>> should be explicitly stated. I will start a thread on how to >>>>>>>>>>> approach the >>>>>>>>>>> view spec. We may need to get more insight on the view spec before >>>>>>>>>>> finalizing the MV spec, because view spec will determine if we >>>>>>>>>>> should >>>>>>>>>>> proceed with one lineage (with the implicitly assumed common >>>>>>>>>>> catalog name), >>>>>>>>>>> or with multiple lineages (one per engine or catalog name). >>>>>>>>>>> >>>>>>>>>>> [1] >>>>>>>>>>> https://github.com/apache/iceberg/pull/7992#issuecomment-1763172619 >>>>>>>>>>> >>>>>>>>>>> Thanks, >>>>>>>>>>> Walaa. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Mon, Sep 9, 2024 at 3:28 AM Benny Chow <btc...@gmail.com> >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hi Walaa >>>>>>>>>>>> >>>>>>>>>>>> I did some testing with two different engines (Spark and >>>>>>>>>>>> Dremio) against the same Nessie catalog and created the attached >>>>>>>>>>>> materialized view metadata.json. I see your point now about the >>>>>>>>>>>> SQL >>>>>>>>>>>> identifiers being tightly coupled to the engines. In the metadata >>>>>>>>>>>> JSON, >>>>>>>>>>>> spark refers to the catalog as "SparkNessie", whereas Dremio >>>>>>>>>>>> refers to the >>>>>>>>>>>> catalog as "LocalNessie". So, this means that the fully qualified >>>>>>>>>>>> view and >>>>>>>>>>>> table identifiers are engine specific and Dremio can't lookup a >>>>>>>>>>>> Spark >>>>>>>>>>>> identifier and vice versa. >>>>>>>>>>>> >>>>>>>>>>>> *So, I think it does make sense now for the refresh-state to >>>>>>>>>>>> key off the UUIDs and not use engine specific identifiers. *This >>>>>>>>>>>> also means that the materization consumer will have to fully >>>>>>>>>>>> expand the >>>>>>>>>>>> query tree and basically diff the UUID + latest snapshot ids >>>>>>>>>>>> against the >>>>>>>>>>>> refresh state. Would it ever make sense for the Iceberg Catalog >>>>>>>>>>>> to expose >>>>>>>>>>>> a bulk lookup API by UUID? >>>>>>>>>>>> >>>>>>>>>>>> As a side note, it seems that for a materialized view to work >>>>>>>>>>>> with multiple engines, the default-catalog and default-namespace >>>>>>>>>>>> can't be >>>>>>>>>>>> used unless both engines use the same catalog name which seems >>>>>>>>>>>> pretty >>>>>>>>>>>> restrictive to me. >>>>>>>>>>>> >>>>>>>>>>>> Thanks for the great discussions >>>>>>>>>>>> Benny >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Sat, Sep 7, 2024 at 2:49 AM Walaa Eldin Moustafa < >>>>>>>>>>>> wa.moust...@gmail.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Jan, we definitely can store SQL identifiers of multiple >>>>>>>>>>>>> representations in Approach 1. >>>>>>>>>>>>> >>>>>>>>>>>>> The takeaway is that SQL identifiers are highly coupled with >>>>>>>>>>>>> engines, just like views. It makes sense to track both together >>>>>>>>>>>>> for >>>>>>>>>>>>> consistency. >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks, >>>>>>>>>>>>> Walaa. >>>>>>>>>>>>> >>>>>>>>>>>>> On Sat, Sep 7, 2024 at 8:15 AM Jan Kaul >>>>>>>>>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid> >>>>>>>>>>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Walaa, thanks you for bringing up this use case. I think we >>>>>>>>>>>>>> need to keep in mind that we require identifiers to interface >>>>>>>>>>>>>> with the >>>>>>>>>>>>>> catalog. We cannot use UUIDs. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Which means you also wouldn't be able to use Approach 1 for >>>>>>>>>>>>>> your use case because you can't store the catalog names of >>>>>>>>>>>>>> multiple >>>>>>>>>>>>>> representations in the lineage. You would need to fallback to >>>>>>>>>>>>>> parsing the >>>>>>>>>>>>>> SQL for a particular representation and rebuilding the full >>>>>>>>>>>>>> query tree to >>>>>>>>>>>>>> obtain the identifiers. >>>>>>>>>>>>>> >>>>>>>>>>>>>> You could do the same for Approach 2. So I don't see why >>>>>>>>>>>>>> Approach 1 would yield any benefits. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Jan >>>>>>>>>>>>>> On 07.09.24 00:01, Steven Wu wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> Benny, `default-catalog` is optional, while >>>>>>>>>>>>>> `default-namespace` is required. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I will retract my comment on the `summary`. it indicates the >>>>>>>>>>>>>> engine that made the revision to the current view version. it >>>>>>>>>>>>>> doesn't >>>>>>>>>>>>>> really matter for multi-engine/representation support. >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Fri, Sep 6, 2024 at 2:49 PM Benny Chow <btc...@gmail.com> >>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Steven - Ideally, the lineage is engine agnostic so I'd hope >>>>>>>>>>>>>>> it wouldn't have to be under a specific representation. >>>>>>>>>>>>>>> Walaa - That's a serious concern... If the same catalog is >>>>>>>>>>>>>>> aliased differently by two different engines, then the basic >>>>>>>>>>>>>>> view spec >>>>>>>>>>>>>>> seems broken to me since "default-namespace" includes the >>>>>>>>>>>>>>> catalog alias and >>>>>>>>>>>>>>> is outside of the SQL representation. Does that mean for a >>>>>>>>>>>>>>> view to be >>>>>>>>>>>>>>> interoperable, we require different engines to use the same >>>>>>>>>>>>>>> catalog name? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Fri, Sep 6, 2024 at 1:29 PM Steven Wu < >>>>>>>>>>>>>>> stevenz...@gmail.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Walaa, thanks for bringing up the interesting case of >>>>>>>>>>>>>>>> multiple representations (for different engines), which >>>>>>>>>>>>>>>> definitely requires >>>>>>>>>>>>>>>> more discussion from the community. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> When I am looking at the view spec, I am seeing some >>>>>>>>>>>>>>>> conflict. "summary" field seems meant for only one engine, >>>>>>>>>>>>>>>> while >>>>>>>>>>>>>>>> "representations" support multiple engines. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> "summary" : { >>>>>>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-16> >>>>>>>>>>>>>>>> "engine-name" : "Spark", >>>>>>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-17> >>>>>>>>>>>>>>>> "engineVersion" : "3.3.2" >>>>>>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-18> >>>>>>>>>>>>>>>> }, >>>>>>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-19> >>>>>>>>>>>>>>>> "representations" : [ { >>>>>>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-20> >>>>>>>>>>>>>>>> "type" : "sql", >>>>>>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-21> >>>>>>>>>>>>>>>> "sql" : "SELECT\n COUNT(1), CAST(event_ts AS DATE)\nFROM >>>>>>>>>>>>>>>> events\nGROUP BY >>>>>>>>>>>>>>>> 2", >>>>>>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-22> >>>>>>>>>>>>>>>> "dialect" : "spark" >>>>>>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-23> >>>>>>>>>>>>>>>> } ] >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> With multiple representations/engines, I guess one engine >>>>>>>>>>>>>>>> will be responsible for the storage table refresh and other >>>>>>>>>>>>>>>> engines are >>>>>>>>>>>>>>>> read only. If we want to store the lineage info in the view, >>>>>>>>>>>>>>>> it probably >>>>>>>>>>>>>>>> needs to be part of the "representation" struct so that each >>>>>>>>>>>>>>>> engine/representation stores its own lineage info.. >>>>>>>>>>>>>>>> Who is to validate/ensure that the SQL representation is >>>>>>>>>>>>>>>> actually semantically identical (minus syntax differences >>>>>>>>>>>>>>>> across engines)? >>>>>>>>>>>>>>>> I guess this responsibility is left to the user who owns and >>>>>>>>>>>>>>>> manages the >>>>>>>>>>>>>>>> view. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>