If Spark added the storage table identifier to the MV, I'm not sure how it could also add a full identifier to the Dremio representation. Spark doesn't know what name Dremio used for the catalog.
For the UX issue, I think Jan cleverly called it a "PartialIdentifier" and not a "FullIdentifier" to indicate that catalog name is not even a property of the identifier. Requirement 3 is for the view's SQL. I'm not sure there is a very strong use case to put the storage table into a different catalog than the view. If we had an engine agnostic solution for it, I'm all for it though... Thanks Benny On Thu, Sep 19, 2024 at 1:56 PM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote: > I think the solution for the storage identifier might be shared with the > end state solution for the lineage. One could imagine a "full identifier" > can be used for the storage table; however, it is > "representation"-dependent (i.e., it changes according to > which representation it is part of, or rather which engine uses it). > > Also, are we asking engines (or their Iceberg implementation) to throw an > exception if the full storage table identifier was provided as part of the > MV definition? Sounds like a not very ideal UX. Note that it also conflicts > with the spirit of requirement #3. > > Thanks, > Walaa. > > On Thu, Sep 19, 2024 at 10:02 AM Benny Chow <btc...@gmail.com> wrote: > >> Hi Jan >> >> "PartialIdentifier" without the catalog name sounds good to me. The >> storage table and MV have to be in the same catalog. That would be a good >> fifth requirement to add to the list. >> >> Thanks >> Benny >> >> On Thu, Sep 19, 2024 at 1:27 AM Jan Kaul <jank...@mailbox.org.invalid> >> wrote: >> >>> Cool, I guess it's easier to resolve these kind of things when talking >>> in person. >>> >>> I agree with your requirements and the conclusion to use a map from UUID >>> to snapshot-id/version-id as the refresh-state, as well as dropping the >>> lineage in favor to just re-parsing the SQL query. This gets us around the >>> "catalog alias" issue. >>> >>> And I'm also OK with every engine requiring their own representation to >>> use the MV. >>> >>> There is still the issue with the identifier of the storage table and >>> its catalog_name. Should we use an "PartialIdentifier" with a namespace and >>> a name field, like so: >>> >>> { >>> >>> namespace: ["bronze"], >>> >>> name: "lineitem" >>> >>> } >>> >>> And require the storage table to be in the same catalog as the MV itself? >>> >>> Thanks, >>> >>> Jan >>> On 19.09.24 00:50, Benny Chow wrote: >>> >>> Steven and I met up yesterday at the Seattle Iceberg meetup and we got >>> to talking about the "catalog alias" issue. He described it as an annoying >>> problem =p >>> >>> I think there are some key requirements we need to support: >>> >>> 1. Different engines can produce and consume shared MVs with freshness >>> validation. >>> 2. We cannot force different engines to standardize on the alias they >>> use for the catalog. >>> 3. We cannot force different SQL representations to exclude catalog >>> names from table identifiers or not use fully qualified table names. >>> 4. MV SQL can join tables and views from multiple catalogs -> Inevitable >>> with Nessie, Polaris, Unity, Tabular and others... >>> >>> The producing engine has to save refresh state information to let >>> consuming engine know that table X is at what snapshot at the time of >>> materialization. The only way to identify this table across different >>> catalog names is to use the cross catalog, globally unique UUID. I think >>> our only option is to have the refresh state map UUID to snapshot ids and >>> view version ids. >>> >>> Assuming the above is how we store the refresh state, how does the >>> consuming engine determine the current snapshot ids? The consuming engine >>> will have to fully expand the query tree at which point it will have the >>> UUIDs as well as the latest snapshot ids/view versions. This can then be >>> diffed against the materialization refresh state to determine freshness. >>> There isn't a need to store the view lineage information to map from UUID >>> to the consumer specific identifier so that the consumer can then call back >>> into the catalog with that identifier to get the latest state. The >>> consuming engine might as well just re-parse the SQL and expand the query. >>> >>> Personally, I'm OK with requiring that an engine must have its own SQL >>> representation in order to use the MV. To me, being able to fulfill the >>> key requirements above is much more important. >>> >>> Thanks >>> Benny >>> >>> On Sat, Sep 14, 2024 at 2:01 AM Jan Kaul <jank...@mailbox.org.invalid> >>> <jank...@mailbox.org.invalid> wrote: >>> >>>> How about we make the *catalog_name field* of the identifier optional? >>>> If the field is missing, it references a table/view in the same catalog. If >>>> it is present it has to be an engine agnostic catalog name. Shouldn't the >>>> catalog_names from the REST catalog spec be engine agnostic? >>>> >>>> I was wondering, is there no way to prescribe a catalog_name in Spark >>>> or Dremio? What do you do if you include two Nessie catalogs? They can't >>>> both be called LocalNessie. >>>> >>>> Thanks, >>>> >>>> Jan >>>> On 14.09.24 01:23, Benny Chow wrote: >>>> >>>> The main reason for putting the lineage into the view is so that >>>> "another" engine can enumerate out the tables in the view without needing >>>> to parse any SQL. But, if we put the lineage under the SQL representation >>>> with engine specific catalog names, the "other" engine is not going to be >>>> able to use those identifiers to look up the tables. The "other" engine >>>> can only lookup those identifiers using its engine specific catalog name. >>>> It may be possible to enumerate the tables at the view version level ONLY >>>> if those identifiers don't include the catalog name. However, if you have >>>> a view with a cross catalog join, then the tables coming from the other >>>> catalog have to be fully qualified. But then the problem is that each >>>> engine will also alias the other catalog differently too. >>>> >>>> So, I think to summarize *multi-engine* view interoperability: >>>> >>>> - default-catalog can't be specified >>>> - default-namespace can be specified >>>> - View SQL can only references tables/views from the same catalog >>>> >>>> I think these are reasonable constraints for multi-engine use cases. >>>> If reasonable, for MVs, then the storage table, refresh-state and lineage >>>> (at the view version level), could all be based on *engine agnostic* >>>> identifiers without the catalog name. The MV and storage table would have >>>> to be in the same catalog. >>>> >>>> Thanks >>>> Benny >>>> >>>> >>>> >>>> On Fri, Sep 13, 2024 at 2:08 AM Jan Kaul <jank...@mailbox.org.invalid> >>>> <jank...@mailbox.org.invalid> wrote: >>>> >>>>> Hi, >>>>> >>>>> regarding our recent discussion on table identifiers with respect to >>>>> different catalog_names with different query engines. We have the same >>>>> problem when we want to reference the storage table from the common view. >>>>> *If we include the catalog_name as part of the identifier, different >>>>> query engines might not be able to load the storage table. * >>>>> We could enforce that every storage table has to be part of the same >>>>> catalog as the main view. This way an identifier without the catalog_name >>>>> would be enough to point to the correct storage table. >>>>> >>>>> What are your thoughts on this? >>>>> >>>>> Best wishes, >>>>> >>>>> Jan >>>>> On 11.09.24 16:05, Walaa Eldin Moustafa wrote: >>>>> >>>>> I think this type of discussion is exactly what motivates a >>>>> clarification in the view spec so that we can resolve MV lineage. Will >>>>> create separate thread for view spec clarification. >>>>> >>>>> Following up on Jan’s point, yes I agree in order to support catalog >>>>> name, it should be at the representation level, but catalog name does not >>>>> really depend on the “dialect” but rather on the “engine”; hence the >>>>> discussion becomes a little more involved. >>>>> >>>>> Thanks, >>>>> Walaa. >>>>> >>>>> On Wed, Sep 11, 2024 at 1:11 PM Jan Kaul <jank...@mailbox.org.invalid> >>>>> <jank...@mailbox.org.invalid> wrote: >>>>> >>>>>> Hi Benny, >>>>>> >>>>>> I think that identifiers only being defined for a certain >>>>>> representation is exactly what we want. Each representation can define >>>>>> their own identifiers that then map to an UUID. This way the >>>>>> "catalog_name" >>>>>> of the identifier for a "Spark" dialect can be different then for a >>>>>> "Dremio" dialect. >>>>>> >>>>>> The important part is that we still have a list of identifiers for >>>>>> each representation that we can use with the catalog to obtain the state >>>>>> of >>>>>> the source tables. >>>>>> >>>>>> Best wishes, >>>>>> >>>>>> Jan >>>>>> On 11.09.24 01:33, Benny Chow wrote: >>>>>> >>>>>> Hi Walaa, I don't think the current view spec implicitly assumes a >>>>>> common catalog name between engines. I tested this by not specifying the >>>>>> default-catalog and both engines could look up the correct table under >>>>>> the >>>>>> shared default-namespace even when each engine uses a different catalog >>>>>> name. >>>>>> >>>>>> Hi Jan, I think the issue with putting the lineage as part of the >>>>>> representation is that that identifier only makes sense for that >>>>>> representation's engine. In your example, the catalog aliased as >>>>>> "iceberg" >>>>>> in spark is going to have a different name in Dremio or Trino. >>>>>> >>>>>> IMO, if we are to store a lineage for a view, it should consist of >>>>>> something engine agnostic like the table/view UUIDs. This would be >>>>>> stored >>>>>> at the view version level and not the representation level. I think as >>>>>> we >>>>>> get into more of these multi-engine, multi-catalog use cases for views, >>>>>> the >>>>>> Iceberg Catalog is going to need to do a better job at handling CRUD by >>>>>> UUID instead of engine specific identifiers. Another scenario we need to >>>>>> think through is a view that joins tables from two different catalogs. >>>>>> How >>>>>> would we represent the lineage for that in an engine agnostic way? >>>>>> >>>>>> Thanks >>>>>> Benny >>>>>> >>>>>> >>>>>> >>>>>> On Tue, Sep 10, 2024 at 7:21 AM Jan Kaul >>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid> wrote: >>>>>> >>>>>>> Thanks Walaa and Benny for clarifying the problem. I think I have a >>>>>>> better understanding now. Sorry for being a bit stubborn before. >>>>>>> >>>>>>> Wouldn't it make sense then to store the lineage as part of the >>>>>>> representation: >>>>>>> >>>>>>> { >>>>>>> >>>>>>> "type": "sql", >>>>>>> >>>>>>> "sql": "SELECT\n COUNT(1), CAST(event_ts AS DATE)\nFROM >>>>>>> events\nGROUP BY 2", >>>>>>> >>>>>>> "dialect": "spark", >>>>>>> >>>>>>> "lineage": [{ >>>>>>> >>>>>>> "identifier": { "catalog": "iceberg", "namespace": "public", >>>>>>> "table": "events"}, >>>>>>> >>>>>>> "uuid": "fa6506c3-7681-40c8-86dc-e36561f83385" >>>>>>> >>>>>>> }] >>>>>>> >>>>>>> } >>>>>>> >>>>>>> Best wishes, >>>>>>> >>>>>>> Jan >>>>>>> On 09.09.24 11:59, Walaa Eldin Moustafa wrote: >>>>>>> >>>>>>> Benny, thank you so much for performing the experiment. Glad that >>>>>>> using UUIDs as keys in the state map makes more sense now. >>>>>>> >>>>>>> For the issue with the view spec being restrictive, I agree and I >>>>>>> have raised the concern on the view spec PR last year [1]. I think >>>>>>> there is >>>>>>> some area of improvement here. At the least, if it is restrictive, it >>>>>>> should be explicitly stated. I will start a thread on how to approach >>>>>>> the >>>>>>> view spec. We may need to get more insight on the view spec before >>>>>>> finalizing the MV spec, because view spec will determine if we should >>>>>>> proceed with one lineage (with the implicitly assumed common catalog >>>>>>> name), >>>>>>> or with multiple lineages (one per engine or catalog name). >>>>>>> >>>>>>> [1] >>>>>>> https://github.com/apache/iceberg/pull/7992#issuecomment-1763172619 >>>>>>> >>>>>>> Thanks, >>>>>>> Walaa. >>>>>>> >>>>>>> >>>>>>> On Mon, Sep 9, 2024 at 3:28 AM Benny Chow <btc...@gmail.com> wrote: >>>>>>> >>>>>>>> Hi Walaa >>>>>>>> >>>>>>>> I did some testing with two different engines (Spark and Dremio) >>>>>>>> against the same Nessie catalog and created the attached materialized >>>>>>>> view >>>>>>>> metadata.json. I see your point now about the SQL identifiers being >>>>>>>> tightly coupled to the engines. In the metadata JSON, spark refers to >>>>>>>> the >>>>>>>> catalog as "SparkNessie", whereas Dremio refers to the catalog as >>>>>>>> "LocalNessie". So, this means that the fully qualified view and table >>>>>>>> identifiers are engine specific and Dremio can't lookup a Spark >>>>>>>> identifier >>>>>>>> and vice versa. >>>>>>>> >>>>>>>> *So, I think it does make sense now for the refresh-state to key >>>>>>>> off the UUIDs and not use engine specific identifiers. *This also >>>>>>>> means that the materization consumer will have to fully expand the >>>>>>>> query >>>>>>>> tree and basically diff the UUID + latest snapshot ids against the >>>>>>>> refresh >>>>>>>> state. Would it ever make sense for the Iceberg Catalog to expose a >>>>>>>> bulk >>>>>>>> lookup API by UUID? >>>>>>>> >>>>>>>> As a side note, it seems that for a materialized view to work with >>>>>>>> multiple engines, the default-catalog and default-namespace can't be >>>>>>>> used >>>>>>>> unless both engines use the same catalog name which seems pretty >>>>>>>> restrictive to me. >>>>>>>> >>>>>>>> Thanks for the great discussions >>>>>>>> Benny >>>>>>>> >>>>>>>> >>>>>>>> On Sat, Sep 7, 2024 at 2:49 AM Walaa Eldin Moustafa < >>>>>>>> wa.moust...@gmail.com> wrote: >>>>>>>> >>>>>>>>> Jan, we definitely can store SQL identifiers of multiple >>>>>>>>> representations in Approach 1. >>>>>>>>> >>>>>>>>> The takeaway is that SQL identifiers are highly coupled with >>>>>>>>> engines, just like views. It makes sense to track both together for >>>>>>>>> consistency. >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> Walaa. >>>>>>>>> >>>>>>>>> On Sat, Sep 7, 2024 at 8:15 AM Jan Kaul >>>>>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid> wrote: >>>>>>>>> >>>>>>>>>> Walaa, thanks you for bringing up this use case. I think we need >>>>>>>>>> to keep in mind that we require identifiers to interface with the >>>>>>>>>> catalog. >>>>>>>>>> We cannot use UUIDs. >>>>>>>>>> >>>>>>>>>> Which means you also wouldn't be able to use Approach 1 for your >>>>>>>>>> use case because you can't store the catalog names of multiple >>>>>>>>>> representations in the lineage. You would need to fallback to >>>>>>>>>> parsing the >>>>>>>>>> SQL for a particular representation and rebuilding the full query >>>>>>>>>> tree to >>>>>>>>>> obtain the identifiers. >>>>>>>>>> >>>>>>>>>> You could do the same for Approach 2. So I don't see why Approach >>>>>>>>>> 1 would yield any benefits. >>>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> >>>>>>>>>> Jan >>>>>>>>>> On 07.09.24 00:01, Steven Wu wrote: >>>>>>>>>> >>>>>>>>>> Benny, `default-catalog` is optional, while `default-namespace` >>>>>>>>>> is required. >>>>>>>>>> >>>>>>>>>> I will retract my comment on the `summary`. it indicates the >>>>>>>>>> engine that made the revision to the current view version. it doesn't >>>>>>>>>> really matter for multi-engine/representation support. >>>>>>>>>> >>>>>>>>>> On Fri, Sep 6, 2024 at 2:49 PM Benny Chow <btc...@gmail.com> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Steven - Ideally, the lineage is engine agnostic so I'd hope it >>>>>>>>>>> wouldn't have to be under a specific representation. >>>>>>>>>>> Walaa - That's a serious concern... If the same catalog is >>>>>>>>>>> aliased differently by two different engines, then the basic view >>>>>>>>>>> spec >>>>>>>>>>> seems broken to me since "default-namespace" includes the catalog >>>>>>>>>>> alias and >>>>>>>>>>> is outside of the SQL representation. Does that mean for a view to >>>>>>>>>>> be >>>>>>>>>>> interoperable, we require different engines to use the same catalog >>>>>>>>>>> name? >>>>>>>>>>> >>>>>>>>>>> Thanks >>>>>>>>>>> >>>>>>>>>>> On Fri, Sep 6, 2024 at 1:29 PM Steven Wu <stevenz...@gmail.com> >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> Walaa, thanks for bringing up the interesting case of multiple >>>>>>>>>>>> representations (for different engines), which definitely requires >>>>>>>>>>>> more >>>>>>>>>>>> discussion from the community. >>>>>>>>>>>> >>>>>>>>>>>> When I am looking at the view spec, I am seeing some conflict. >>>>>>>>>>>> "summary" field seems meant for only one engine, while >>>>>>>>>>>> "representations" >>>>>>>>>>>> support multiple engines. >>>>>>>>>>>> >>>>>>>>>>>> "summary" : { >>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-16> >>>>>>>>>>>> "engine-name" : "Spark", >>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-17> >>>>>>>>>>>> "engineVersion" : "3.3.2" >>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-18> }, >>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-19> >>>>>>>>>>>> "representations" : [ { >>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-20> >>>>>>>>>>>> "type" : "sql", >>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-21> >>>>>>>>>>>> "sql" : "SELECT\n COUNT(1), CAST(event_ts AS DATE)\nFROM >>>>>>>>>>>> events\nGROUP BY >>>>>>>>>>>> 2", <https://iceberg.apache.org/view-spec/#__codelineno-5-22> >>>>>>>>>>>> "dialect" : "spark" >>>>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-23> } ] >>>>>>>>>>>> >>>>>>>>>>>> With multiple representations/engines, I guess one engine will >>>>>>>>>>>> be responsible for the storage table refresh and other engines are >>>>>>>>>>>> read >>>>>>>>>>>> only. If we want to store the lineage info in the view, it probably >>>>>>>>>>>> needs to be part of the "representation" struct so that each >>>>>>>>>>>> engine/representation stores its own lineage info.. >>>>>>>>>>>> Who is to validate/ensure that the SQL representation is >>>>>>>>>>>> actually semantically identical (minus syntax differences across >>>>>>>>>>>> engines)? >>>>>>>>>>>> I guess this responsibility is left to the user who owns and >>>>>>>>>>>> manages the >>>>>>>>>>>> view. >>>>>>>>>>>> >>>>>>>>>>>>