I think this type of discussion is exactly what motivates a clarification in the view spec so that we can resolve MV lineage. Will create separate thread for view spec clarification.
Following up on Jan’s point, yes I agree in order to support catalog name, it should be at the representation level, but catalog name does not really depend on the “dialect” but rather on the “engine”; hence the discussion becomes a little more involved. Thanks, Walaa. On Wed, Sep 11, 2024 at 1:11 PM Jan Kaul <jank...@mailbox.org.invalid> wrote: > Hi Benny, > > I think that identifiers only being defined for a certain representation > is exactly what we want. Each representation can define their own > identifiers that then map to an UUID. This way the "catalog_name" of the > identifier for a "Spark" dialect can be different then for a "Dremio" > dialect. > > The important part is that we still have a list of identifiers for each > representation that we can use with the catalog to obtain the state of the > source tables. > > Best wishes, > > Jan > On 11.09.24 01:33, Benny Chow wrote: > > Hi Walaa, I don't think the current view spec implicitly assumes a common > catalog name between engines. I tested this by not specifying the > default-catalog and both engines could look up the correct table under the > shared default-namespace even when each engine uses a different catalog > name. > > Hi Jan, I think the issue with putting the lineage as part of the > representation is that that identifier only makes sense for that > representation's engine. In your example, the catalog aliased as "iceberg" > in spark is going to have a different name in Dremio or Trino. > > IMO, if we are to store a lineage for a view, it should consist of > something engine agnostic like the table/view UUIDs. This would be stored > at the view version level and not the representation level. I think as we > get into more of these multi-engine, multi-catalog use cases for views, the > Iceberg Catalog is going to need to do a better job at handling CRUD by > UUID instead of engine specific identifiers. Another scenario we need to > think through is a view that joins tables from two different catalogs. How > would we represent the lineage for that in an engine agnostic way? > > Thanks > Benny > > > > On Tue, Sep 10, 2024 at 7:21 AM Jan Kaul <jank...@mailbox.org.invalid> > <jank...@mailbox.org.invalid> wrote: > >> Thanks Walaa and Benny for clarifying the problem. I think I have a >> better understanding now. Sorry for being a bit stubborn before. >> >> Wouldn't it make sense then to store the lineage as part of the >> representation: >> >> { >> >> "type": "sql", >> >> "sql": "SELECT\n COUNT(1), CAST(event_ts AS DATE)\nFROM events\nGROUP >> BY 2", >> >> "dialect": "spark", >> >> "lineage": [{ >> >> "identifier": { "catalog": "iceberg", "namespace": "public", >> "table": "events"}, >> >> "uuid": "fa6506c3-7681-40c8-86dc-e36561f83385" >> >> }] >> >> } >> >> Best wishes, >> >> Jan >> On 09.09.24 11:59, Walaa Eldin Moustafa wrote: >> >> Benny, thank you so much for performing the experiment. Glad that using >> UUIDs as keys in the state map makes more sense now. >> >> For the issue with the view spec being restrictive, I agree and I have >> raised the concern on the view spec PR last year [1]. I think there is some >> area of improvement here. At the least, if it is restrictive, it should be >> explicitly stated. I will start a thread on how to approach the view spec. >> We may need to get more insight on the view spec before finalizing the MV >> spec, because view spec will determine if we should proceed with one >> lineage (with the implicitly assumed common catalog name), or with multiple >> lineages (one per engine or catalog name). >> >> [1] https://github.com/apache/iceberg/pull/7992#issuecomment-1763172619 >> >> Thanks, >> Walaa. >> >> >> On Mon, Sep 9, 2024 at 3:28 AM Benny Chow <btc...@gmail.com> wrote: >> >>> Hi Walaa >>> >>> I did some testing with two different engines (Spark and Dremio) against >>> the same Nessie catalog and created the attached materialized view >>> metadata.json. I see your point now about the SQL identifiers being >>> tightly coupled to the engines. In the metadata JSON, spark refers to the >>> catalog as "SparkNessie", whereas Dremio refers to the catalog as >>> "LocalNessie". So, this means that the fully qualified view and table >>> identifiers are engine specific and Dremio can't lookup a Spark identifier >>> and vice versa. >>> >>> *So, I think it does make sense now for the refresh-state to key off the >>> UUIDs and not use engine specific identifiers. *This also means that >>> the materization consumer will have to fully expand the query tree and >>> basically diff the UUID + latest snapshot ids against the refresh state. >>> Would it ever make sense for the Iceberg Catalog to expose a bulk lookup >>> API by UUID? >>> >>> As a side note, it seems that for a materialized view to work with >>> multiple engines, the default-catalog and default-namespace can't be used >>> unless both engines use the same catalog name which seems pretty >>> restrictive to me. >>> >>> Thanks for the great discussions >>> Benny >>> >>> >>> On Sat, Sep 7, 2024 at 2:49 AM Walaa Eldin Moustafa < >>> wa.moust...@gmail.com> wrote: >>> >>>> Jan, we definitely can store SQL identifiers of multiple >>>> representations in Approach 1. >>>> >>>> The takeaway is that SQL identifiers are highly coupled with engines, >>>> just like views. It makes sense to track both together for consistency. >>>> >>>> Thanks, >>>> Walaa. >>>> >>>> On Sat, Sep 7, 2024 at 8:15 AM Jan Kaul <jank...@mailbox.org.invalid> >>>> <jank...@mailbox.org.invalid> wrote: >>>> >>>>> Walaa, thanks you for bringing up this use case. I think we need to >>>>> keep in mind that we require identifiers to interface with the catalog. We >>>>> cannot use UUIDs. >>>>> >>>>> Which means you also wouldn't be able to use Approach 1 for your use >>>>> case because you can't store the catalog names of multiple representations >>>>> in the lineage. You would need to fallback to parsing the SQL for a >>>>> particular representation and rebuilding the full query tree to obtain the >>>>> identifiers. >>>>> >>>>> You could do the same for Approach 2. So I don't see why Approach 1 >>>>> would yield any benefits. >>>>> >>>>> Thanks, >>>>> >>>>> Jan >>>>> On 07.09.24 00:01, Steven Wu wrote: >>>>> >>>>> Benny, `default-catalog` is optional, while `default-namespace` is >>>>> required. >>>>> >>>>> I will retract my comment on the `summary`. it indicates the engine >>>>> that made the revision to the current view version. it doesn't really >>>>> matter for multi-engine/representation support. >>>>> >>>>> On Fri, Sep 6, 2024 at 2:49 PM Benny Chow <btc...@gmail.com> wrote: >>>>> >>>>>> Steven - Ideally, the lineage is engine agnostic so I'd hope it >>>>>> wouldn't have to be under a specific representation. >>>>>> Walaa - That's a serious concern... If the same catalog is aliased >>>>>> differently by two different engines, then the basic view spec seems >>>>>> broken >>>>>> to me since "default-namespace" includes the catalog alias and is outside >>>>>> of the SQL representation. Does that mean for a view to be >>>>>> interoperable, >>>>>> we require different engines to use the same catalog name? >>>>>> >>>>>> Thanks >>>>>> >>>>>> On Fri, Sep 6, 2024 at 1:29 PM Steven Wu <stevenz...@gmail.com> >>>>>> wrote: >>>>>> >>>>>>> Walaa, thanks for bringing up the interesting case of multiple >>>>>>> representations (for different engines), which definitely requires more >>>>>>> discussion from the community. >>>>>>> >>>>>>> When I am looking at the view spec, I am seeing some conflict. >>>>>>> "summary" field seems meant for only one engine, while "representations" >>>>>>> support multiple engines. >>>>>>> >>>>>>> "summary" : { >>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-16> >>>>>>> "engine-name" : "Spark", >>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-17> >>>>>>> "engineVersion" : "3.3.2" >>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-18> }, >>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-19> >>>>>>> "representations" : [ { >>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-20> "type" : >>>>>>> "sql", <https://iceberg.apache.org/view-spec/#__codelineno-5-21> >>>>>>> "sql" : "SELECT\n COUNT(1), CAST(event_ts AS DATE)\nFROM events\nGROUP >>>>>>> BY >>>>>>> 2", <https://iceberg.apache.org/view-spec/#__codelineno-5-22> >>>>>>> "dialect" : "spark" >>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-23> } ] >>>>>>> >>>>>>> With multiple representations/engines, I guess one engine will be >>>>>>> responsible for the storage table refresh and other engines are read >>>>>>> only. >>>>>>> If we want to store the lineage info in the view, it probably needs to >>>>>>> be >>>>>>> part of the "representation" struct so that each engine/representation >>>>>>> stores its own lineage info.. >>>>>>> Who is to validate/ensure that the SQL representation is actually >>>>>>> semantically identical (minus syntax differences across engines)? I >>>>>>> guess >>>>>>> this responsibility is left to the user who owns and manages the view. >>>>>>> >>>>>>>