Hi Jan "PartialIdentifier" without the catalog name sounds good to me. The storage table and MV have to be in the same catalog. That would be a good fifth requirement to add to the list.
Thanks Benny On Thu, Sep 19, 2024 at 1:27 AM Jan Kaul <jank...@mailbox.org.invalid> wrote: > Cool, I guess it's easier to resolve these kind of things when talking in > person. > > I agree with your requirements and the conclusion to use a map from UUID > to snapshot-id/version-id as the refresh-state, as well as dropping the > lineage in favor to just re-parsing the SQL query. This gets us around the > "catalog alias" issue. > > And I'm also OK with every engine requiring their own representation to > use the MV. > > There is still the issue with the identifier of the storage table and its > catalog_name. Should we use an "PartialIdentifier" with a namespace and a > name field, like so: > > { > > namespace: ["bronze"], > > name: "lineitem" > > } > > And require the storage table to be in the same catalog as the MV itself? > > Thanks, > > Jan > On 19.09.24 00:50, Benny Chow wrote: > > Steven and I met up yesterday at the Seattle Iceberg meetup and we got to > talking about the "catalog alias" issue. He described it as an annoying > problem =p > > I think there are some key requirements we need to support: > > 1. Different engines can produce and consume shared MVs with freshness > validation. > 2. We cannot force different engines to standardize on the alias they use > for the catalog. > 3. We cannot force different SQL representations to exclude catalog names > from table identifiers or not use fully qualified table names. > 4. MV SQL can join tables and views from multiple catalogs -> Inevitable > with Nessie, Polaris, Unity, Tabular and others... > > The producing engine has to save refresh state information to let > consuming engine know that table X is at what snapshot at the time of > materialization. The only way to identify this table across different > catalog names is to use the cross catalog, globally unique UUID. I think > our only option is to have the refresh state map UUID to snapshot ids and > view version ids. > > Assuming the above is how we store the refresh state, how does the > consuming engine determine the current snapshot ids? The consuming engine > will have to fully expand the query tree at which point it will have the > UUIDs as well as the latest snapshot ids/view versions. This can then be > diffed against the materialization refresh state to determine freshness. > There isn't a need to store the view lineage information to map from UUID > to the consumer specific identifier so that the consumer can then call back > into the catalog with that identifier to get the latest state. The > consuming engine might as well just re-parse the SQL and expand the query. > > Personally, I'm OK with requiring that an engine must have its own SQL > representation in order to use the MV. To me, being able to fulfill the > key requirements above is much more important. > > Thanks > Benny > > On Sat, Sep 14, 2024 at 2:01 AM Jan Kaul <jank...@mailbox.org.invalid> > <jank...@mailbox.org.invalid> wrote: > >> How about we make the *catalog_name field* of the identifier optional? >> If the field is missing, it references a table/view in the same catalog. If >> it is present it has to be an engine agnostic catalog name. Shouldn't the >> catalog_names from the REST catalog spec be engine agnostic? >> >> I was wondering, is there no way to prescribe a catalog_name in Spark or >> Dremio? What do you do if you include two Nessie catalogs? They can't both >> be called LocalNessie. >> >> Thanks, >> >> Jan >> On 14.09.24 01:23, Benny Chow wrote: >> >> The main reason for putting the lineage into the view is so that >> "another" engine can enumerate out the tables in the view without needing >> to parse any SQL. But, if we put the lineage under the SQL representation >> with engine specific catalog names, the "other" engine is not going to be >> able to use those identifiers to look up the tables. The "other" engine >> can only lookup those identifiers using its engine specific catalog name. >> It may be possible to enumerate the tables at the view version level ONLY >> if those identifiers don't include the catalog name. However, if you have >> a view with a cross catalog join, then the tables coming from the other >> catalog have to be fully qualified. But then the problem is that each >> engine will also alias the other catalog differently too. >> >> So, I think to summarize *multi-engine* view interoperability: >> >> - default-catalog can't be specified >> - default-namespace can be specified >> - View SQL can only references tables/views from the same catalog >> >> I think these are reasonable constraints for multi-engine use cases. If >> reasonable, for MVs, then the storage table, refresh-state and lineage (at >> the view version level), could all be based on *engine agnostic* >> identifiers without the catalog name. The MV and storage table would have >> to be in the same catalog. >> >> Thanks >> Benny >> >> >> >> On Fri, Sep 13, 2024 at 2:08 AM Jan Kaul <jank...@mailbox.org.invalid> >> <jank...@mailbox.org.invalid> wrote: >> >>> Hi, >>> >>> regarding our recent discussion on table identifiers with respect to >>> different catalog_names with different query engines. We have the same >>> problem when we want to reference the storage table from the common view. >>> *If we include the catalog_name as part of the identifier, different >>> query engines might not be able to load the storage table. * >>> We could enforce that every storage table has to be part of the same >>> catalog as the main view. This way an identifier without the catalog_name >>> would be enough to point to the correct storage table. >>> >>> What are your thoughts on this? >>> >>> Best wishes, >>> >>> Jan >>> On 11.09.24 16:05, Walaa Eldin Moustafa wrote: >>> >>> I think this type of discussion is exactly what motivates a >>> clarification in the view spec so that we can resolve MV lineage. Will >>> create separate thread for view spec clarification. >>> >>> Following up on Jan’s point, yes I agree in order to support catalog >>> name, it should be at the representation level, but catalog name does not >>> really depend on the “dialect” but rather on the “engine”; hence the >>> discussion becomes a little more involved. >>> >>> Thanks, >>> Walaa. >>> >>> On Wed, Sep 11, 2024 at 1:11 PM Jan Kaul <jank...@mailbox.org.invalid> >>> <jank...@mailbox.org.invalid> wrote: >>> >>>> Hi Benny, >>>> >>>> I think that identifiers only being defined for a certain >>>> representation is exactly what we want. Each representation can define >>>> their own identifiers that then map to an UUID. This way the "catalog_name" >>>> of the identifier for a "Spark" dialect can be different then for a >>>> "Dremio" dialect. >>>> >>>> The important part is that we still have a list of identifiers for each >>>> representation that we can use with the catalog to obtain the state of the >>>> source tables. >>>> >>>> Best wishes, >>>> >>>> Jan >>>> On 11.09.24 01:33, Benny Chow wrote: >>>> >>>> Hi Walaa, I don't think the current view spec implicitly assumes a >>>> common catalog name between engines. I tested this by not specifying the >>>> default-catalog and both engines could look up the correct table under the >>>> shared default-namespace even when each engine uses a different catalog >>>> name. >>>> >>>> Hi Jan, I think the issue with putting the lineage as part of the >>>> representation is that that identifier only makes sense for that >>>> representation's engine. In your example, the catalog aliased as "iceberg" >>>> in spark is going to have a different name in Dremio or Trino. >>>> >>>> IMO, if we are to store a lineage for a view, it should consist of >>>> something engine agnostic like the table/view UUIDs. This would be stored >>>> at the view version level and not the representation level. I think as we >>>> get into more of these multi-engine, multi-catalog use cases for views, the >>>> Iceberg Catalog is going to need to do a better job at handling CRUD by >>>> UUID instead of engine specific identifiers. Another scenario we need to >>>> think through is a view that joins tables from two different catalogs. How >>>> would we represent the lineage for that in an engine agnostic way? >>>> >>>> Thanks >>>> Benny >>>> >>>> >>>> >>>> On Tue, Sep 10, 2024 at 7:21 AM Jan Kaul <jank...@mailbox.org.invalid> >>>> <jank...@mailbox.org.invalid> wrote: >>>> >>>>> Thanks Walaa and Benny for clarifying the problem. I think I have a >>>>> better understanding now. Sorry for being a bit stubborn before. >>>>> >>>>> Wouldn't it make sense then to store the lineage as part of the >>>>> representation: >>>>> >>>>> { >>>>> >>>>> "type": "sql", >>>>> >>>>> "sql": "SELECT\n COUNT(1), CAST(event_ts AS DATE)\nFROM >>>>> events\nGROUP BY 2", >>>>> >>>>> "dialect": "spark", >>>>> >>>>> "lineage": [{ >>>>> >>>>> "identifier": { "catalog": "iceberg", "namespace": "public", >>>>> "table": "events"}, >>>>> >>>>> "uuid": "fa6506c3-7681-40c8-86dc-e36561f83385" >>>>> >>>>> }] >>>>> >>>>> } >>>>> >>>>> Best wishes, >>>>> >>>>> Jan >>>>> On 09.09.24 11:59, Walaa Eldin Moustafa wrote: >>>>> >>>>> Benny, thank you so much for performing the experiment. Glad that >>>>> using UUIDs as keys in the state map makes more sense now. >>>>> >>>>> For the issue with the view spec being restrictive, I agree and I have >>>>> raised the concern on the view spec PR last year [1]. I think there is >>>>> some >>>>> area of improvement here. At the least, if it is restrictive, it should be >>>>> explicitly stated. I will start a thread on how to approach the view spec. >>>>> We may need to get more insight on the view spec before finalizing the MV >>>>> spec, because view spec will determine if we should proceed with one >>>>> lineage (with the implicitly assumed common catalog name), or with >>>>> multiple >>>>> lineages (one per engine or catalog name). >>>>> >>>>> [1] >>>>> https://github.com/apache/iceberg/pull/7992#issuecomment-1763172619 >>>>> >>>>> Thanks, >>>>> Walaa. >>>>> >>>>> >>>>> On Mon, Sep 9, 2024 at 3:28 AM Benny Chow <btc...@gmail.com> wrote: >>>>> >>>>>> Hi Walaa >>>>>> >>>>>> I did some testing with two different engines (Spark and Dremio) >>>>>> against the same Nessie catalog and created the attached materialized >>>>>> view >>>>>> metadata.json. I see your point now about the SQL identifiers being >>>>>> tightly coupled to the engines. In the metadata JSON, spark refers to >>>>>> the >>>>>> catalog as "SparkNessie", whereas Dremio refers to the catalog as >>>>>> "LocalNessie". So, this means that the fully qualified view and table >>>>>> identifiers are engine specific and Dremio can't lookup a Spark >>>>>> identifier >>>>>> and vice versa. >>>>>> >>>>>> *So, I think it does make sense now for the refresh-state to key off >>>>>> the UUIDs and not use engine specific identifiers. *This also means >>>>>> that the materization consumer will have to fully expand the query tree >>>>>> and >>>>>> basically diff the UUID + latest snapshot ids against the refresh state. >>>>>> Would it ever make sense for the Iceberg Catalog to expose a bulk lookup >>>>>> API by UUID? >>>>>> >>>>>> As a side note, it seems that for a materialized view to work with >>>>>> multiple engines, the default-catalog and default-namespace can't be used >>>>>> unless both engines use the same catalog name which seems pretty >>>>>> restrictive to me. >>>>>> >>>>>> Thanks for the great discussions >>>>>> Benny >>>>>> >>>>>> >>>>>> On Sat, Sep 7, 2024 at 2:49 AM Walaa Eldin Moustafa < >>>>>> wa.moust...@gmail.com> wrote: >>>>>> >>>>>>> Jan, we definitely can store SQL identifiers of multiple >>>>>>> representations in Approach 1. >>>>>>> >>>>>>> The takeaway is that SQL identifiers are highly coupled with >>>>>>> engines, just like views. It makes sense to track both together for >>>>>>> consistency. >>>>>>> >>>>>>> Thanks, >>>>>>> Walaa. >>>>>>> >>>>>>> On Sat, Sep 7, 2024 at 8:15 AM Jan Kaul >>>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid> wrote: >>>>>>> >>>>>>>> Walaa, thanks you for bringing up this use case. I think we need to >>>>>>>> keep in mind that we require identifiers to interface with the >>>>>>>> catalog. We >>>>>>>> cannot use UUIDs. >>>>>>>> >>>>>>>> Which means you also wouldn't be able to use Approach 1 for your >>>>>>>> use case because you can't store the catalog names of multiple >>>>>>>> representations in the lineage. You would need to fallback to parsing >>>>>>>> the >>>>>>>> SQL for a particular representation and rebuilding the full query tree >>>>>>>> to >>>>>>>> obtain the identifiers. >>>>>>>> >>>>>>>> You could do the same for Approach 2. So I don't see why Approach 1 >>>>>>>> would yield any benefits. >>>>>>>> >>>>>>>> Thanks, >>>>>>>> >>>>>>>> Jan >>>>>>>> On 07.09.24 00:01, Steven Wu wrote: >>>>>>>> >>>>>>>> Benny, `default-catalog` is optional, while `default-namespace` is >>>>>>>> required. >>>>>>>> >>>>>>>> I will retract my comment on the `summary`. it indicates the engine >>>>>>>> that made the revision to the current view version. it doesn't really >>>>>>>> matter for multi-engine/representation support. >>>>>>>> >>>>>>>> On Fri, Sep 6, 2024 at 2:49 PM Benny Chow <btc...@gmail.com> wrote: >>>>>>>> >>>>>>>>> Steven - Ideally, the lineage is engine agnostic so I'd hope it >>>>>>>>> wouldn't have to be under a specific representation. >>>>>>>>> Walaa - That's a serious concern... If the same catalog is >>>>>>>>> aliased differently by two different engines, then the basic view spec >>>>>>>>> seems broken to me since "default-namespace" includes the catalog >>>>>>>>> alias and >>>>>>>>> is outside of the SQL representation. Does that mean for a view to be >>>>>>>>> interoperable, we require different engines to use the same catalog >>>>>>>>> name? >>>>>>>>> >>>>>>>>> Thanks >>>>>>>>> >>>>>>>>> On Fri, Sep 6, 2024 at 1:29 PM Steven Wu <stevenz...@gmail.com> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Walaa, thanks for bringing up the interesting case of multiple >>>>>>>>>> representations (for different engines), which definitely requires >>>>>>>>>> more >>>>>>>>>> discussion from the community. >>>>>>>>>> >>>>>>>>>> When I am looking at the view spec, I am seeing some conflict. >>>>>>>>>> "summary" field seems meant for only one engine, while >>>>>>>>>> "representations" >>>>>>>>>> support multiple engines. >>>>>>>>>> >>>>>>>>>> "summary" : { >>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-16> >>>>>>>>>> "engine-name" : "Spark", >>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-17> >>>>>>>>>> "engineVersion" : "3.3.2" >>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-18> }, >>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-19> >>>>>>>>>> "representations" : [ { >>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-20> "type" >>>>>>>>>> : "sql", >>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-21> "sql" >>>>>>>>>> : "SELECT\n COUNT(1), CAST(event_ts AS DATE)\nFROM events\nGROUP BY >>>>>>>>>> 2", >>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-22> >>>>>>>>>> "dialect" : "spark" >>>>>>>>>> <https://iceberg.apache.org/view-spec/#__codelineno-5-23> } ] >>>>>>>>>> >>>>>>>>>> With multiple representations/engines, I guess one engine will be >>>>>>>>>> responsible for the storage table refresh and other engines are read >>>>>>>>>> only. >>>>>>>>>> If we want to store the lineage info in the view, it probably needs >>>>>>>>>> to be >>>>>>>>>> part of the "representation" struct so that each >>>>>>>>>> engine/representation >>>>>>>>>> stores its own lineage info.. >>>>>>>>>> Who is to validate/ensure that the SQL representation is actually >>>>>>>>>> semantically identical (minus syntax differences across engines)? I >>>>>>>>>> guess >>>>>>>>>> this responsibility is left to the user who owns and manages the >>>>>>>>>> view. >>>>>>>>>> >>>>>>>>>>