Thanks Steven. I feel we have not completely agreed on the PR yet. I also think it will be a dependency to move forward here. Let me address the comments and get further feedback.
On Sun, Nov 10, 2024 at 10:46 AM Steven Wu <stevenz...@gmail.com> wrote: > I also feel good about moving forward with Jan's latest description. > > I would favor including catalog-name (optional), namespace, table name to > enhance readability. But either way it is not a blocker for me. > > Walaa, regarding your PR on view clarification [1], it is good to > clarify the expectations. But it can happen async and doesn't have to block > the MV given the conclusion on view identifiers. > > [1] https://github.com/apache/iceberg/pull/11365/ > > On Sat, Nov 9, 2024 at 7:06 AM Jan Kaul <jank...@mailbox.org.invalid> > wrote: > >> Hello everyone, >> >> Following the conclusion of the thread on clarifying the View spec [1], >> which established that identifiers are clearly defined in the View spec, >> I'd like to propose that we resolve this discussion and move forward with >> the materialized view spec. >> >> Even if we stay completely clear of identifiers, we have all the >> necessary components to implement materialized views. We can utilize SQL >> parsing to obtain catalog identifiers, allowing us to use a refresh-state >> that only contains the UUID and snapshot_id/version_ids. >> >> To enhance readability, we could optionally extend this solution to >> include storing the catalog-name (optional), namespace, and name of the >> entity in the refresh-state. I don't have a strong opinion about this. >> >> I'm eager to move the discussion forward and would appreciate any ideas >> on how to resolve this. If needed, we can schedule another online meeting >> to discuss in person. >> >> Best regards, Jan >> >> 1. >> https://lists.apache.org/list?dev@iceberg.apache.org:lte=1M:view%20spec%20improvements >> On 14.10.24 23:31, Walaa Eldin Moustafa wrote: >> >> On the table side, I would avoid encoding the SQL identifier in the >> refresh state at all since they are relevant to views and the view spec. >> >> On the view side, I would wait till we finalize the expectation from SQL >> table identifiers in the view spec. >> >> Thanks, >> Walaa. >> >> >> >> >> On Mon, Oct 14, 2024 at 1:25 PM Steven Wu <stevenz...@gmail.com> wrote: >> >>> I am favoring listing all of them in the refresh-state >>> - table/view UUID (required) >>> - table/view name (required) >>> - namespace (required) >>> - catalog name (optional) >>> >>> With catalog alias clarified, engines can use the table identifier >>> fields to load the table/view from the catalog. UUID is kept to guard >>> against the case table/view being deleted and recreated. >>> >>> >>> On Sun, Oct 13, 2024 at 11:22 PM Jan Kaul <jank...@mailbox.org.invalid> >>> <jank...@mailbox.org.invalid> wrote: >>> >>>> It looks like the discussion about adopting the View Spec concluded >>>> that the catalog alias problem is a configuration problem and that the View >>>> spec is clearly defined in terms of identifiers. >>>> >>>> With that, I would propose to not use any identifiers in the >>>> Materialized View Proposal and let the View Spec define how Identifiers are >>>> stored and resolved. I think using UUIDs gives us all the capabilities that >>>> we need and avoids adding additional complexity. >>>> >>>> Regarding adding the namespace and name to the refresh-state: I changed >>>> my mind and think that it would just be easier to stay clear from using any >>>> identifiers in the proposal. If you only store namespace and name, you >>>> could run into issues where you have collisions of tables from different >>>> catalogs. I know that this is just for documentation purposes, but I think >>>> I would be cleaner if we leave it out. >>>> >>>> Best wishes, >>>> >>>> Jan >>>> On 08.10.24 08:13, Walaa Eldin Moustafa wrote: >>>> >>>> Here is the thread for reference: >>>> >>>> https://lists.apache.org/thread/v8m1tpb91g740gmvqyphhjw37mpr8sl7 >>>> >>>> Thanks, >>>> Walaa. >>>> >>>> >>>> On Mon, Oct 7, 2024 at 10:41 PM Walaa Eldin Moustafa < >>>> wa.moust...@gmail.com> wrote: >>>> >>>>> Hi Everyone, >>>>> >>>>> I think we have hit a major blocker in the MV spec, and it boils down >>>>> to the gaps in the view spec due to its use of SQL identifiers. >>>>> >>>>> My opinion is to fix the view spec first (regardless of specific MV >>>>> requirements) in terms of how it should handle resolving SQL identifiers, >>>>> then apply that solution to the MV spec. The path forward will be clear >>>>> then, and the solution will be consistent for both MV and non-MV use >>>>> cases. >>>>> >>>>> Since the issues are pertinent to the view spec itself, I will be >>>>> starting a new thread to discuss the view-specific topic. We can get back >>>>> to this thread once we resolve the view spec issues. >>>>> >>>>> Thanks, >>>>> Walaa. >>>>> >>>>> >>>>> On Mon, Oct 7, 2024 at 10:12 AM Russell Spitzer < >>>>> russell.spit...@gmail.com> wrote: >>>>> >>>>>> I'm thinking of it slightly differently. >>>>>> >>>>>> You are allowed to have multiple catalogs (like in spark) >>>>>> But you are not allowed to have a view span multiple catalogs, if you >>>>>> want something like this use a catalog that supports federation. Even in >>>>>> the federated case you would not be required for the federated catalog to >>>>>> be your only catalog. >>>>>> >>>>>> On Mon, Oct 7, 2024 at 12:07 PM Steven Wu <stevenz...@gmail.com> >>>>>> wrote: >>>>>> >>>>>>> The implicit/template catalog name won't work when the view SQL >>>>>>> definition references Iceberg tables from multiple catalogs. >>>>>>> >>>>>>> If it is only a single catalog, we can even remove the >>>>>>> `{this_catalog}` template variable and just assume it is the current >>>>>>> catalog of view definition.. >>>>>>> >>>>>>> On Mon, Oct 7, 2024 at 9:31 AM Russell Spitzer < >>>>>>> russell.spit...@gmail.com> wrote: >>>>>>> >>>>>>>> For Catalog Naming, could we require that all representations use a >>>>>>>> template value when we store the SQL? This could at least solve the >>>>>>>> problem >>>>>>>> for inter Catalog views, which would solve everything if we have proper >>>>>>>> catalog federation in the future. For example, "SELECT * from >>>>>>>> catalog.foo.bar" would be represented as something like "SELECT * from >>>>>>>> {this_catalog}.foo.bar". Then engines could replace the >>>>>>>> "{this_catalog}" in >>>>>>>> their representation with the current identifier assigned to the >>>>>>>> catalog? >>>>>>>> >>>>>>>> A full example with Spark >>>>>>>> >>>>>>>> Spark with spark.sql.catalog.foo = .... >>>>>>>> creates a view "SELECT * FROM foo.bar.baz" >>>>>>>> Which is stored as >>>>>>>> "SELECT * from {this_catalog}.bar.baz" >>>>>>>> >>>>>>>> Then Spark attempting to read from spark.sql.catalog.fuu = ..... >>>>>>>> Reads "SELECT * FROM fuu.bar.baz" >>>>>>>> >>>>>>>> >>>>>>>> I would couple this with requiring all representations of the view >>>>>>>> using a full identifier and being templated if the engine requires it. >>>>>>>> >>>>>>>> I apologize if this has already been discussed >>>>>>>> >>>>>>>> >>>>>>>> On Fri, Oct 4, 2024 at 7:02 AM Jan Kaul >>>>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid> wrote: >>>>>>>> >>>>>>>>> I don't have a strong opinion on adding the namespace and name to >>>>>>>>> the refresh-state. Generally, I would be interested in other feedback >>>>>>>>> from >>>>>>>>> the community. >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> >>>>>>>>> Jan >>>>>>>>> On 02.10.24 19:19, Steven Wu wrote: >>>>>>>>> >>>>>>>>> Yeah, my original intention of adding namespace and table name is >>>>>>>>> to make the state more human readable, although they are not enough >>>>>>>>> as identifiers to load source tables/views from catalog. >>>>>>>>> >>>>>>>>> Jan and I also brought up the catalog name alias issue in today's >>>>>>>>> community sync. We are looking forward to more feedback from the >>>>>>>>> community. >>>>>>>>> >>>>>>>>> On Wed, Oct 2, 2024 at 10:10 AM Benny Chow <btc...@gmail.com> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Thanks for summarizing Jan! I agree with all your points. >>>>>>>>>> Steven suggested including the namespace and table name as part of >>>>>>>>>> the >>>>>>>>>> refresh-state table record. I think that might be a good addition >>>>>>>>>> too. >>>>>>>>>> >>>>>>>>>> Pros: >>>>>>>>>> - Improves readability of the refresh-state since table names are >>>>>>>>>> human consumable whereas UUIDs are opaque. >>>>>>>>>> - During refresh state validation, engine could also double check >>>>>>>>>> that the table name matches with what the UUID points to. >>>>>>>>>> Basically, helps >>>>>>>>>> with UUID validation in case UUID was not generated correctly. >>>>>>>>>> >>>>>>>>>> Cons: >>>>>>>>>> - Namespace + table cannot directly be used by validating engine >>>>>>>>>> to query the catalog due to the catalog alias issue. >>>>>>>>>> - Namespace + table could have dupes with cross catalog joins so >>>>>>>>>> UUID is still required here. >>>>>>>>>> >>>>>>>>>> Benny >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Tue, Oct 1, 2024 at 11:51 PM Jan Kaul >>>>>>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Let me try to summarize the current state. With the current >>>>>>>>>>> proposal we would introduce the *storage-table* pointer into >>>>>>>>>>> the view version and the *refresh-state* into the snapshot >>>>>>>>>>> summary. There was the discussion to add a lineage struct to the >>>>>>>>>>> view >>>>>>>>>>> version but we would refrain from that. More on that later. >>>>>>>>>>> >>>>>>>>>>> *Storage-table pointer:* >>>>>>>>>>> >>>>>>>>>>> Is a partial identifier containing namespace and name of the >>>>>>>>>>> referenced storage table. It is assumed that view and storage table >>>>>>>>>>> are in >>>>>>>>>>> the same catalog. >>>>>>>>>>> >>>>>>>>>>> *Refresh-state*: >>>>>>>>>>> >>>>>>>>>>> The refresh-state is a list of all tables/views referenced in >>>>>>>>>>> the view query containing their UUID and either snapshot-id or >>>>>>>>>>> version-id >>>>>>>>>>> at the time of the last refresh. It does not contain any part of the >>>>>>>>>>> identifier like namespace or name. >>>>>>>>>>> >>>>>>>>>>> You will find a more detailed description in the PR >>>>>>>>>>> <https://github.com/apache/iceberg/pull/11041>. >>>>>>>>>>> >>>>>>>>>>> *Obtaining the identifiers of the Source tables:* >>>>>>>>>>> >>>>>>>>>>> To determine the freshness of the storage-table, we require the >>>>>>>>>>> identifiers of the source tables. There are two possibilities being >>>>>>>>>>> discussed. >>>>>>>>>>> >>>>>>>>>>> 1. Parse view query to get identifiers >>>>>>>>>>> >>>>>>>>>>> 2. Store identifiers in lineage >>>>>>>>>>> >>>>>>>>>>> As mentioned before, the lineage suffers from the "catalog >>>>>>>>>>> alias" problem where different query engines refer to the same >>>>>>>>>>> catalog with >>>>>>>>>>> different names. This also applies to the View spec. Hence, storing >>>>>>>>>>> identifiers as part of the specification is currently undesirable. >>>>>>>>>>> However, >>>>>>>>>>> we could solve the catalog alias problem in the future and then >>>>>>>>>>> adding the >>>>>>>>>>> lineage to the specification would work. >>>>>>>>>>> >>>>>>>>>>> *Regarding backwards compatibility:* >>>>>>>>>>> >>>>>>>>>>> The main component for determining the freshness is the >>>>>>>>>>> refresh-state. It contains the UUIDs and the >>>>>>>>>>> snapshot-ids/version-ids of >>>>>>>>>>> the Source tables. The only question is how does the query engine >>>>>>>>>>> obtain >>>>>>>>>>> the identifiers. For now the proposal is to use SQL parsing. >>>>>>>>>>> >>>>>>>>>>> In my opinion introducing the lineage struct in the future does >>>>>>>>>>> not break backwards compatibility as query engines can still parse >>>>>>>>>>> the SQL >>>>>>>>>>> to obtain the identifiers. Query engines are not forced to use the >>>>>>>>>>> lineage >>>>>>>>>>> and can function just as before. However, certain consumers might >>>>>>>>>>> want to >>>>>>>>>>> use the lineage as they might not understand the SQL dialect. So the >>>>>>>>>>> lineage can be seen as an extension that would allow more >>>>>>>>>>> consumers. But it >>>>>>>>>>> does not necessarily affect the way a general query engines needs to >>>>>>>>>>> process a Materialized View. >>>>>>>>>>> >>>>>>>>>>> With that I find it reasonable to move ahead without the lineage >>>>>>>>>>> as part of the view metadata. >>>>>>>>>>> >>>>>>>>>>> Best wishes, >>>>>>>>>>> >>>>>>>>>>> Jan >>>>>>>>>>> On 02.10.24 05:59, Steven Wu wrote: >>>>>>>>>>> >>>>>>>>>>> Let me recap and see if we are on the same page. >>>>>>>>>>> 1) we have some consensus on the refresh-state on the storage >>>>>>>>>>> table. it would contain these fields: UUID, snapshot-id (for table) >>>>>>>>>>> or >>>>>>>>>>> version-id (for view), namespace, table. >>>>>>>>>>> 2) there is no consensus if lineage info is needed in the view >>>>>>>>>>> definition. we want to defer/postpone that decision. >>>>>>>>>>> >>>>>>>>>>> I have a little concern on postponing the decision on the >>>>>>>>>>> lineage info. It is not a blocker to move forward. But if later we >>>>>>>>>>> decided >>>>>>>>>>> to add it to the spec, it poses new requirements for the >>>>>>>>>>> writer/reader. >>>>>>>>>>> That would be a breaking change. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Tue, Oct 1, 2024 at 8:03 PM Benny Chow <btc...@gmail.com> >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hi Jan >>>>>>>>>>>> >>>>>>>>>>>> Both sound good to me. (No lineage in views and assumption >>>>>>>>>>>> about UUIDs being unique across catalogs). I hope we get to >>>>>>>>>>>> voting soon on >>>>>>>>>>>> your PR.. >>>>>>>>>>>> >>>>>>>>>>>> Thanks >>>>>>>>>>>> Benny >>>>>>>>>>>> >>>>>>>>>>>> On Sat, Sep 28, 2024 at 10:52 AM Jan Kaul >>>>>>>>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid> >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hi Benny, >>>>>>>>>>>>> >>>>>>>>>>>>> thanks for bringing up the UUID issue. It is my understanding >>>>>>>>>>>>> that UUIDs are designed to be used in distributed systems without >>>>>>>>>>>>> the need >>>>>>>>>>>>> for a central coordination process. This is in line with the >>>>>>>>>>>>> description >>>>>>>>>>>>> from the RFC4122 document >>>>>>>>>>>>> <https://datatracker.ietf.org/doc/html/rfc4122> where it >>>>>>>>>>>>> says: "One of the main reasons for using UUIDs is that no >>>>>>>>>>>>> centralized >>>>>>>>>>>>> authority is required to administer them". >>>>>>>>>>>>> >>>>>>>>>>>>> Wikipedia >>>>>>>>>>>>> <https://en.wikipedia.org/wiki/Universally_unique_identifier> >>>>>>>>>>>>> says: >>>>>>>>>>>>> >>>>>>>>>>>>> "While the probability that a UUID will be duplicated is not >>>>>>>>>>>>> zero, it is generally considered close enough to zero to be >>>>>>>>>>>>> negligible" >>>>>>>>>>>>> >>>>>>>>>>>>> Consequently, I would argue that it is a reasonable assumption >>>>>>>>>>>>> that UUIDs are unique across catalogs. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> Regarding the identifier/catalog-alias problem: As we can >>>>>>>>>>>>> fallback to SQL parsing and don't require the lineage, I would >>>>>>>>>>>>> propose to >>>>>>>>>>>>> move ahead without the lineage. Especially as this seems to be a >>>>>>>>>>>>> problem >>>>>>>>>>>>> with the View Spec that we can't solve now. If there is a demand >>>>>>>>>>>>> to add the >>>>>>>>>>>>> lineage in the future, once the catalog-alias problem has been >>>>>>>>>>>>> solved, we >>>>>>>>>>>>> can still add it then. >>>>>>>>>>>>> >>>>>>>>>>>>> Let me know your thoughts. >>>>>>>>>>>>> >>>>>>>>>>>>> Best wishes, >>>>>>>>>>>>> >>>>>>>>>>>>> Jan >>>>>>>>>>>>> On 28.09.24 07:45, Benny Chow wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> >> storing the lineage is an optimization that can avoid >>>>>>>>>>>>> recomputation/re-parsing. >>>>>>>>>>>>> I don't think having the lineage is optimizing much over >>>>>>>>>>>>> re-parsing the SQL. The most expensive part of SQL parsing is >>>>>>>>>>>>> catalog >>>>>>>>>>>>> access which has to happen with lineage anyway. Once the planner >>>>>>>>>>>>> has the >>>>>>>>>>>>> query tree, it can validate the freshness. It's not like the >>>>>>>>>>>>> planner needs >>>>>>>>>>>>> to complete logical and physical planning. >>>>>>>>>>>>> >>>>>>>>>>>>> >> We could also have the catalog name/alias problem for the >>>>>>>>>>>>> same engine. >>>>>>>>>>>>> Yes, this is a general problem with the Iceberg View spec. I >>>>>>>>>>>>> guess if two different Spark clusters wanted to share the same >>>>>>>>>>>>> view, they >>>>>>>>>>>>> best not reference the catalog name in their SQLs. Even then, >>>>>>>>>>>>> cross >>>>>>>>>>>>> catalog joins are not going to work. Again, these are problems >>>>>>>>>>>>> with the >>>>>>>>>>>>> Iceberg View spec. I think for the MV spec, as long as we don't >>>>>>>>>>>>> propose >>>>>>>>>>>>> something that involves SQL identifiers, then the MV spec isn't >>>>>>>>>>>>> making this >>>>>>>>>>>>> different engine problem worse. >>>>>>>>>>>>> >>>>>>>>>>>>> There's another issue I'd like to bring up about using UUIDs >>>>>>>>>>>>> which is that these UUIDs are client generated and there's no >>>>>>>>>>>>> validation >>>>>>>>>>>>> that they are indeed globally unique identifiers. The catalog >>>>>>>>>>>>> just >>>>>>>>>>>>> persists whatever it is given without validating that the UUIDs >>>>>>>>>>>>> are indeed >>>>>>>>>>>>> UUIDs and unique across the catalog. (I know Nessie is not doing >>>>>>>>>>>>> this >>>>>>>>>>>>> validation). We are assuming this UUID is not only unique >>>>>>>>>>>>> within a >>>>>>>>>>>>> catalog but is also unique across catalogs. Thoughts on this? >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks >>>>>>>>>>>>> Benny >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Wed, Sep 25, 2024 at 8:01 PM Steven Wu < >>>>>>>>>>>>> stevenz...@gmail.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> I agree that it is reasonable to assume/restrict view >>>>>>>>>>>>>> definition and storage table in the same catalog. Hence the >>>>>>>>>>>>>> storage table >>>>>>>>>>>>>> reference in the view metadata can include only namespace and >>>>>>>>>>>>>> table >>>>>>>>>>>>>> (excluding the engine dependent catalog name/alias). >>>>>>>>>>>>>> >>>>>>>>>>>>>> Regarding the question of having lineage metadata in view >>>>>>>>>>>>>> definition vs re-parsing SQL, I guess storing the lineage is an >>>>>>>>>>>>>> optimization that can avoid recomputation/re-parsing. would be >>>>>>>>>>>>>> good to have >>>>>>>>>>>>>> more input. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thinking about catalog name/alias again. For the same engine >>>>>>>>>>>>>> (like Spark), different applications/jobs may configure the >>>>>>>>>>>>>> catalog name >>>>>>>>>>>>>> differently. E.g. Spark catalogs are configured using >>>>>>>>>>>>>> properties under spark.sql.catalog.(catalog_name). We could >>>>>>>>>>>>>> also have the catalog name/alias problem for the same engine. >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Fri, Sep 20, 2024 at 12:16 AM Jan Kaul >>>>>>>>>>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid> >>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi Walaa, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> It appears that you would like to maintain the lineage >>>>>>>>>>>>>>> structure and not revert to parsing the SQL to obtain >>>>>>>>>>>>>>> identifiers. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Initially, one of the reasons for avoiding SQL parsing was >>>>>>>>>>>>>>> to enable consumers who don't understand the SQL dialect of any >>>>>>>>>>>>>>> representation to determine the freshness of the Materialized >>>>>>>>>>>>>>> View (MV). >>>>>>>>>>>>>>> However, with the "catalog alias" issue, having an identifier >>>>>>>>>>>>>>> for some >>>>>>>>>>>>>>> representation is insufficient, as the *catalog_name* is >>>>>>>>>>>>>>> unlikely to work for the consumer. Therefore, supporting >>>>>>>>>>>>>>> consumers that >>>>>>>>>>>>>>> don't use a query engine of any representation seems impossible. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Given this, parsing the SQL definition becomes a less >>>>>>>>>>>>>>> significant drawback, as the consumer must understand the >>>>>>>>>>>>>>> dialect anyway. >>>>>>>>>>>>>>> In fact, simply parsing the SQL definition seems like a more >>>>>>>>>>>>>>> robust and >>>>>>>>>>>>>>> straightforward solution than using a lineage for every >>>>>>>>>>>>>>> representation. I >>>>>>>>>>>>>>> believe this is why Benny suggested reverting to SQL parsing, >>>>>>>>>>>>>>> and I agree >>>>>>>>>>>>>>> with him. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Regarding the Storage table identifier: Its design as a >>>>>>>>>>>>>>> *PartialIdentifier* with only namespace and name fields was >>>>>>>>>>>>>>> intentional, to avoid the *catalog_name* issue. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Best regards, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Jan >>>>>>>>>>>>>>> On 19.09.24 23:16, Benny Chow wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> If Spark added the storage table identifier to the MV, I'm >>>>>>>>>>>>>>> not sure how it could also add a full identifier to the Dremio >>>>>>>>>>>>>>> representation. Spark doesn't know what name Dremio used for >>>>>>>>>>>>>>> the catalog. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> For the UX issue, I think Jan cleverly called it a >>>>>>>>>>>>>>> "PartialIdentifier" and not a "FullIdentifier" to indicate that >>>>>>>>>>>>>>> catalog >>>>>>>>>>>>>>> name is not even a property of the identifier. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Requirement 3 is for the view's SQL. I'm not sure there is >>>>>>>>>>>>>>> a very strong use case to put the storage table into a >>>>>>>>>>>>>>> different catalog >>>>>>>>>>>>>>> than the view. If we had an engine agnostic solution for it, >>>>>>>>>>>>>>> I'm all for >>>>>>>>>>>>>>> it though... >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>> Benny >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Thu, Sep 19, 2024 at 1:56 PM Walaa Eldin Moustafa < >>>>>>>>>>>>>>> wa.moust...@gmail.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I think the solution for the storage identifier might be >>>>>>>>>>>>>>>> shared with the end state solution for the lineage. One could >>>>>>>>>>>>>>>> imagine a >>>>>>>>>>>>>>>> "full identifier" can be used for the storage table; however, >>>>>>>>>>>>>>>> it is >>>>>>>>>>>>>>>> "representation"-dependent (i.e., it changes according to >>>>>>>>>>>>>>>> which representation it is part of, or rather which engine >>>>>>>>>>>>>>>> uses it). >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Also, are we asking engines (or their Iceberg >>>>>>>>>>>>>>>> implementation) to throw an exception if the full storage >>>>>>>>>>>>>>>> table identifier >>>>>>>>>>>>>>>> was provided as part of the MV definition? Sounds like a not >>>>>>>>>>>>>>>> very ideal UX. >>>>>>>>>>>>>>>> Note that it also conflicts with the spirit of requirement #3. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>> Walaa. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Thu, Sep 19, 2024 at 10:02 AM Benny Chow < >>>>>>>>>>>>>>>> btc...@gmail.com> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Hi Jan >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> "PartialIdentifier" without the catalog name sounds good >>>>>>>>>>>>>>>>> to me. The storage table and MV have to be in the same >>>>>>>>>>>>>>>>> catalog. That >>>>>>>>>>>>>>>>> would be a good fifth requirement to add to the list. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>>>> Benny >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Thu, Sep 19, 2024 at 1:27 AM Jan Kaul >>>>>>>>>>>>>>>>> <jank...@mailbox.org.invalid> >>>>>>>>>>>>>>>>> <jank...@mailbox.org.invalid> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Cool, I guess it's easier to resolve these kind of things >>>>>>>>>>>>>>>>>> when talking in person. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I agree with your requirements and the conclusion to use >>>>>>>>>>>>>>>>>> a map from UUID to snapshot-id/version-id as the >>>>>>>>>>>>>>>>>> refresh-state, as well as >>>>>>>>>>>>>>>>>> dropping the lineage in favor to just re-parsing the SQL >>>>>>>>>>>>>>>>>> query. This gets >>>>>>>>>>>>>>>>>> us around the "catalog alias" issue. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> And I'm also OK with every engine requiring their own >>>>>>>>>>>>>>>>>> representation to use the MV. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> There is still the issue with the identifier of the >>>>>>>>>>>>>>>>>> storage table and its catalog_name. Should we use an >>>>>>>>>>>>>>>>>> "PartialIdentifier" >>>>>>>>>>>>>>>>>> with a namespace and a name field, like so: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> namespace: ["bronze"], >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> name: "lineitem" >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> And require the storage table to be in the same catalog >>>>>>>>>>>>>>>>>> as the MV itself? >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Jan >>>>>>>>>>>>>>>>>> On 19.09.24 00:50, Benny Chow wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Steven and I met up yesterday at the Seattle Iceberg >>>>>>>>>>>>>>>>>> meetup and we got to talking about the "catalog alias" >>>>>>>>>>>>>>>>>> issue. He described >>>>>>>>>>>>>>>>>> it as an annoying problem =p >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I think there are some key requirements we need to >>>>>>>>>>>>>>>>>> support: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> 1. Different engines can produce and consume shared MVs >>>>>>>>>>>>>>>>>> with freshness validation. >>>>>>>>>>>>>>>>>> 2. We cannot force different engines to standardize on >>>>>>>>>>>>>>>>>> the alias they use for the catalog. >>>>>>>>>>>>>>>>>> 3. We cannot force different SQL representations to >>>>>>>>>>>>>>>>>> exclude catalog names from table identifiers or not use >>>>>>>>>>>>>>>>>> fully qualified >>>>>>>>>>>>>>>>>> table names. >>>>>>>>>>>>>>>>>> 4. MV SQL can join tables and views from multiple >>>>>>>>>>>>>>>>>> catalogs -> Inevitable with Nessie, Polaris, Unity, Tabular >>>>>>>>>>>>>>>>>> and others... >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> The producing engine has to save refresh state >>>>>>>>>>>>>>>>>> information to let consuming engine know that table X is at >>>>>>>>>>>>>>>>>> what snapshot >>>>>>>>>>>>>>>>>> at the time of materialization. The only way to identify >>>>>>>>>>>>>>>>>> this table across >>>>>>>>>>>>>>>>>> different catalog names is to use the cross catalog, >>>>>>>>>>>>>>>>>> globally unique UUID. >>>>>>>>>>>>>>>>>> I think our only option is to have the refresh state map >>>>>>>>>>>>>>>>>> UUID to snapshot >>>>>>>>>>>>>>>>>> ids and view version ids. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Assuming the above is how we store the refresh state, how >>>>>>>>>>>>>>>>>> does the consuming engine determine the current snapshot >>>>>>>>>>>>>>>>>> ids? The >>>>>>>>>>>>>>>>>> consuming engine will have to fully expand the query tree at >>>>>>>>>>>>>>>>>> which point it >>>>>>>>>>>>>>>>>> will have the UUIDs as well as the latest snapshot ids/view >>>>>>>>>>>>>>>>>> versions. This >>>>>>>>>>>>>>>>>> can then be diffed against the materialization refresh state >>>>>>>>>>>>>>>>>> to determine >>>>>>>>>>>>>>>>>> freshness. There isn't a need to store the view lineage >>>>>>>>>>>>>>>>>> information to map >>>>>>>>>>>>>>>>>> from UUID to the consumer specific identifier so that the >>>>>>>>>>>>>>>>>> consumer can then >>>>>>>>>>>>>>>>>> call back into the catalog with that identifier to get the >>>>>>>>>>>>>>>>>> latest state. >>>>>>>>>>>>>>>>>> The consuming engine might as well just re-parse the SQL and >>>>>>>>>>>>>>>>>> expand the >>>>>>>>>>>>>>>>>> query. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Personally, I'm OK with requiring that an engine must >>>>>>>>>>>>>>>>>> have its own SQL representation in order to use the MV. To >>>>>>>>>>>>>>>>>> me, being able >>>>>>>>>>>>>>>>>> to fulfill the key requirements above is much more important. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>>>>> Benny >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Sat, Sep 14, 2024 at 2:01 AM Jan Kaul >>>>>>>>>>>>>>>>>> <jank...@mailbox.org.invalid> >>>>>>>>>>>>>>>>>> <jank...@mailbox.org.invalid> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>