Correction of typo: both engines seem to set default-catalog to the view catalog if it is defined, or to null if the view catalog is not defined.
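For anyone skimming the thread, here is a minimal sketch (in Python) of the resolution rule being debated in the quoted messages: a table reference without a catalog qualifier uses default-catalog when it is set, and otherwise falls back to the catalog the view was loaded from. The names `resolve_identifier` and `known_catalogs` are hypothetical and not part of any Iceberg, Spark, or Trino API, and the way a catalog qualifier is detected here is an assumption made only for illustration.

```python
from typing import Optional, Sequence, Set, Tuple


def resolve_identifier(
    parts: Sequence[str],
    known_catalogs: Set[str],
    default_catalog: Optional[str],
    view_catalog: str,
) -> Tuple[str, ...]:
    """Qualify a table reference found in a view's SQL text (illustrative only)."""
    # Assumption: a reference is catalog-qualified when it has more than one
    # part and its first part names a catalog known to the engine.
    if len(parts) > 1 and parts[0] in known_catalogs:
        return tuple(parts)
    # Otherwise use default-catalog when set, else fall back to the view's catalog.
    catalog = default_catalog if default_catalog is not None else view_catalog
    return (catalog, *parts)
```

With `default_catalog=None` and the view loaded from `catalogA`, the reference `sales.orders` resolves to `catalogA.sales.orders` regardless of the session's current catalog, which is the case Jan's CREATE VIEW example in the quoted thread describes.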
On Mon, Apr 28, 2025 at 3:06 PM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote: > Hi Dan, > > Thanks again for your response. > > I agree that catalog renaming is an environmental event, but it's a real > one that happens frequently in practice. > Saying that the Iceberg spec cannot accommodate something as common as > catalog renaming feels very restrictive, and could make the spec less > practical, even unusable, for real-world deployments. > I’m sharing this from the perspective of a large data lake environment > where views are heavily deployed and operationalized. > > Further, it's worth noting that the table spec is resilient to catalog > renaming, but the view spec is not. If we have an opportunity to make the > view spec similarly resilient, I wonder why not? > Both specifications are deterministic in their definition, but one is more > fragile to environmental changes than the other. Improving resilience does > not sacrifice determinism. It simply makes views safer and more portable > over time. > > Separately, given that there is no SQL construct today to explicitly set > default-catalog at creation time, what is the intuition behind how engines > like Spark and Trino currently assign default-catalog? > Today, both engines seem to set default-catalog to null if the view > catalog is defined, or to the view catalog if not. > What was the intended thought process behind this behavior? > > Thanks, > Walaa > > On Mon, Apr 28, 2025 at 1:33 PM Daniel Weeks <dwe...@apache.org> wrote: > >> Walaa, >> >> > tables inside views remain reachable after a catalog rename >> >> This problem stems from the exact environmental/configuration issue that >> we should not be trying to address. I don't think we would expect >> references to survive a catalog rename. That's not something covered by >> the spec and needs to be handled separately as a platform-level migration >> specific to the affected environment. 
>> >> The identifier resolution logic is clear and deterministic. It should >> not matter whether an engine resolves and encodes the default-catalog or >> leaves it to the resolution rules. >> >> The issue isn't with how the spec is defined, but rather view behavior >> when you start altering the environment around it, which isn't something we >> should be trying to define here. >> >> -Dan >> >> On Mon, Apr 28, 2025 at 12:17 PM Walaa Eldin Moustafa < >> wa.moust...@gmail.com> wrote: >> >>> Hi Dan, >>> >>> Thanks for chiming in. >>> >>> I believe the issues we’re seeing now go beyond just catalog naming >>> consistency. The behavior around default-catalog itself introduces >>> resolution inconsistencies even when catalog names are consistent. >>> For example: >>> >>> * When default-catalog is set to null, tables inside views remain >>> reachable after a catalog rename. But if it is set to a non-null value, >>> table references will break. >>> >>> * default-catalog causes table references inside views to be early bound >>> (i.e., bound at view creation time, especially when using a non-null >>> value), while table references inside standalone queries are late bound >>> (bound at query time). This creates inconsistencies when resolving the same >>> table name inside and outside views, even within the same job. >>> >>> * It causes Spark's and Trino's behavior to drift from the spec. There is >>> no way to fully align Spark's behavior without making invasive changes to >>> the Spark SQL grammar and the View DataSource API (specifically on the >>> CREATE side). This challenge would extend to other engines too. Both Spark >>> and Trino set this field based on a heuristic in today's implementation. >>> >>> * With view nesting (views depending on views), these inconsistencies >>> amplify further, forcing users and engines to reason about catalog >>> resolution at every level in the view tree. >>> >>> * It will be difficult to migrate Hive views to Iceberg with that model.
>>> Migrated Hive views will have to unfollow that spec. >>> >>> How would you suggest approaching the engine-level changes required to >>> support the current default-catalog field? >>> Also, do you believe the Spark and Trino communities would align around >>> having table resolution behave inconsistently between queries and views, or >>> inconsistency between Iceberg and other types of views? >>> >>> Thanks, >>> Walaa >>> >>> >>> On Mon, Apr 28, 2025 at 11:34 AM Daniel Weeks <dwe...@apache.org> wrote: >>> >>>> I would agree with Jan's summary of why 'default-catalog' was >>>> introduced, but I think we need to step back and align on what we are >>>> really attempting to support in the spec. >>>> >>>> The issues we're discussing largely stem from using multiple engines >>>> with cross catalog references and configurations where catalog names are >>>> not aligned. If we have multiple engines that all have the same catalog >>>> names/configurations, the current spec implementation is well defined for >>>> table resolution even across catalogs. The 'default-catalog' (and >>>> namespace equivalent) was intended to address the resolution within the >>>> context of the sql text, not to address catalog/naming inconsistencies. >>>> >>>> I feel like we're trying to adapt the original intent to address the >>>> catalog naming/configuration and would argue that we shouldn't attempt to >>>> do that as part of the spec. Inconsistently named catalogs are a reality, >>>> but we should consider that a configuration/environmental issue, not >>>> something to solve for in the spec. >>>> >>>> We should support and advocate for consistency in catalog naming and >>>> define the spec along those lines. The fact is that with all of the recent >>>> work that's gone into making catalogs pluggable, it makes more sense to >>>> just register catalog configuration with consistent names (even if you have >>>> to duplicate the configuration for supporting existing readers/writers). 
I >>>> think it's better to provide a path toward consistency than to normalize >>>> complicated schemes to work around the issues caused by >>>> environmental/configuration inconsistencies. >>>> >>>> If the goal is to create clever ways to hack the late binding >>>> resolution to swap in different catalogs or make references contextual, I >>>> feel like that is something we should strongly discourage as it leads to >>>> confusion about what is resolved as part of the query. >>>> >>>> At this point, I don't see a good argument to add >>>> additional configuration or change the resolution behaviors. >>>> >>>> -Dan >>>> >>>> >>>> >>>> On Mon, Apr 28, 2025 at 12:40 AM Jan Kaul <jank...@mailbox.org.invalid> >>>> wrote: >>>> >>>>> I think the intention with the "default-catalog" was that every query >>>>> engine uses it to store its session default catalog at the time of >>>>> creating >>>>> the view. This way the view could be reused in another session. The idea >>>>> was not to introduce an additional SQL syntax to set the default-catalog. >>>>> >>>>> Generally we have different environments we want to support with the >>>>> view spec: >>>>> >>>>> 1. Consistent catalog naming >>>>> >>>>> When the environment supports it, using consistent catalog names can >>>>> have a great benefit for multi-catalog, multi-engine setups. With >>>>> consistent catalog names, using the "default-catalog" field works without >>>>> any issues. >>>>> >>>>> 2. Inconsistent catalog naming >>>>> >>>>> This can be the case when different query engines refer to the same >>>>> physical catalog by different names. This often happens because different >>>>> query engines use different strategies to set up the catalogs. If catalogs >>>>> have inconsistent naming, using the "default-catalog" field does not work >>>>> because it is not guaranteed that the catalog name can be resolved with >>>>> another engine.
Using the "view catalog" as a fallback is a better >>>>> solution >>>>> for this use case, as it avoids catalog names altogether. It is however >>>>> limited to table references in the same catalog. >>>>> >>>>> >>>>> What do you think of introducing a view property that specifies if the >>>>> "default-catalog" or the "view catalog" should be used? This way, you >>>>> could >>>>> use the "default-catalog" in environments where you can guarantee >>>>> consistent naming, but you would be able to directly fallback to the >>>>> "view-catalog" when you don't have consistent naming. The query engines >>>>> could set the default for this view property at creation time. Spark for >>>>> example could set it to automatically use the "view catalog". >>>>> >>>>> Thanks >>>>> >>>>> Jan >>>>> >>>>> >>>>> On 4/26/25 05:33, Walaa Eldin Moustafa wrote: >>>>> >>>>> To help folks catch up on the latest discussions and interpretation of >>>>> the spec, I have summarized everything we discussed so far at the top of >>>>> the proposal document (here >>>>> <https://docs.google.com/document/d/1-I2v_OqBgJi_8HVaeH1u2jowghmXoB8XaJLzPBa_Hg8/edit?tab=t.0>). >>>>> I have slightly updated the proposal to be in sync with the new >>>>> interpretation to avoid confusion. In summary: >>>>> >>>>> * Remove default-catalog and default-namespace fields from the view >>>>> spec completely. >>>>> >>>>> * Hence, we do not attempt to define separate view-level default >>>>> catalogs or namespaces. >>>>> >>>>> Instead: >>>>> >>>>> * If a table identifier inside a view lacks a catalog qualifier, >>>>> engines should resolve it using the current engine catalog at query time. >>>>> >>>>> * Reference table identifiers in the metadata exactly as they appear >>>>> in the view SQL text. >>>>> >>>>> * If an identifier lacks the catalog part at creation, it should still >>>>> lack a catalog in the stored metadata. >>>>> >>>>> * Store UUIDs alongside table identifiers whenever possible. >>>>> >>>>> Thanks, >>>>> Walaa. 
>>>>> >>>>> >>>>> On Fri, Apr 25, 2025 at 5:18 PM Walaa Eldin Moustafa < >>>>> wa.moust...@gmail.com> wrote: >>>>> >>>>>> Thanks for the contribution Benny! +1 to the confusion the fallback >>>>>> creates. Also just to be clear, at this point and after clarifying the >>>>>> current spec intentions, I am convinced that we should remove the default >>>>>> catalog and default namespace fields altogether. >>>>>> >>>>>> Thanks, >>>>>> Walaa. >>>>>> >>>>>> On Fri, Apr 25, 2025 at 5:13 PM Benny Chow <btc...@gmail.com> wrote: >>>>>> >>>>>>> I'd like to contribute my opinions on this: >>>>>>> >>>>>>> - I don't particularly like the current behavior of "default to the >>>>>>> view's catalog when default-catalog is not set". Fundamentally, I >>>>>>> believe >>>>>>> the intent of default-catalog and default-namespace is there to help >>>>>>> users >>>>>>> write more concise SQL. >>>>>>> - spark session catalog is engine specific and I don't think we >>>>>>> should design something that says first use this catalog, then that >>>>>>> catalog.. or that catalog. For example, resolving identifiers using >>>>>>> default-catalog -> view's catalog -> session catalog is not good. >>>>>>> - We gotta support non-Iceberg tables otherwise I see no value in >>>>>>> putting views in the catalog to share with other engines >>>>>>> - Interoperability between different engine types is very hard due >>>>>>> to dialect issues... so I think we should focus on supporting different >>>>>>> clusters of the same engine type on a shared catalog. For example, AI >>>>>>> and >>>>>>> BI clusters on Spark sharing the same views in a REST catalog. 
>>>>>>> >>>>>>> Coincidentally, I think the ultimate solution is along the lines of >>>>>>> something Russell proposed last year: >>>>>>> >>>>>>> https://lists.apache.org/thread/hoskfx8y3kvrcww52l4w9dxghp3pnlm7 >>>>>>> >>>>>>> We've been looking at this interoperable identifier problem through >>>>>>> the lens of catalog resolution but maybe the right approach is really >>>>>>> about >>>>>>> templating. >>>>>>> >>>>>>> I would extend Russell's idea to allow identifiers in a view to span >>>>>>> catalogs to support non-Iceberg tables. Also, the default-catalog >>>>>>> property could be templated as well. >>>>>>> >>>>>>> Thoughts? >>>>>>> Benny >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Fri, Apr 25, 2025 at 4:02 PM Walaa Eldin Moustafa < >>>>>>> wa.moust...@gmail.com> wrote: >>>>>>> >>>>>>>> Thanks Steven! How do you recommend making Spark implementation >>>>>>>> conform to the spec? Do we need Spark SQL extensions and/or Spark >>>>>>>> catalog >>>>>>>> APIs for that? >>>>>>>> >>>>>>>> How do you recommend reconciling the inconsistencies I shared >>>>>>>> regarding many resolution methods not consistently being followed in >>>>>>>> different scenarios (view vs child table resolution, query vs view >>>>>>>> resolution)? Note these occur when the default catalog is set to a >>>>>>>> non-null >>>>>>>> value. If it helps, I can share concrete examples. >>>>>>>> >>>>>>>> Thanks, >>>>>>>> Walaa. >>>>>>>> >>>>>>>> On Fri, Apr 25, 2025 at 3:52 PM Steven Wu <stevenz...@gmail.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> The core issue is on the fall back behavior when `default-catalog` >>>>>>>>> is >>>>>>>>> not defined. Current view spec says the fallback should be the >>>>>>>>> catalog >>>>>>>>> where the view is defined. It doesn't really matter what the >>>>>>>>> catalog >>>>>>>>> is named (catalogX) by the read engine. >>>>>>>>> - If a view refers to the tables in the same catalog, this is a >>>>>>>>> non-ambiguous and reasonable fallback behavior. 
>>>>>>>>> - If a view refers to tables from another catalog, catalog names >>>>>>>>> should be included in the reference name already. So no ambiguity >>>>>>>>> there either. >>>>>>>>> >>>>>>>>> Potential inconsistent naming of catalog is a separate problem, >>>>>>>>> which >>>>>>>>> Iceberg view spec probably cannot solve. We can only recommend that >>>>>>>>> catalog should be named consistently across usage for better >>>>>>>>> interoperability on name references. >>>>>>>>> >>>>>>>>> This proposal is to change the fallback behavior to engine's >>>>>>>>> session >>>>>>>>> default catalog. I am not sure it is better than the current >>>>>>>>> fallback >>>>>>>>> behavior. >>>>>>>>> >>>>>>>>> > Today’s Spark behavior explicitly differs from this idea. Spark >>>>>>>>> resolves table identifiers during view creation using the session’s >>>>>>>>> default >>>>>>>>> catalog, not a supplied `default-catalog`. >>>>>>>>> >>>>>>>>> I would argue that is a Spark implementation issue for not >>>>>>>>> conforming >>>>>>>>> to the spec. >>>>>>>>> >>>>>>>>> >>>>>>>>> On Fri, Apr 25, 2025 at 1:17 PM Walaa Eldin Moustafa >>>>>>>>> <wa.moust...@gmail.com> wrote: >>>>>>>>> > >>>>>>>>> > Hi Jan, >>>>>>>>> > >>>>>>>>> > Thanks again for continuing the discussion. I want to highlight >>>>>>>>> a few fundamental issues around the interpretation of default-catalog: >>>>>>>>> > >>>>>>>>> > Here is the real catch: >>>>>>>>> > >>>>>>>>> > * default-catalog cannot logically be defined at view creation >>>>>>>>> time. It would be circular: the view needs to exist before its >>>>>>>>> metadata >>>>>>>>> (and hence default-catalog) can exist. This is visible in Spark’s >>>>>>>>> implementation, where `default-catalog` is not used. >>>>>>>>> > >>>>>>>>> > * Introducing a creation-time default-catalog setting would >>>>>>>>> require extending SQL syntax and engine APIs to promote it to a >>>>>>>>> first-class >>>>>>>>> view concept. 
This would be intrusive, non-intuitive, and >>>>>>>>> realistically >>>>>>>>> very difficult to standardize across engines. >>>>>>>>> > >>>>>>>>> > * Today’s Spark behavior explicitly differs from this idea. >>>>>>>>> Spark resolves table identifiers during view creation using the >>>>>>>>> session’s >>>>>>>>> default catalog, not a supplied `default-catalog`. >>>>>>>>> > >>>>>>>>> > * Hypothetically, even if we patched in a creation-time >>>>>>>>> default-catalog, it would create an inconsistent binding model between >>>>>>>>> tables vs views (early vs late), and between tables in views and in >>>>>>>>> queries >>>>>>>>> (again early vs late). For example, views and tables in queries can >>>>>>>>> withstand default catalog renames, but tables cannot when they are >>>>>>>>> used >>>>>>>>> inside views -- it even applies to views inside views, which makes >>>>>>>>> this >>>>>>>>> very hard to reason about considering nesting. >>>>>>>>> > >>>>>>>>> > Thanks, >>>>>>>>> > Walaa >>>>>>>>> > >>>>>>>>> > On Fri, Apr 25, 2025 at 7:00 AM Jan Kaul >>>>>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid> wrote: >>>>>>>>> >> >>>>>>>>> >> @Walaa: >>>>>>>>> >> >>>>>>>>> >> I would argue that when you run a CREATE VIEW statement the >>>>>>>>> query engine knows which catalog the view is being created in. So >>>>>>>>> even >>>>>>>>> though we typically use late binding to resolve the view catalog at >>>>>>>>> query >>>>>>>>> time, it can also be used at creation time. >>>>>>>>> >> >>>>>>>>> >> The query engine would need to keep track of the "view catalog" >>>>>>>>> where the view is going to be created. It can use that catalog to >>>>>>>>> resolve partial table identifiers if "default-catalog" is not set. >>>>>>>>> >> >>>>>>>>> >> It can lead to some unintuitive behavior, where partial >>>>>>>>> identifiers in the view query resolve to a different catalog compared >>>>>>>>> to >>>>>>>>> using them outside of a view.
>>>>>>>>> >> >>>>>>>>> >> CREATE VIEW catalogA.sales.monthly_orders AS SELECT * from >>>>>>>>> sales.orders; >>>>>>>>> >> >>>>>>>>> >> If the session default catalog is not "catalogA", the >>>>>>>>> "sales.orders" in the view query would not be the same as just >>>>>>>>> referencing >>>>>>>>> "sales.orders" in a normal SQL statement. This is because without a >>>>>>>>> "default-catalog", the catalog name of "sales.orders" would default to >>>>>>>>> "catalogA", which is the view's catalog. >>>>>>>>> >> >>>>>>>>> >> Thanks, >>>>>>>>> >> >>>>>>>>> >> Jan >>>>>>>>> >> >>>>>>>>> >> On 4/25/25 04:05, Manu Zhang wrote: >>>>>>>>> >>> >>>>>>>>> >>> For example, if we want to validate that the tables referenced >>>>>>>>> in the view exist, how can we do that when default-catalog isn't >>>>>>>>> defined, >>>>>>>>> since the view hasn't been created or loaded yet? >>>>>>>>> >> >>>>>>>>> >> I don't think this is related to view spec. How do we validate >>>>>>>>> that a table exists without a default catalog, or do we always use the >>>>>>>>> current session catalog? >>>>>>>>> >> >>>>>>>>> >> Thanks, >>>>>>>>> >> Manu >>>>>>>>> >> >>>>>>>>> >> On Fri, Apr 25, 2025 at 5:59 AM Walaa Eldin Moustafa < >>>>>>>>> wa.moust...@gmail.com> wrote: >>>>>>>>> >>> >>>>>>>>> >>> Hi Jan, >>>>>>>>> >>> >>>>>>>>> >>> I think we still share the same understanding. Just to >>>>>>>>> clarify: when I referred to late binding as “similar” to the >>>>>>>>> proposal, I >>>>>>>>> was acknowledging the distinction between view-level and table-level >>>>>>>>> resolution. But as you noted, both follow a late binding model. >>>>>>>>> >>> >>>>>>>>> >>> That said, this still raises an interesting question and a >>>>>>>>> potential gap: if default-catalog is only defined at query time, how >>>>>>>>> should >>>>>>>>> resolution work during view creation? 
For example, if we want to >>>>>>>>> validate >>>>>>>>> that the tables referenced in the view exist, how can we do that when >>>>>>>>> default-catalog isn't defined, since the view hasn't been created or >>>>>>>>> loaded >>>>>>>>> yet? >>>>>>>>> >>> >>>>>>>>> >>> Thanks, >>>>>>>>> >>> Walaa. >>>>>>>>> >>> >>>>>>>>> >>> On Thu, Apr 24, 2025 at 7:02 AM Jan Kaul >>>>>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid> wrote: >>>>>>>>> >>>> >>>>>>>>> >>>> Yes, I have the same understanding. The view catalog is >>>>>>>>> resolved at query time. >>>>>>>>> >>>> >>>>>>>>> >>>> As you mentioned before, it's good to distinguish between the >>>>>>>>> physical catalog and its reference used in SQL statements. The >>>>>>>>> important >>>>>>>>> part is that the physical catalog of the view and the tables >>>>>>>>> referenced in >>>>>>>>> its definition stay consistent. You could create a view in a given >>>>>>>>> physical catalog by referring to it as "catalogA", as in your first >>>>>>>>> point. >>>>>>>>> If you then, given a different setup, refer to the same physical >>>>>>>>> catalog as >>>>>>>>> "catalogB" in another session/environment, the behavior should still >>>>>>>>> work. >>>>>>>>> >>>> >>>>>>>>> >>>> I would however rephrase your last point. Late binding >>>>>>>>> applies to the view catalog name and by extension to all partial table >>>>>>>>> references when no "default-catalog" is present. Resolving the view >>>>>>>>> catalog >>>>>>>>> name at query time is not opposed to storing the view metadata in a >>>>>>>>> catalog. >>>>>>>>> >>>> >>>>>>>>> >>>> Or maybe I don't entirely understand what you mean. >>>>>>>>> >>>> >>>>>>>>> >>>> Thanks >>>>>>>>> >>>> >>>>>>>>> >>>> Jan >>>>>>>>> >>>> >>>>>>>>> >>>> On 4/24/25 00:32, Walaa Eldin Moustafa wrote: >>>>>>>>> >>>> >>>>>>>>> >>>> Hi Jan, >>>>>>>>> >>>> >>>>>>>>> >>>> > The view is executed when it's being referenced in a SQL >>>>>>>>> statement.
That statement contains the information for the query >>>>>>>>> engine to >>>>>>>>> resolve the catalog of the view. >>>>>>>>> >>>> >>>>>>>>> >>>> If I’m understanding correctly, that means: >>>>>>>>> >>>> >>>>>>>>> >>>> * If the view is queried as SELECT * FROM >>>>>>>>> catalogA.namespace.view, then catalogA is considered the view’s >>>>>>>>> catalog. >>>>>>>>> >>>> >>>>>>>>> >>>> * If the same view is later queried as SELECT * FROM >>>>>>>>> catalogB.namespace.view (after renaming catalogA to catalogB, and >>>>>>>>> keeping >>>>>>>>> everything else the same), then catalogB becomes the view’s catalog. >>>>>>>>> >>>> >>>>>>>>> >>>> Is that interpretation correct? If so, it sounds to me like >>>>>>>>> the catalog is resolved at query time, based on how the view is >>>>>>>>> referenced, >>>>>>>>> not from any stored metadata. That would imply some sort of a late >>>>>>>>> binding >>>>>>>>> behavior (similar to the proposal), as opposed to using some catalog >>>>>>>>> that >>>>>>>>> "stores" the view definition. >>>>>>>>> >>>> >>>>>>>>> >>>> Thanks, >>>>>>>>> >>>> Walaa >>>>>>>>> >>>> >>>>>>>>> >>>> On Tue, Apr 22, 2025 at 11:01 AM Jan Kaul >>>>>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid> wrote: >>>>>>>>> >>>>> >>>>>>>>> >>>>> Hi Walaa, >>>>>>>>> >>>>> >>>>>>>>> >>>>> Thanks for clarifying the aspects of non-determinism. Let me >>>>>>>>> try to address your questions. >>>>>>>>> >>>>> >>>>>>>>> >>>>> 1. This is my interpretation of the current spec: The view >>>>>>>>> is executed when it's being referenced in a SQL statement. That >>>>>>>>> statement >>>>>>>>> contains the information for the query engine to resolve the catalog >>>>>>>>> of the >>>>>>>>> view. The query engine then uses that information to fetch the view >>>>>>>>> metadata from the catalog. It also needs to temporarily keep track of >>>>>>>>> which >>>>>>>>> catalog it used to fetch the view metadata. 
It can then use that >>>>>>>>> information to resolve the table references in the view's SQL >>>>>>>>> definition in >>>>>>>>> case no default catalog is specified. >>>>>>>>> >>>>> >>>>>>>>> >>>>> 2. The important part is that the catalog can be referenced >>>>>>>>> at execution time. As long as that's the case I would assume the view >>>>>>>>> can >>>>>>>>> be created in any catalog. >>>>>>>>> >>>>> >>>>>>>>> >>>>> >>>>>>>>> >>>>> I think your point is really valuable because the current >>>>>>>>> specification can lead to some unintuitive behavior. For example, for >>>>>>>>> the >>>>>>>>> following statement: >>>>>>>>> >>>>> >>>>>>>>> >>>>> CREATE VIEW catalogA.sales.monthly_orders AS SELECT * from >>>>>>>>> sales.orders; >>>>>>>>> >>>>> >>>>>>>>> >>>>> If the session default catalog is not "catalogA", the >>>>>>>>> "sales.orders" in the view query would not be the same as just >>>>>>>>> referencing >>>>>>>>> "sales.orders" in a normal SQL statement. This is because without a >>>>>>>>> "default-catalog", the catalog name of "sales.orders" would default to >>>>>>>>> "catalogA". >>>>>>>>> >>>>> >>>>>>>>> >>>>> >>>>>>>>> >>>>> However, I like the current design of the view spec, because >>>>>>>>> it has the "closure" property. Because the "view >>>>>>>>> catalog" >>>>>>>>> has to be known when executing a view, all the information required to >>>>>>>>> resolve the table identifiers is contained in the view metadata (and >>>>>>>>> the >>>>>>>>> "view catalog"). I think that if you make the identifier resolution >>>>>>>>> dependent on external parameters, it hinders portability. >>>>>>>>> >>>>> >>>>>>>>> >>>>> Thanks, >>>>>>>>> >>>>> >>>>>>>>> >>>>> Jan >>>>>>>>> >>>>> >>>>>>>>> >>>>> On 4/22/25 18:36, Walaa Eldin Moustafa wrote: >>>>>>>>> >>>>> >>>>>>>>> >>>>> Hi Jan, >>>>>>>>> >>>>> >>>>>>>>> >>>>> Thanks for the thoughtful feedback.
>>>>>>>>> >>>>> >>>>>>>>> >>>>> I think it’s important we clarify a key point before going >>>>>>>>> deeper: >>>>>>>>> >>>>> >>>>>>>>> >>>>> Non-determinism is not caused by session fallback >>>>>>>>> behavior—it’s a fundamental limitation of using table identifiers >>>>>>>>> alone, >>>>>>>>> regardless of whether we use the current rule, the proposed fallback >>>>>>>>> to the >>>>>>>>> session’s default catalog, or even early vs. late binding. >>>>>>>>> >>>>> >>>>>>>>> >>>>> The same fully qualified identifier (e.g., >>>>>>>>> catalogA.namespace.table) can resolve to different objects depending >>>>>>>>> solely >>>>>>>>> on engine-specific routing logic or catalog aliases. So determinism >>>>>>>>> isn’t >>>>>>>>> guaranteed just because an identifier is "fully qualified." The only >>>>>>>>> reliable anchor for identity is the UUID. That’s why the proposed use >>>>>>>>> of >>>>>>>>> UUIDs is not just a hardening strategy. It’s the actual fix for >>>>>>>>> correctness. >>>>>>>>> >>>>> >>>>>>>>> >>>>> To move the conversation forward, could you help clarify two >>>>>>>>> things in the context of the current spec: >>>>>>>>> >>>>> >>>>>>>>> >>>>> * Where in the metadata is the “view catalog” stored, so >>>>>>>>> that an engine knows to fall back to it if default-catalog is null? >>>>>>>>> >>>>> >>>>>>>>> >>>>> * Are we even allowed to create views in the session's >>>>>>>>> default catalog (i.e., without specifying a catalog) in the current >>>>>>>>> Iceberg >>>>>>>>> spec? >>>>>>>>> >>>>> >>>>>>>>> >>>>> These questions are important because if we can’t >>>>>>>>> unambiguously recover the "view catalog" from metadata, then >>>>>>>>> defaulting to >>>>>>>>> it is problematic. And if views can't be created in the default >>>>>>>>> catalog, >>>>>>>>> then the fallback rule doesn’t generalize. >>>>>>>>> >>>>> >>>>>>>>> >>>>> Thanks, >>>>>>>>> >>>>> Walaa. 
>>>>>>>>> >>>>> >>>>>>>>> >>>>> >>>>>>>>> >>>>> On Tue, Apr 22, 2025 at 3:14 AM Jan Kaul >>>>>>>>> <jank...@mailbox.org.invalid> <jank...@mailbox.org.invalid> wrote: >>>>>>>>> >>>>>> >>>>>>>>> >>>>>> Hi Walaa, >>>>>>>>> >>>>>> >>>>>>>>> >>>>>> Thank you for your proposal. If I understood correctly, your >>>>>>>>> proposal is composed of three parts: >>>>>>>>> >>>>>> >>>>>>>>> >>>>>> - session default catalog as fallback for "default-catalog" >>>>>>>>> >>>>>> >>>>>>>>> >>>>>> - session default namespace as fallback for >>>>>>>>> "default-namespace" >>>>>>>>> >>>>>> >>>>>>>>> >>>>>> - Late binding + UUID validation >>>>>>>>> >>>>>> >>>>>>>>> >>>>>> I have some comments regarding these points. >>>>>>>>> >>>>>> >>>>>>>>> >>>>>> >>>>>>>>> >>>>>> 1. Session default catalog as fallback for "default-catalog" >>>>>>>>> >>>>>> >>>>>>>>> >>>>>> Introducing a behavior that depends on the current session >>>>>>>>> setup is in my opinion the definition of "non-determinism". You could >>>>>>>>> be >>>>>>>>> running the same query-engine and catalog-setup on different days, >>>>>>>>> with >>>>>>>>> different default session catalogs (which is rather common), and >>>>>>>>> would be >>>>>>>>> getting different results. >>>>>>>>> >>>>>> >>>>>>>>> >>>>>> Whereas with the current behavior, the view always produces >>>>>>>>> the same results. The current behavior has some rough edges in very >>>>>>>>> niche >>>>>>>>> use cases but I think is solid for most use cases. >>>>>>>>> >>>>>> >>>>>>>>> >>>>>> 2. Session default namespace as fallback for >>>>>>>>> "default-namespace" >>>>>>>>> >>>>>> >>>>>>>>> >>>>>> Similar to the above. >>>>>>>>> >>>>>> >>>>>>>>> >>>>>> 3. Late binding + UUID validation >>>>>>>>> >>>>>> >>>>>>>>> >>>>>> If I understand it correctly, the current implementation >>>>>>>>> already uses late binding. >>>>>>>>> >>>>>> >>>>>>>>> >>>>>> Generally, having UUID validation makes the setup more >>>>>>>>> robust, which is great.
However, having UUID validation still >>>>>>>>> requires us >>>>>>>>> to have a portable table identifier specification. Even if we have the >>>>>>>>> UUIDs of the referenced tables from the view, there simply isn't an >>>>>>>>> interface that lets us use those UUIDs. The catalog interface is >>>>>>>>> defined >>>>>>>>> in terms of table identifiers. >>>>>>>>> >>>>>> >>>>>>>>> >>>>>> So we always require a working catalog setup and suitable >>>>>>>>> table identifiers to obtain the table metadata. We can use the UUIDs >>>>>>>>> to >>>>>>>>> verify if we loaded the correct table. But this can only be done >>>>>>>>> after we >>>>>>>>> have used some identifier, which means there is no way of using UUIDs >>>>>>>>> without a >>>>>>>>> functioning catalog/identifier setup. >>>>>>>>> >>>>>> >>>>>>>>> >>>>>> >>>>>>>>> >>>>>> In conclusion, I prefer the current behavior for >>>>>>>>> "default-catalog" because it is more deterministic in my opinion. And >>>>>>>>> I >>>>>>>>> think the current spec does a good job for multi-engine table >>>>>>>>> identifier >>>>>>>>> resolution. I see the UUID validation more as an additional hardening >>>>>>>>> strategy. >>>>>>>>> >>>>>> >>>>>>>>> >>>>>> Thanks >>>>>>>>> >>>>>> >>>>>>>>> >>>>>> Jan >>>>>>>>> >>>>>> >>>>>>>>> >>>>>> On 4/21/25 17:38, Walaa Eldin Moustafa wrote: >>>>>>>>> >>>>>> >>>>>>>>> >>>>>> Thanks Renjie! >>>>>>>>> >>>>>> >>>>>>>>> >>>>>> The existing spec has some guidance on resolving catalogs >>>>>>>>> on the fly already (to address the case of view text with table >>>>>>>>> identifiers >>>>>>>>> missing the catalog part). The guidance is to use the catalog where >>>>>>>>> the >>>>>>>>> view is stored. But I find this rule hard to interpret or use. The >>>>>>>>> catalog >>>>>>>>> itself is a logical construct, such as a federated catalog that >>>>>>>>> delegates to >>>>>>>>> multiple physical backends (e.g., HMS and REST).
In such cases, the >>>>>>>>> catalog >>>>>>>>> (e.g., `my_catalog` in `my_catalog.namespace1.table1`) doesn’t >>>>>>>>> physically >>>>>>>>> store the tables; it only routes requests to underlying stores. >>>>>>>>> Therefore, >>>>>>>>> defaulting identifier resolution based on the catalog where the view >>>>>>>>> is >>>>>>>>> "stored" doesn’t align with how catalogs actually behave in practice. >>>>>>>>> >>>>>> >>>>>>>>> >>>>>> Thanks, >>>>>>>>> >>>>>> Walaa. >>>>>>>>> >>>>>> >>>>>>>>> >>>>>> On Sun, Apr 20, 2025 at 11:17 PM Renjie Liu < >>>>>>>>> liurenjie2...@gmail.com> wrote: >>>>>>>>> >>>>>>> >>>>>>>>> >>>>>>> Hi, Walaa: >>>>>>>>> >>>>>>> >>>>>>>>> >>>>>>> Thanks for the proposal. >>>>>>>>> >>>>>>> >>>>>>>>> >>>>>>> I've reviewed the doc, but in general I have some concerns >>>>>>>>> with resolving catalog names on the fly with query engine defined >>>>>>>>> catalog >>>>>>>>> names. This introduces some flexibility at first glance, but also >>>>>>>>> makes >>>>>>>>> misconfiguration difficult to explain. >>>>>>>>> >>>>>>> >>>>>>>>> >>>>>>> But I agree with one part: that we should store the resolved >>>>>>>>> table UUID in view metadata, as table/view renaming may introduce >>>>>>>>> errors >>>>>>>>> that are difficult for users to understand. >>>>>>>>> >>>>>>> >>>>>>>>> >>>>>>> On Sat, Apr 19, 2025 at 3:02 AM Walaa Eldin Moustafa < >>>>>>>>> wa.moust...@gmail.com> wrote: >>>>>>>>> >>>>>>>> >>>>>>>>> >>>>>>>> Hi Everyone, >>>>>>>>> >>>>>>>> >>>>>>>>> >>>>>>>> Looking forward to keeping up the momentum and closing >>>>>>>>> out the MV spec as well. I’m hoping we can proceed to a vote next >>>>>>>>> week. >>>>>>>>> >>>>>>>> >>>>>>>>> >>>>>>>> Here is a summary in case that helps. The proposal >>>>>>>>> outlines a strategy for handling table identifiers in Iceberg view >>>>>>>>> metadata, with the goal of ensuring correctness, portability, and >>>>>>>>> engine >>>>>>>>> compatibility.
It recommends resolving table identifiers at read time (late binding) rather than creation time, and introduces UUID-based validation to maintain identity guarantees across engines or sessions. It also revises how default-catalog and default-namespace are handled (defaulting both to the session context if not explicitly set) to better align with engine behavior and improve cross-engine interoperability.
>>>>>>>>> >>>>>>>>
>>>>>>>>> >>>>>>>> Please let me know your thoughts.
>>>>>>>>> >>>>>>>>
>>>>>>>>> >>>>>>>> Thanks,
>>>>>>>>> >>>>>>>> Walaa.
>>>>>>>>> >>>>>>>>
>>>>>>>>> >>>>>>>> On Wed, Apr 16, 2025 at 2:03 PM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>>>>>>>>> >>>>>>>>>
>>>>>>>>> >>>>>>>>> Thanks Eduard and Sung! I have addressed the comments.
>>>>>>>>> >>>>>>>>>
>>>>>>>>> >>>>>>>>> One key point to keep in mind is that catalog names in the spec refer to logical catalogs, i.e., the first part of a three-part identifier. These correspond to Spark's DataSourceV2 catalogs, Trino connectors, and similar constructs. This is a level of abstraction above physical catalogs, which are not referenced or used in the view spec. The reason is that table identifiers in the view definition/text itself refer to logical catalogs, not physical ones (since they interface directly with the engine and not a specific metastore).
>>>>>>>>> >>>>>>>>>
>>>>>>>>> >>>>>>>>> Thanks,
>>>>>>>>> >>>>>>>>> Walaa.
>>>>>>>>> >>>>>>>>>
>>>>>>>>> >>>>>>>>> On Wed, Apr 16, 2025 at 6:15 AM Sung Yun <sungwy...@gmail.com> wrote:
>>>>>>>>> >>>>>>>>>>
>>>>>>>>> >>>>>>>>>> Thank you Walaa for the proposal.
I think view portability is a very important topic for us to continue discussing, as it relies on many assumptions within the data ecosystem in order to function, as you've highlighted well in the document.
>>>>>>>>> >>>>>>>>>>
>>>>>>>>> >>>>>>>>>> I've added a few comments around how this may impact the permission questions the engines will be asking, and whether that is the desired behavior.
>>>>>>>>> >>>>>>>>>>
>>>>>>>>> >>>>>>>>>> Sung
>>>>>>>>> >>>>>>>>>>
>>>>>>>>> >>>>>>>>>> On Wed, Apr 16, 2025 at 7:32 AM Eduard Tudenhöfner <etudenhoef...@apache.org> wrote:
>>>>>>>>> >>>>>>>>>>>
>>>>>>>>> >>>>>>>>>>> Thanks Walaa for tackling this problem. I've added a few comments to get a better understanding of what this will look like in the actual implementation.
>>>>>>>>> >>>>>>>>>>>
>>>>>>>>> >>>>>>>>>>> Eduard
>>>>>>>>> >>>>>>>>>>>
>>>>>>>>> >>>>>>>>>>> On Tue, Apr 15, 2025 at 7:09 PM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>> >>>>>>>>>>>> Hi Everyone,
>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>> >>>>>>>>>>>> Starting this thread to resume our discussion on how to reference table identifiers from Iceberg metadata, a key aspect of the view specification, particularly in relation to the MV (materialized view) extensions.
>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>> >>>>>>>>>>>> I had the chance to speak offline with a few community members to better understand how the current spec is being interpreted. Those conversations served as inputs to a new proposal on how table identifier references could be represented in metadata.
>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>> >>>>>>>>>>>> You can find the proposal here [1].
I look forward to your feedback and working together to move this forward so we can finalize the MV spec as well.
>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>> >>>>>>>>>>>> [1] https://docs.google.com/document/d/1-I2v_OqBgJi_8HVaeH1u2jowghmXoB8XaJLzPBa_Hg8/edit?tab=t.0
>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>> >>>>>>>>>>>> Thanks,
>>>>>>>>> >>>>>>>>>>>> Walaa.
>>>>>>>>> >>>>>>>>