Re: [DISCUSS] Table Identifiers in Iceberg View Spec

Walaa Eldin Moustafa Mon, 28 Apr 2025 15:20:33 -0700

Correction of typo: both engines seem to set default-catalog to the view
catalog if it is defined, or to null if the view catalog is not defined.
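
For anyone catching up, here is a rough sketch of the fallback rule as it has
been described in this thread: a reference without a catalog part uses
default-catalog when it is non-null, and otherwise the catalog through which
the view was loaded. The function name, the parameters, and the assumption of
single-level namespaces (so a two-part reference is namespace.table) are
illustrative only and are not taken from the spec or from any engine.

```python
from typing import Optional, Sequence, Tuple


def resolve_table_identifier(
    parts: Sequence[str],
    default_namespace: str,
    default_catalog: Optional[str],
    view_catalog: str,
) -> Tuple[str, str, str]:
    """Qualify a table reference from a view's SQL text (hypothetical helper).

    Assumes single-level namespaces: 1 part = table, 2 parts = namespace.table,
    3 parts = catalog.namespace.table.
    """
    # A null default-catalog falls back to the catalog the view was loaded from.
    catalog = default_catalog if default_catalog is not None else view_catalog
    if len(parts) == 3:
        return (parts[0], parts[1], parts[2])
    if len(parts) == 2:
        return (catalog, parts[0], parts[1])
    if len(parts) == 1:
        return (catalog, default_namespace, parts[0])
    raise ValueError(f"unsupported identifier: {'.'.join(parts)}")


# View created as catalogA.sales.monthly_orders over "sales.orders";
# the catalog is later referred to as catalogB.
ref = ("sales", "orders")
unpinned = resolve_table_identifier(ref, "sales", None, view_catalog="catalogB")
pinned = resolve_table_identifier(ref, "sales", "catalogA", view_catalog="catalogB")
print(unpinned)  # follows the name the view is loaded through
print(pinned)    # still carries the creation-time name
```

Under these assumptions, the null case follows whatever name the view is
queried through, while the non-null case keeps pointing at the creation-time
name, which is the rename sensitivity raised earlier in the thread.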


On Mon, Apr 28, 2025 at 3:06 PM Walaa Eldin Moustafa <wa.moust...@gmail.com>
wrote:

> Hi Dan,
>
> Thanks again for your response.
>
> I agree that catalog renaming is an environmental event, but it's a real
> one that happens frequently in practice.
> Saying that the Iceberg spec cannot accommodate something as common as
> catalog renaming feels very restrictive, and could make the spec less
> practical, even unusable, for real-world deployments.
> I’m sharing this from the perspective of a large data lake environment
> where views are heavily deployed and operationalized.
>
> Further, it's worth noting that the table spec is resilient to catalog
> renaming, but the view spec is not. If we have an opportunity to make the
> view spec similarly resilient, I wonder why not?
> Both specifications are deterministic in their definition, but one is more
> fragile to environmental changes than the other. Improving resilience does
> not sacrifice determinism. It simply makes views safer and more portable
> over time.
>
> Separately, given that there is no SQL construct today to explicitly set
> default-catalog at creation time, what is the intuition behind how engines
> like Spark and Trino currently assign default-catalog?
> Today, both engines seem to set default-catalog to null if the view
> catalog is defined, or to the view catalog if not.
> What was the intended thought process behind this behavior?
>
> Thanks,
> Walaa
>
> On Mon, Apr 28, 2025 at 1:33 PM Daniel Weeks <dwe...@apache.org> wrote:
>
>> Walaa,
>>
>> > tables inside views remain reachable after a catalog rename
>>
>> This problem stems from the exact environmental/configuration issue that
>> we should not be trying to address.  I don't think we would expect
>> references to survive a catalog rename.  That's not something covered by
>> the spec and needs to be handled separately as a platform-level migration
>> specific to the affected environment.
>>
>> The identifier resolution logic is clear and deterministic.  It should
>> not matter whether an engine resolves and encodes the default-catalog or
>> leaves it to the resolution rules.
>>
>> The issue isn't with how the spec is defined, but rather view behavior
>> when you start altering the environment around it, which isn't something we
>> should be trying to define here.
>>
>> -Dan
>>
>> On Mon, Apr 28, 2025 at 12:17 PM Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>>
>>> Hi Dan,
>>>
>>> Thanks for chiming in.
>>>
>>> I believe the issues we’re seeing now go beyond just catalog naming
>>> consistency. The behavior around default-catalog itself introduces
>>> resolution inconsistencies even when catalog names are consistent.
>>> For example:
>>>
>>> * When default-catalog is set to null, tables inside views remain
>>> reachable after a catalog rename. But if it is set to a non-null value,
>>> table references will break.
>>>
>>> * default-catalog causes table references inside views to be early bound
>>> (i.e., bound at view creation time, especially when using a non-null
>>> value), while table references inside standalone queries are late bound
>>> (bound at query time). This creates inconsistencies when resolving the same
>>> table name inside and outside views, even within the same job.
>>>
>>> * It causes Spark's and Trino's behavior to drift from the spec. There is
>>> no way to fully align Spark's behavior without making invasive changes to
>>> the Spark SQL grammar and the View DataSource API (specifically on the
>>> CREATE side). This challenge would extend to other engines too. Both Spark
>>> and Trino set this field based on a heuristic in today's implementation.
>>>
>>> * With view nesting (views depending on views), these inconsistencies
>>> amplify further, forcing users and engines to reason about catalog
>>> resolution at every level in the view tree.
>>>
>>> * It will be difficult to migrate Hive views to Iceberg with that model.
>>> Migrated Hive views will have to deviate from that spec.
>>>
>>> How would you suggest approaching the engine-level changes required to
>>> support the current default-catalog field?
>>> Also, do you believe the Spark and Trino communities would align around
>>> having table resolution behave inconsistently between queries and views, or
>>> inconsistency between Iceberg and other types of views?
>>>
>>> Thanks,
>>> Walaa
>>>
>>>
>>> On Mon, Apr 28, 2025 at 11:34 AM Daniel Weeks <dwe...@apache.org> wrote:
>>>
>>>> I would agree with Jan's summary of why 'default-catalog' was
>>>> introduced, but I think we need to step back and align on what we are
>>>> really attempting to support in the spec.
>>>>
>>>> The issues we're discussing largely stem from using multiple engines
>>>> with cross catalog references and configurations where catalog names are
>>>> not aligned.  If we have multiple engines that all have the same catalog
>>>> names/configurations, the current spec implementation is well defined for
>>>> table resolution even across catalogs.  The 'default-catalog' (and
>>>> namespace equivalent) was intended to address the resolution within the
>>>> context of the sql text, not to address catalog/naming inconsistencies.
>>>>
>>>> I feel like we're trying to adapt the original intent to address the
>>>> catalog naming/configuration and would argue that we shouldn't attempt to
>>>> do that as part of the spec.  Inconsistently named catalogs are a reality,
>>>> but we should consider that a configuration/environmental issue, not
>>>> something to solve for in the spec.
>>>>
>>>> We should support and advocate for consistency in catalog naming and
>>>> define the spec along those lines.  The fact is that with all of the recent
>>>> work that's gone into making catalogs pluggable, it makes more sense to
>>>> just register catalog configuration with consistent names (even if you have
>>>> to duplicate the configuration for supporting existing readers/writers).  I
>>>> think it's better to provide a path toward consistency than to normalize
>>>> complicated schemes to work around the issues caused by
>>>> environmental/configuration inconsistencies.
>>>>
>>>> If the goal is to create clever ways to hack the late binding
>>>> resolution to swap in different catalogs or make references contextual, I
>>>> feel like that is something we should strongly discourage as it leads to
>>>> confusion about what is resolved as part of the query.
>>>>
>>>> At this point, I don't see a good argument to add
>>>> additional configuration or change the resolution behaviors.
>>>>
>>>> -Dan
>>>>
>>>>
>>>>
>>>> On Mon, Apr 28, 2025 at 12:40 AM Jan Kaul <jank...@mailbox.org.invalid>
>>>> wrote:
>>>>
>>>>> I think the intention with the "default-catalog" was that every query
>>>>> engine uses it to store its session default catalog at the time of
>>>>> creating the view. This way the view could be reused in another session.
>>>>> The idea was not to introduce an additional SQL syntax to set the
>>>>> default-catalog.
>>>>>
>>>>> Generally we have different environments we want to support with the
>>>>> view spec:
>>>>>
>>>>> 1. Consistent catalog naming
>>>>>
>>>>> When the environment supports it, using consistent catalog names can
>>>>> have a great benefit for multi-catalog, multi-engine setups. With
>>>>> consistent catalog names, using the "default-catalog" field works without
>>>>> any issues.
>>>>>
>>>>> 2. Inconsistent catalog naming
>>>>>
>>>>> This can be the case when different query engines refer to the same
>>>>> physical catalog by different names. This often happens because different
>>>>> query engines use different strategies to set up the catalogs. If catalogs
>>>>> have inconsistent naming, using the "default-catalog" field does not work
>>>>> because it is not guaranteed that the catalog name can be resolved with
>>>>> another engine. Using the "view catalog" as a fallback is a better
>>>>> solution for this use case, as it avoids catalog names altogether. It is,
>>>>> however, limited to table references in the same catalog.
>>>>>
>>>>>
>>>>> What do you think of introducing a view property that specifies if the
>>>>> "default-catalog" or the "view catalog" should be used? This way, you
>>>>> could use the "default-catalog" in environments where you can guarantee
>>>>> consistent naming, but you would be able to fall back directly to the
>>>>> "view catalog" when you don't have consistent naming. The query engines
>>>>> could set the default for this view property at creation time. Spark, for
>>>>> example, could set it to automatically use the "view catalog".
>>>>>
>>>>> Thanks
>>>>>
>>>>> Jan
>>>>>
>>>>>
>>>>> On 4/26/25 05:33, Walaa Eldin Moustafa wrote:
>>>>>
>>>>> To help folks catch up on the latest discussions and interpretation of
>>>>> the spec, I have summarized everything we discussed so far at the top of
>>>>> the proposal document (here
>>>>> <https://docs.google.com/document/d/1-I2v_OqBgJi_8HVaeH1u2jowghmXoB8XaJLzPBa_Hg8/edit?tab=t.0>).
>>>>> I have slightly updated the proposal to be in sync with the new
>>>>> interpretation to avoid confusion. In summary:
>>>>>
>>>>> * Remove default-catalog and default-namespace fields from the view
>>>>> spec completely.
>>>>>
>>>>> * Hence, we do not attempt to define separate view-level default
>>>>> catalogs or namespaces.
>>>>>
>>>>> Instead:
>>>>>
>>>>> * If a table identifier inside a view lacks a catalog qualifier,
>>>>> engines should resolve it using the current engine catalog at query time.
>>>>>
>>>>> * Reference table identifiers in the metadata exactly as they appear
>>>>> in the view SQL text.
>>>>>
>>>>> * If an identifier lacks the catalog part at creation, it should still
>>>>> lack a catalog in the stored metadata.
>>>>>
>>>>> * Store UUIDs alongside table identifiers whenever possible.
>>>>>
>>>>> Thanks,
>>>>> Walaa.
>>>>>
>>>>>
>>>>> On Fri, Apr 25, 2025 at 5:18 PM Walaa Eldin Moustafa <
>>>>> wa.moust...@gmail.com> wrote:
>>>>>
>>>>>> Thanks for the contribution Benny! +1 to the confusion the fallback
>>>>>> creates. Also just to be clear, at this point and after clarifying the
>>>>>> current spec intentions, I am convinced that we should remove the default
>>>>>> catalog and default namespace fields altogether.
>>>>>>
>>>>>> Thanks,
>>>>>> Walaa.
>>>>>>
>>>>>> On Fri, Apr 25, 2025 at 5:13 PM Benny Chow <btc...@gmail.com> wrote:
>>>>>>
>>>>>>> I'd like to contribute my opinions on this:
>>>>>>>
>>>>>>> - I don't particularly like the current behavior of "default to the
>>>>>>> view's catalog when default-catalog is not set".  Fundamentally, I
>>>>>>> believe the intent of default-catalog and default-namespace is there to
>>>>>>> help users write more concise SQL.
>>>>>>> - The Spark session catalog is engine specific and I don't think we
>>>>>>> should design something that says first use this catalog, then that
>>>>>>> catalog... or that catalog.  For example, resolving identifiers using
>>>>>>> default-catalog -> view's catalog -> session catalog is not good.
>>>>>>> - We gotta support non-Iceberg tables otherwise I see no value in
>>>>>>> putting views in the catalog to share with other engines
>>>>>>> - Interoperability between different engine types is very hard due
>>>>>>> to dialect issues... so I think we should focus on supporting different
>>>>>>> clusters of the same engine type on a shared catalog.  For example, AI
>>>>>>> and BI clusters on Spark sharing the same views in a REST catalog.
>>>>>>>
>>>>>>> Coincidentally, I think the ultimate solution is along the lines of
>>>>>>> something Russell proposed last year:
>>>>>>>
>>>>>>> https://lists.apache.org/thread/hoskfx8y3kvrcww52l4w9dxghp3pnlm7
>>>>>>>
>>>>>>> We've been looking at this interoperable identifier problem through
>>>>>>> the lens of catalog resolution but maybe the right approach is really
>>>>>>> about templating.
>>>>>>>
>>>>>>> I would extend Russell's idea to allow identifiers in a view to span
>>>>>>> catalogs to support non-Iceberg tables.   Also, the default-catalog
>>>>>>> property could be templated as well.
>>>>>>>
>>>>>>> Thoughts?
>>>>>>> Benny
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Apr 25, 2025 at 4:02 PM Walaa Eldin Moustafa <
>>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks Steven! How do you recommend making the Spark implementation
>>>>>>>> conform to the spec? Do we need Spark SQL extensions and/or Spark
>>>>>>>> catalog APIs for that?
>>>>>>>>
>>>>>>>> How do you recommend reconciling the inconsistencies I shared
>>>>>>>> regarding many resolution methods not consistently being followed in
>>>>>>>> different scenarios (view vs child table resolution, query vs view
>>>>>>>> resolution)? Note these occur when the default catalog is set to a
>>>>>>>> non-null value. If it helps, I can share concrete examples.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Walaa.
>>>>>>>>
>>>>>>>> On Fri, Apr 25, 2025 at 3:52 PM Steven Wu <stevenz...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> The core issue is on the fallback behavior when `default-catalog` is
>>>>>>>>> not defined. Current view spec says the fallback should be the catalog
>>>>>>>>> where the view is defined. It doesn't really matter what the catalog
>>>>>>>>> is named (catalogX) by the read engine.
>>>>>>>>> - If a view refers to the tables in the same catalog, this is a
>>>>>>>>> non-ambiguous and reasonable fallback behavior.
>>>>>>>>> - If a view refers to tables from another catalog, catalog names
>>>>>>>>> should be included in the reference name already. So no ambiguity
>>>>>>>>> there either.
>>>>>>>>>
>>>>>>>>> Potential inconsistent naming of catalogs is a separate problem, which
>>>>>>>>> the Iceberg view spec probably cannot solve. We can only recommend that
>>>>>>>>> catalogs should be named consistently across usage for better
>>>>>>>>> interoperability on name references.
>>>>>>>>>
>>>>>>>>> This proposal is to change the fallback behavior to the engine's
>>>>>>>>> session default catalog. I am not sure it is better than the current
>>>>>>>>> fallback behavior.
>>>>>>>>>
>>>>>>>>> > Today’s Spark behavior explicitly differs from this idea. Spark
>>>>>>>>> > resolves table identifiers during view creation using the session’s
>>>>>>>>> > default catalog, not a supplied `default-catalog`.
>>>>>>>>>
>>>>>>>>> I would argue that is a Spark implementation issue for not conforming
>>>>>>>>> to the spec.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Apr 25, 2025 at 1:17 PM Walaa Eldin Moustafa
>>>>>>>>> <wa.moust...@gmail.com> wrote:
>>>>>>>>> >
>>>>>>>>> > Hi Jan,
>>>>>>>>> >
>>>>>>>>> > Thanks again for continuing the discussion. I want to highlight
>>>>>>>>> > a few fundamental issues around the interpretation of default-catalog:
>>>>>>>>> >
>>>>>>>>> > Here is the real catch:
>>>>>>>>> >
>>>>>>>>> > * default-catalog cannot logically be defined at view creation
>>>>>>>>> > time. It would be circular: the view needs to exist before its
>>>>>>>>> > metadata (and hence default-catalog) can exist. This is visible in
>>>>>>>>> > Spark’s implementation, where `default-catalog` is not used.
>>>>>>>>> >
>>>>>>>>> > * Introducing a creation-time default-catalog setting would
>>>>>>>>> > require extending SQL syntax and engine APIs to promote it to a
>>>>>>>>> > first-class view concept. This would be intrusive, non-intuitive, and
>>>>>>>>> > realistically very difficult to standardize across engines.
>>>>>>>>> >
>>>>>>>>> > * Today’s Spark behavior explicitly differs from this idea.
>>>>>>>>> > Spark resolves table identifiers during view creation using the
>>>>>>>>> > session’s default catalog, not a supplied `default-catalog`.
>>>>>>>>> >
>>>>>>>>> > * Hypothetically even if we patched in a creation-time
>>>>>>>>> > default-catalog, it would create an inconsistent binding model between
>>>>>>>>> > tables vs views (early vs late), and between tables in views and in
>>>>>>>>> > queries (again early vs late). For example, views and tables in
>>>>>>>>> > queries can withstand default catalog renames, but tables cannot when
>>>>>>>>> > they are used inside views -- it even applies to views inside views,
>>>>>>>>> > which makes this very hard to reason about considering nesting.
>>>>>>>>> >
>>>>>>>>> > Thanks,
>>>>>>>>> > Walaa
>>>>>>>>> >
>>>>>>>>> > On Fri, Apr 25, 2025 at 7:00 AM Jan Kaul
>>>>>>>>> > <jank...@mailbox.org.invalid> wrote:
>>>>>>>>> >>
>>>>>>>>> >> @Walaa:
>>>>>>>>> >>
>>>>>>>>> >> I would argue that when you run a CREATE VIEW statement the
>>>>>>>>> >> query engine knows which catalog the view is being created in. So
>>>>>>>>> >> even though we typically use late binding to resolve the view catalog
>>>>>>>>> >> at query time, it can also be used at creation time.
>>>>>>>>> >>
>>>>>>>>> >> The query engine would need to keep track of the "view catalog"
>>>>>>>>> >> in which the view is going to be created. It can use that catalog to
>>>>>>>>> >> resolve partial table identifiers if "default-catalog" is not set.
>>>>>>>>> >>
>>>>>>>>> >> It can lead to some unintuitive behavior, where partial
>>>>>>>>> >> identifiers in the view query resolve to a different catalog compared
>>>>>>>>> >> to using them outside of a view.
>>>>>>>>> >>
>>>>>>>>> >> CREATE VIEW catalogA.sales.monthly_orders AS SELECT * from
>>>>>>>>> >> sales.orders;
>>>>>>>>> >>
>>>>>>>>> >> If the session default catalog is not "catalogA", the
>>>>>>>>> >> "sales.orders" in the view query would not be the same as just
>>>>>>>>> >> referencing "sales.orders" in a normal SQL statement. This is because
>>>>>>>>> >> without a "default-catalog", the catalog name of "sales.orders" would
>>>>>>>>> >> default to "catalogA", which is the view's catalog.
>>>>>>>>> >>
>>>>>>>>> >> Thanks,
>>>>>>>>> >>
>>>>>>>>> >> Jan
>>>>>>>>> >>
>>>>>>>>> >> On 4/25/25 04:05, Manu Zhang wrote:
>>>>>>>>> >>>
>>>>>>>>> >>> For example, if we want to validate that the tables referenced
>>>>>>>>> in the view exist, how can we do that when default-catalog isn't 
>>>>>>>>> defined,
>>>>>>>>> since the view hasn't been created or loaded yet?
>>>>>>>>> >>
>>>>>>>>> >> I don't think this is related to view spec. How do we validate
>>>>>>>>> that a table exists without a default catalog, or do we always use the
>>>>>>>>> current session catalog?
>>>>>>>>> >>
>>>>>>>>> >> Thanks,
>>>>>>>>> >> Manu
>>>>>>>>> >>
>>>>>>>>> >> On Fri, Apr 25, 2025 at 5:59 AM Walaa Eldin Moustafa <
>>>>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>>>> >>>
>>>>>>>>> >>> Hi Jan,
>>>>>>>>> >>>
>>>>>>>>> >>> I think we still share the same understanding. Just to
>>>>>>>>> clarify: when I referred to late binding as “similar” to the 
>>>>>>>>> proposal, I
>>>>>>>>> was acknowledging the distinction between view-level and table-level
>>>>>>>>> resolution. But as you noted, both follow a late binding model.
>>>>>>>>> >>>
>>>>>>>>> >>> That said, this still raises an interesting question and a
>>>>>>>>> potential gap: if default-catalog is only defined at query time, how 
>>>>>>>>> should
>>>>>>>>> resolution work during view creation? For example, if we want to 
>>>>>>>>> validate
>>>>>>>>> that the tables referenced in the view exist, how can we do that when
>>>>>>>>> default-catalog isn't defined, since the view hasn't been created or 
>>>>>>>>> loaded
>>>>>>>>> yet?
>>>>>>>>> >>>
>>>>>>>>> >>> Thanks,
>>>>>>>>> >>> Walaa.
>>>>>>>>> >>>
>>>>>>>>> >>> On Thu, Apr 24, 2025 at 7:02 AM Jan Kaul
>>>>>>>>> >>> <jank...@mailbox.org.invalid> wrote:
>>>>>>>>> >>>>
>>>>>>>>> >>>> Yes, I have the same understanding. The view catalog is
>>>>>>>>> resolved at query time.
>>>>>>>>> >>>>
>>>>>>>>> >>>> As you mentioned before, it's good to distinguish between the
>>>>>>>>> >>>> physical catalog and its reference used in SQL statements. The
>>>>>>>>> >>>> important part is that the physical catalog of the view and the
>>>>>>>>> >>>> tables referenced in its definition stay consistent. You could create
>>>>>>>>> >>>> a view in a given physical catalog by referring to it as "catalogA",
>>>>>>>>> >>>> as in your first point. If you then, given a different setup, refer
>>>>>>>>> >>>> to the same physical catalog as "catalogB" in another
>>>>>>>>> >>>> session/environment, the view should still work.
>>>>>>>>> >>>>
>>>>>>>>> >>>> I would however rephrase your last point. Late binding
>>>>>>>>> applies to the view catalog name and by extension to all partial table
>>>>>>>>> references when no "default-catalog" is present. Resolving the view 
>>>>>>>>> catalog
>>>>>>>>> name at query time is not opposed to storing the view metadata in a 
>>>>>>>>> catalog.
>>>>>>>>> >>>>
>>>>>>>>> >>>> Or maybe I don't entirely understand what you mean.
>>>>>>>>> >>>>
>>>>>>>>> >>>> Thanks
>>>>>>>>> >>>>
>>>>>>>>> >>>> Jan
>>>>>>>>> >>>>
>>>>>>>>> >>>> On 4/24/25 00:32, Walaa Eldin Moustafa wrote:
>>>>>>>>> >>>>
>>>>>>>>> >>>> Hi Jan,
>>>>>>>>> >>>>
>>>>>>>>> >>>> > The view is executed when it's being referenced in a SQL
>>>>>>>>> statement. That statement contains the information for the query 
>>>>>>>>> engine to
>>>>>>>>> resolve the catalog of the view.
>>>>>>>>> >>>>
>>>>>>>>> >>>> If I’m understanding correctly, that means:
>>>>>>>>> >>>>
>>>>>>>>> >>>> * If the view is queried as SELECT * FROM
>>>>>>>>> catalogA.namespace.view, then catalogA is considered the view’s 
>>>>>>>>> catalog.
>>>>>>>>> >>>>
>>>>>>>>> >>>> * If the same view is later queried as SELECT * FROM
>>>>>>>>> catalogB.namespace.view (after renaming catalogA to catalogB, and 
>>>>>>>>> keeping
>>>>>>>>> everything else the same), then catalogB becomes the view’s catalog.
>>>>>>>>> >>>>
>>>>>>>>> >>>> Is that interpretation correct? If so, it sounds to me like
>>>>>>>>> the catalog is resolved at query time, based on how the view is 
>>>>>>>>> referenced,
>>>>>>>>> not from any stored metadata. That would imply some sort of a late 
>>>>>>>>> binding
>>>>>>>>> behavior (similar to the proposal), as opposed to using some catalog 
>>>>>>>>> that
>>>>>>>>> "stores" the view definition.
>>>>>>>>> >>>>
>>>>>>>>> >>>> Thanks,
>>>>>>>>> >>>> Walaa
>>>>>>>>> >>>>
>>>>>>>>> >>>> On Tue, Apr 22, 2025 at 11:01 AM Jan Kaul
>>>>>>>>> >>>> <jank...@mailbox.org.invalid> wrote:
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> Hi Walaa,
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> Thanks for clarifying the aspects of non-determinism. Let me
>>>>>>>>> try to address your questions.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> 1. This is my interpretation of the current spec: The view
>>>>>>>>> >>>>> is executed when it's being referenced in a SQL statement. That
>>>>>>>>> >>>>> statement contains the information for the query engine to resolve
>>>>>>>>> >>>>> the catalog of the view. The query engine then uses that information
>>>>>>>>> >>>>> to fetch the view metadata from the catalog. It also needs to
>>>>>>>>> >>>>> temporarily keep track of which catalog it used to fetch the view
>>>>>>>>> >>>>> metadata. It can then use that information to resolve the table
>>>>>>>>> >>>>> references in the view's SQL definition in case no default catalog is
>>>>>>>>> >>>>> specified.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> 2. The important part is that the catalog can be referenced
>>>>>>>>> at execution time. As long as that's the case I would assume the view 
>>>>>>>>> can
>>>>>>>>> be created in any catalog.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> I think your point is really valuable because the current
>>>>>>>>> >>>>> specification can lead to some unintuitive behavior. For example, for
>>>>>>>>> >>>>> the following statement:
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> CREATE VIEW catalogA.sales.monthly_orders AS SELECT * from
>>>>>>>>> >>>>> sales.orders;
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> If the session default catalog is not "catalogA", the
>>>>>>>>> >>>>> "sales.orders" in the view query would not be the same as just
>>>>>>>>> >>>>> referencing "sales.orders" in a normal SQL statement. This is because
>>>>>>>>> >>>>> without a "default-catalog", the catalog name of "sales.orders" would
>>>>>>>>> >>>>> default to "catalogA".
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> However, I like the current design of the view spec, because
>>>>>>>>> >>>>> it has the "closure" property. Because the "view catalog" has to be
>>>>>>>>> >>>>> known when executing a view, all the information required to resolve
>>>>>>>>> >>>>> the table identifiers is contained in the view metadata (and the
>>>>>>>>> >>>>> "view catalog"). I think that if you make the identifier resolution
>>>>>>>>> >>>>> dependent on external parameters, it hinders portability.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> Thanks,
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> Jan
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> On 4/22/25 18:36, Walaa Eldin Moustafa wrote:
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> Hi Jan,
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> Thanks for the thoughtful feedback.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> I think it’s important we clarify a key point before going
>>>>>>>>> deeper:
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> Non-determinism is not caused by session fallback
>>>>>>>>> behavior—it’s a fundamental limitation of using table identifiers 
>>>>>>>>> alone,
>>>>>>>>> regardless of whether we use the current rule, the proposed fallback 
>>>>>>>>> to the
>>>>>>>>> session’s default catalog, or even early vs. late binding.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> The same fully qualified identifier (e.g.,
>>>>>>>>> catalogA.namespace.table) can resolve to different objects depending 
>>>>>>>>> solely
>>>>>>>>> on engine-specific routing logic or catalog aliases. So determinism 
>>>>>>>>> isn’t
>>>>>>>>> guaranteed just because an identifier is "fully qualified." The only
>>>>>>>>> reliable anchor for identity is the UUID. That’s why the proposed use 
>>>>>>>>> of
>>>>>>>>> UUIDs is not just a hardening strategy. It’s the actual fix for 
>>>>>>>>> correctness.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> To move the conversation forward, could you help clarify two
>>>>>>>>> things in the context of the current spec:
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> * Where in the metadata is the “view catalog” stored, so
>>>>>>>>> that an engine knows to fall back to it if default-catalog is null?
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> * Are we even allowed to create views in the session's
>>>>>>>>> default catalog (i.e., without specifying a catalog) in the current 
>>>>>>>>> Iceberg
>>>>>>>>> spec?
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> These questions are important because if we can’t
>>>>>>>>> unambiguously recover the "view catalog" from metadata, then 
>>>>>>>>> defaulting to
>>>>>>>>> it is problematic. And if views can't be created in the default 
>>>>>>>>> catalog,
>>>>>>>>> then the fallback rule doesn’t generalize.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> Thanks,
>>>>>>>>> >>>>> Walaa.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> On Tue, Apr 22, 2025 at 3:14 AM Jan Kaul
>>>>>>>>> >>>>> <jank...@mailbox.org.invalid> wrote:
>>>>>>>>> >>>>>>
>>>>>>>>> >>>>>> Hi Walaa,
>>>>>>>>> >>>>>>
>>>>>>>>> >>>>>> thank you for your proposal. If I understood correctly, your
>>>>>>>>> >>>>>> proposal is composed of three parts:
>>>>>>>>> >>>>>>
>>>>>>>>> >>>>>> - session default catalog as fallback for "default-catalog"
>>>>>>>>> >>>>>>
>>>>>>>>> >>>>>> - session default namespace as fallback for "default-namespace"
>>>>>>>>> >>>>>>
>>>>>>>>> >>>>>> - Late binding + UUID validation
>>>>>>>>> >>>>>>
>>>>>>>>> >>>>>> I have some comments regarding these points.
>>>>>>>>> >>>>>>
>>>>>>>>> >>>>>>
>>>>>>>>> >>>>>> 1. Session default catalog as fallback for "default-catalog"
>>>>>>>>> >>>>>>
>>>>>>>>> >>>>>> Introducing a behavior that depends on the current session
>>>>>>>>> >>>>>> setup is in my opinion the definition of "non-determinism". You could
>>>>>>>>> >>>>>> be running the same query-engine and catalog-setup on different days,
>>>>>>>>> >>>>>> with different default session catalogs (which is rather common), and
>>>>>>>>> >>>>>> would be getting different results.
>>>>>>>>> >>>>>>
>>>>>>>>> >>>>>> Whereas with the current behavior, the view always produces
>>>>>>>>> >>>>>> the same results. The current behavior has some rough edges in very
>>>>>>>>> >>>>>> niche use cases but I think is solid for most use cases.
>>>>>>>>> >>>>>>
>>>>>>>>> >>>>>> 2. Session default namespace as fallback for "default-namespace"
>>>>>>>>> >>>>>>
>>>>>>>>> >>>>>> Similar to the above.
>>>>>>>>> >>>>>>
>>>>>>>>> >>>>>> 3. Late binding + UUID validation
>>>>>>>>> >>>>>>
>>>>>>>>> >>>>>> If I understand it correctly, the current implementation
>>>>>>>>> already uses late binding.
>>>>>>>>> >>>>>>
>>>>>>>>> >>>>>> Generally, having UUID validation makes the setup more
>>>>>>>>> >>>>>> robust, which is great. However, having UUID validation still requires
>>>>>>>>> >>>>>> us to have a portable table identifier specification. Even if we have
>>>>>>>>> >>>>>> the UUIDs of the referenced tables from the view, there simply isn't
>>>>>>>>> >>>>>> an interface that lets us use those UUIDs. The catalog interface is
>>>>>>>>> >>>>>> defined in terms of table identifiers.
>>>>>>>>> >>>>>>
>>>>>>>>> >>>>>> So we always require a working catalog setup and suitable
>>>>>>>>> >>>>>> table identifiers to obtain the table metadata. We can use the UUIDs
>>>>>>>>> >>>>>> to verify if we loaded the correct table. But this can only be done
>>>>>>>>> >>>>>> after we used some identifier, which means there is no way of using
>>>>>>>>> >>>>>> UUIDs without a functioning catalog/identifier setup.
>>>>>>>>> >>>>>>
>>>>>>>>> >>>>>>
>>>>>>>>> >>>>>> In conclusion, I prefer the current behavior for
>>>>>>>>> >>>>>> "default-catalog" because it is more deterministic in my opinion. And
>>>>>>>>> >>>>>> I think the current spec does a good job for multi-engine table
>>>>>>>>> >>>>>> identifier resolution. I see the UUID validation more as an additional
>>>>>>>>> >>>>>> hardening strategy.
>>>>>>>>> >>>>>>
>>>>>>>>> >>>>>> Thanks
>>>>>>>>> >>>>>>
>>>>>>>>> >>>>>> Jan
>>>>>>>>> >>>>>>
>>>>>>>>> >>>>>> On 4/21/25 17:38, Walaa Eldin Moustafa wrote:
>>>>>>>>> >>>>>>
>>>>>>>>> >>>>>> Thanks Renjie!
>>>>>>>>> >>>>>>
>>>>>>>>> >>>>>> The existing spec has some guidance on resolving catalogs
>>>>>>>>> on the fly already (to address the case of view text with table 
>>>>>>>>> identifiers
>>>>>>>>> missing the catalog part). The guidance is to use the catalog where 
>>>>>>>>> the
>>>>>>>>> view is stored. But I find this rule hard to interpret or use. The 
>>>>>>>>> catalog
>>>>>>>>> itself is a logical construct—such as a federated catalog that 
>>>>>>>>> delegates to
>>>>>>>>> multiple physical backends (e.g., HMS and REST). In such cases, the 
>>>>>>>>> catalog
>>>>>>>>> (e.g., `my_catalog` in `my_catalog.namespace1.table1`) doesn’t 
>>>>>>>>> physically
>>>>>>>>> store the tables; it only routes requests to underlying stores. 
>>>>>>>>> Therefore,
>>>>>>>>> defaulting identifier resolution based on the catalog where the view 
>>>>>>>>> is
>>>>>>>>> "stored" doesn’t align with how catalogs actually behave in practice.
>>>>>>>>> >>>>>>
>>>>>>>>> >>>>>> Thanks,
>>>>>>>>> >>>>>> Walaa.
>>>>>>>>> >>>>>>
>>>>>>>>> >>>>>> On Sun, Apr 20, 2025 at 11:17 PM Renjie Liu <
>>>>>>>>> liurenjie2...@gmail.com> wrote:
>>>>>>>>> >>>>>>>
>>>>>>>>> >>>>>>> Hi, Walaa:
>>>>>>>>> >>>>>>>
>>>>>>>>> >>>>>>> Thanks for the proposal.
>>>>>>>>> >>>>>>>
>>>>>>>>> >>>>>>> I've reviewed the doc, but in general I have some concerns
>>>>>>>>> with resolving catalog names on the fly with query engine defined 
>>>>>>>>> catalog
>>>>>>>>> names. This introduces some flexibility at first glance, but also 
>>>>>>>>> makes
>>>>>>>>> misconfiguration difficult to explain.
>>>>>>>>> >>>>>>>
>>>>>>>>> >>>>>>> But I agree with one part: we should store the resolved
>>>>>>>>> table UUID in view metadata, as table/view renaming may introduce 
>>>>>>>>> errors
>>>>>>>>> that are difficult for users to understand.
>>>>>>>>> >>>>>>>
>>>>>>>>> >>>>>>> On Sat, Apr 19, 2025 at 3:02 AM Walaa Eldin Moustafa <
>>>>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>>>> >>>>>>>>
>>>>>>>>> >>>>>>>> Hi Everyone,
>>>>>>>>> >>>>>>>>
>>>>>>>>> >>>>>>>> Looking forward to keeping up the momentum and closing
>>>>>>>>> out the MV spec as well. I’m hoping we can proceed to a vote next 
>>>>>>>>> week.
>>>>>>>>> >>>>>>>>
>>>>>>>>> >>>>>>>> Here is a summary in case that helps. The proposal
>>>>>>>>> outlines a strategy for handling table identifiers in Iceberg view
>>>>>>>>> metadata, with the goal of ensuring correctness, portability, and 
>>>>>>>>> engine
>>>>>>>>> compatibility. It recommends resolving table identifiers at read time 
>>>>>>>>> (late
>>>>>>>>> binding) rather than creation time, and introduces UUID-based 
>>>>>>>>> validation to
>>>>>>>>> maintain identity guarantees across engines or sessions. It also 
>>>>>>>>> revises
>>>>>>>>> how default-catalog and default-namespace are handled (defaulting 
>>>>>>>>> both to
>>>>>>>>> the session context if not explicitly set) to better align with engine
>>>>>>>>> behavior and improve cross-engine interoperability.
>>>>>>>>> >>>>>>>>
>>>>>>>>> >>>>>>>> Please let me know your thoughts.
>>>>>>>>> >>>>>>>>
>>>>>>>>> >>>>>>>> Thanks,
>>>>>>>>> >>>>>>>> Walaa.
>>>>>>>>> >>>>>>>>
>>>>>>>>> >>>>>>>>
>>>>>>>>> >>>>>>>>
>>>>>>>>> >>>>>>>> On Wed, Apr 16, 2025 at 2:03 PM Walaa Eldin Moustafa <
>>>>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>>>> >>>>>>>>>
>>>>>>>>> >>>>>>>>> Thanks Eduard and Sung! I have addressed the comments.
>>>>>>>>> >>>>>>>>>
>>>>>>>>> >>>>>>>>> One key point to keep in mind is that catalog names in
>>>>>>>>> the spec refer to logical catalogs—i.e., the first part of a 
>>>>>>>>> three-part
>>>>>>>>> identifier. These correspond to Spark's DataSourceV2 catalogs, Trino
>>>>>>>>> connectors, and similar constructs. This is a level of abstraction 
>>>>>>>>> above
>>>>>>>>> physical catalogs, which are not referenced or used in the view spec. 
>>>>>>>>> The
>>>>>>>>> reason is that table identifiers in the view definition/text itself 
>>>>>>>>> refer
>>>>>>>>> to logical catalogs, not physical ones (since they interface directly 
>>>>>>>>> with
>>>>>>>>> the engine and not a specific metastore).
>>>>>>>>> >>>>>>>>>
>>>>>>>>> >>>>>>>>> Thanks,
>>>>>>>>> >>>>>>>>> Walaa.
>>>>>>>>> >>>>>>>>>
>>>>>>>>> >>>>>>>>>
>>>>>>>>> >>>>>>>>> On Wed, Apr 16, 2025 at 6:15 AM Sung Yun <
>>>>>>>>> sungwy...@gmail.com> wrote:
>>>>>>>>> >>>>>>>>>>
>>>>>>>>> >>>>>>>>>> Thank you Walaa for the proposal. I think view
>>>>>>>>> portability is a very important topic for us to continue discussing 
>>>>>>>>> as it
>>>>>>>>> relies on many assumptions within the data ecosystem for it to 
>>>>>>>>> function,
>>>>>>>>> as you've highlighted well in the document.
>>>>>>>>> >>>>>>>>>>
>>>>>>>>> >>>>>>>>>> I've added a few comments around how this may impact
>>>>>>>>> the permission questions the engines will be asking, and whether that 
>>>>>>>>> is
>>>>>>>>> the desired behavior.
>>>>>>>>> >>>>>>>>>>
>>>>>>>>> >>>>>>>>>> Sung
>>>>>>>>> >>>>>>>>>>
>>>>>>>>> >>>>>>>>>> On Wed, Apr 16, 2025 at 7:32 AM Eduard Tudenhöfner <
>>>>>>>>> etudenhoef...@apache.org> wrote:
>>>>>>>>> >>>>>>>>>>>
>>>>>>>>> >>>>>>>>>>> Thanks Walaa for tackling this problem. I've added a
>>>>>>>>> few comments to get a better understanding of what this will look like 
>>>>>>>>> in
>>>>>>>>> the actual implementation.
>>>>>>>>> >>>>>>>>>>>
>>>>>>>>> >>>>>>>>>>> Eduard
>>>>>>>>> >>>>>>>>>>>
>>>>>>>>> >>>>>>>>>>> On Tue, Apr 15, 2025 at 7:09 PM Walaa Eldin Moustafa <
>>>>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>> >>>>>>>>>>>> Hi Everyone,
>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>> >>>>>>>>>>>> Starting this thread to resume our discussion on how
>>>>>>>>> to reference table identifiers from Iceberg metadata, a key aspect of 
>>>>>>>>> the
>>>>>>>>> view specification, particularly in relation to the MV (materialized 
>>>>>>>>> view)
>>>>>>>>> extensions.
>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>> >>>>>>>>>>>> I had the chance to speak offline with a few
>>>>>>>>> community members to better understand how the current spec is being
>>>>>>>>> interpreted. Those conversations served as inputs to a new proposal 
>>>>>>>>> on how
>>>>>>>>> table identifier references could be represented in metadata.
>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>> >>>>>>>>>>>> You can find the proposal here [1]. I look forward to
>>>>>>>>> your feedback and working together to move this forward so we can 
>>>>>>>>> finalize
>>>>>>>>> the MV spec as well.
>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>> >>>>>>>>>>>> [1]
>>>>>>>>> https://docs.google.com/document/d/1-I2v_OqBgJi_8HVaeH1u2jowghmXoB8XaJLzPBa_Hg8/edit?tab=t.0
>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>> >>>>>>>>>>>> Thanks,
>>>>>>>>> >>>>>>>>>>>> Walaa.
>>>>>>>>>
>>>>>>>>
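
For readers of this thread, the late-binding strategy summarized in the Apr 19 message (resolve table identifiers at read time, fall back to the session context when default-catalog or default-namespace is unset, then validate the table UUID) can be sketched roughly as follows. This is only an illustrative sketch, not the proposal's or any engine's implementation. The names `Session`, `ViewMetadata`, `referenced_uuids`, `resolve`, and `load_validated` are hypothetical, namespaces are assumed to be single-level, and catalogs are modeled as plain dictionaries.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Session:
    """Engine session context (hypothetical model)."""
    current_catalog: str
    current_namespace: str


@dataclass(frozen=True)
class ViewMetadata:
    """Subset of view metadata relevant to identifier resolution.

    `referenced_uuids` is an assumed field: it maps each identifier as
    written in the view text to the table UUID seen at creation time.
    """
    default_catalog: Optional[str] = None
    default_namespace: Optional[str] = None
    referenced_uuids: Dict[str, str] = field(default_factory=dict)


def resolve(parts: Sequence[str], view: ViewMetadata,
            session: Session) -> Tuple[str, str, str]:
    """Qualify an identifier at read time (late binding).

    Unset defaults fall back to the session context. Assumes
    single-level namespaces, so identifiers have one to three parts.
    """
    catalog = view.default_catalog or session.current_catalog
    namespace = view.default_namespace or session.current_namespace
    if len(parts) == 1:
        return (catalog, namespace, parts[0])
    if len(parts) == 2:
        return (catalog, parts[0], parts[1])
    if len(parts) == 3:
        return (parts[0], parts[1], parts[2])
    raise ValueError(f"unsupported identifier: {'.'.join(parts)}")


def load_validated(parts, view, session, catalogs):
    """Resolve an identifier, then check the table UUID if one was recorded.

    `catalogs` maps an engine-defined catalog name to a dict of
    (namespace, table) -> table UUID, standing in for real catalog lookups.
    """
    catalog, namespace, table = resolve(parts, view, session)
    actual_uuid = catalogs[catalog][(namespace, table)]
    expected_uuid = view.referenced_uuids.get(".".join(parts))
    if expected_uuid is not None and actual_uuid != expected_uuid:
        raise LookupError(
            f"{catalog}.{namespace}.{table} has UUID {actual_uuid}, "
            f"but the view expects {expected_uuid}"
        )
    return (catalog, namespace, table), actual_uuid
```

Under these assumptions, a view that references `db.t` keeps working when the same backend is registered under a new catalog name in the session, while a different table that happens to share the name is rejected by the UUID check.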
