Re: [DISCUSS] Table Identifiers in Iceberg View Spec

Walaa Eldin Moustafa Wed, 30 Apr 2025 17:12:09 -0700

> I think that's the lesser evil compared to Iceberg specifying how engines
> should resolve identifiers


I think this is also similar to the previous point. It is the other way
around. Right now the spec dictates how to resolve identifiers (by employing
a view-specific `default-catalog` field). The proposal suggests getting out
of this space and letting engines handle them the same way they handle all
other identifiers.
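
To make the contrast concrete, here is a minimal Python sketch of the two
resolution models as they are described in this thread. The `TableRef` type
and both function names are hypothetical and exist only for illustration;
they are not part of the Iceberg spec or of any engine API.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class TableRef:
    """Hypothetical table reference; catalog is None when the SQL text omits it."""
    namespace: str
    name: str
    catalog: Optional[str] = None


def resolve_current_spec(ref, view_catalog, default_catalog=None):
    # Current spec as described in this thread: an unqualified reference uses
    # the view's `default-catalog` if it is set, otherwise the catalog the
    # view itself was loaded from.
    if ref.catalog is not None:
        return ref
    chosen = default_catalog if default_catalog is not None else view_catalog
    return replace(ref, catalog=chosen)


def resolve_proposed(ref, session_catalog):
    # Proposal: an unqualified reference uses the engine's current catalog at
    # query time, the same as an identifier in a standalone query.
    if ref.catalog is not None:
        return ref
    return replace(ref, catalog=session_catalog)


# `sales.orders` referenced inside a view stored in catalogA, queried from a
# session whose current catalog is catalogB.
orders = TableRef(namespace="sales", name="orders")
in_view_today = resolve_current_spec(orders, view_catalog="catalogA")
in_view_proposed = resolve_proposed(orders, session_catalog="catalogB")
```

Under these assumptions the same `sales.orders` reference lands in catalogA
with today's rules and in catalogB with the proposed ones, which is the
difference being debated.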

On Wed, Apr 30, 2025 at 5:07 PM Walaa Eldin Moustafa <wa.moust...@gmail.com>
wrote:

> > I thought "default-catalog" could be set via the USE command.
>
> Benny, I think this is a misconception or miscommunication. The USE
> command has no impact on the `default-catalog` field. In fact, the
> proposal's direction is exactly to establish that the USE command should
> influence how tables are resolved, the same as everywhere else. Right now
> that is not the case under the current spec.
>
>
> On Wed, Apr 30, 2025 at 3:17 PM Benny Chow <btc...@gmail.com> wrote:
>
>> > there is no SQL construct today to explicitly set default-catalog
>>
>> I thought "default-catalog" could be set via the USE command.
>>
>> I generally agree with Dan about requiring consistent catalog names.  I
>> think that's the lesser evil compared to Iceberg specifying how engines
>> should resolve identifiers.  Another thing to consider is that identifier
>> resolution can be very expensive at query validation time if identifiers
>> need to be looked up from a bunch of places.  Hopefully, it should be
>> possible to define a view in such a way that identifiers can be resolved on
>> the first try.
>>
>> Benny
>>
>> On Tue, Apr 29, 2025 at 10:29 PM Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>>
>>> Hi Rishabh,
>>>
>>> You're right that the proposal touches on two aspects, and resolution
>>> rules are one of them. The other aspect is the proposal's position that
>>> table identifiers should be stored in metadata exactly as they appear in
>>> the view text (e.g., even if they're two-part or partially qualified),
>>> along with their corresponding UUIDs for validation. This applies both to
>>> referenced input tables and the storage table identifier in materialized
>>> views.
>>>
>>> We may be able to converge on this storage format even if we haven't yet
>>> converged on the resolution fallback rules. I believe both resolution
>>> strategies currently being discussed would still lead to storing
>>> identifiers in this way.
>>>
>>> I'm supportive of moving forward with consensus on the identifier
>>> storage format. That said, we may continue to run into questions related to
>>> resolution during implementation. For example: Should the storage table
>>> identifier follow the same default-catalog and default-namespace resolution
>>> behavior as other table references?
>>>
>>> Thanks,
>>> Walaa.
>>>
>>> On Tue, Apr 29, 2025 at 10:07 PM Rishabh Bhatia <
>>> bhatiarishab...@gmail.com> wrote:
>>>
>>>> Hello Walaa,
>>>>
>>>> Thanks for starting this discussion.
>>>>
>>>> I think we should decouple at least the MV Spec from the proposal to
>>>> change the current behavior of view resolution.
>>>>
>>>> We can continue having the discussion if the current view spec needs to
>>>> be changed or not. Based on the decision at a later point if required we
>>>> can update the view resolution rule.
>>>>
>>>>
>>>> Thanks,
>>>> Rishabh
>>>>
>>>> On Mon, Apr 28, 2025 at 3:22 PM Walaa Eldin Moustafa <
>>>> wa.moust...@gmail.com> wrote:
>>>>
>>>>> Correction of typo: both engines seem to set default-catalog to the
>>>>> view catalog if it is defined, or to null if the view catalog is not
>>>>> defined.
>>>>>
>>>>> On Mon, Apr 28, 2025 at 3:06 PM Walaa Eldin Moustafa <
>>>>> wa.moust...@gmail.com> wrote:
>>>>>
>>>>>> Hi Dan,
>>>>>>
>>>>>> Thanks again for your response.
>>>>>>
>>>>>> I agree that catalog renaming is an environmental event, but it's a
>>>>>> real one that happens frequently in practice.
>>>>>> Saying that the Iceberg spec cannot accommodate something as common
>>>>>> as catalog renaming feels very restrictive, and could make the spec less
>>>>>> practical, even unusable, for real-world deployments.
>>>>>> I’m sharing this from the perspective of a large data lake
>>>>>> environment where views are heavily deployed and operationalized.
>>>>>>
>>>>>> Further, it's worth noting that the table spec is resilient to
>>>>>> catalog renaming, but the view spec is not. If we have an opportunity to
>>>>>> make the view spec similarly resilient, I wonder why not?
>>>>>> Both specifications are deterministic in their definition, but one is
>>>>>> more fragile to environmental changes than the other. Improving
>>>>>> resilience does not sacrifice determinism. It simply makes views safer
>>>>>> and more portable over time.
>>>>>>
>>>>>> Separately, given that there is no SQL construct today to explicitly
>>>>>> set default-catalog at creation time, what is the intuition behind how
>>>>>> engines like Spark and Trino currently assign default-catalog?
>>>>>> Today, both engines seem to set default-catalog to null if the view
>>>>>> catalog is defined, or to the view catalog if not.
>>>>>> What was the intended thought process behind this behavior?
>>>>>>
>>>>>> Thanks,
>>>>>> Walaa
>>>>>>
>>>>>> On Mon, Apr 28, 2025 at 1:33 PM Daniel Weeks <dwe...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Walaa,
>>>>>>>
>>>>>>> > tables inside views remain reachable after a catalog rename
>>>>>>>
>>>>>>> This problem stems from the exact environmental/configuration issue
>>>>>>> that we should not be trying to address.  I don't think we would expect
>>>>>>> references to survive a catalog rename.  That's not something covered by
>>>>>>> the spec and needs to be handled separately as a platform-level
>>>>>>> migration specific to the affected environment.
>>>>>>>
>>>>>>> The identifier resolution logic is clear and deterministic.  It
>>>>>>> should not matter whether an engine resolves and encodes the
>>>>>>> default-catalog or leaves it to the resolution rules.
>>>>>>>
>>>>>>> The issue isn't with how the spec is defined, but rather view
>>>>>>> behavior when you start altering the environment around it, which isn't
>>>>>>> something we should be trying to define here.
>>>>>>>
>>>>>>> -Dan
>>>>>>>
>>>>>>> On Mon, Apr 28, 2025 at 12:17 PM Walaa Eldin Moustafa <
>>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Dan,
>>>>>>>>
>>>>>>>> Thanks for chiming in.
>>>>>>>>
>>>>>>>> I believe the issues we’re seeing now go beyond just catalog naming
>>>>>>>> consistency. The behavior around default-catalog itself introduces
>>>>>>>> resolution inconsistencies even when catalog names are consistent.
>>>>>>>> For example:
>>>>>>>>
>>>>>>>> * When default-catalog is set to null, tables inside views remain
>>>>>>>> reachable after a catalog rename. But if it is set to a non-null value,
>>>>>>>> table references will break.
>>>>>>>>
>>>>>>>> * default-catalog causes table references inside views to be early
>>>>>>>> bound (i.e., bound at view creation time, especially when using a
>>>>>>>> non-null value), while table references inside standalone queries are
>>>>>>>> late bound (bound at query time). This creates inconsistencies when
>>>>>>>> resolving the same table name inside and outside views, even within
>>>>>>>> the same job.
>>>>>>>>
>>>>>>>> * It causes Spark's and Trino's behavior to drift from the spec.
>>>>>>>> There is no way to fully align Spark's behavior without making invasive
>>>>>>>> changes to the Spark SQL grammar and the View DataSource API
>>>>>>>> (specifically on the CREATE side). This challenge would extend to other
>>>>>>>> engines too. Both Spark and Trino set this field based on a heuristic
>>>>>>>> in today's implementation.
>>>>>>>>
>>>>>>>> * With view nesting (views depending on views), these
>>>>>>>> inconsistencies amplify further, forcing users and engines to reason
>>>>>>>> about catalog resolution at every level in the view tree.
>>>>>>>>
>>>>>>>> * It will be difficult to migrate Hive views to Iceberg with that
>>>>>>>> model. Migrated Hive views will have to deviate from that spec.
>>>>>>>>
>>>>>>>> How would you suggest approaching the engine-level changes required
>>>>>>>> to support the current default-catalog field?
>>>>>>>> Also, do you believe the Spark and Trino communities would align
>>>>>>>> around having table resolution behave inconsistently between queries
>>>>>>>> and views, or between Iceberg views and other types of views?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Walaa
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Apr 28, 2025 at 11:34 AM Daniel Weeks <dwe...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I would agree with Jan's summary of why 'default-catalog' was
>>>>>>>>> introduced, but I think we need to step back and align on what we are
>>>>>>>>> really attempting to support in the spec.
>>>>>>>>>
>>>>>>>>> The issues we're discussing largely stem from using multiple
>>>>>>>>> engines with cross-catalog references and configurations where catalog
>>>>>>>>> names are not aligned.  If we have multiple engines that all have the
>>>>>>>>> same catalog names/configurations, the current spec implementation is
>>>>>>>>> well defined for table resolution even across catalogs.  The
>>>>>>>>> 'default-catalog' (and namespace equivalent) was intended to address
>>>>>>>>> the resolution within the context of the SQL text, not to address
>>>>>>>>> catalog/naming inconsistencies.
>>>>>>>>>
>>>>>>>>> I feel like we're trying to adapt the original intent to address
>>>>>>>>> the catalog naming/configuration and would argue that we shouldn't
>>>>>>>>> attempt to do that as part of the spec.  Inconsistently named catalogs
>>>>>>>>> are a reality, but we should consider that a
>>>>>>>>> configuration/environmental issue, not something to solve for in the
>>>>>>>>> spec.
>>>>>>>>>
>>>>>>>>> We should support and advocate for consistency in catalog naming
>>>>>>>>> and define the spec along those lines.  The fact is that with all of
>>>>>>>>> the recent work that's gone into making catalogs pluggable, it makes
>>>>>>>>> more sense to just register catalog configuration with consistent names
>>>>>>>>> (even if you have to duplicate the configuration for supporting
>>>>>>>>> existing readers/writers).  I think it's better to provide a path
>>>>>>>>> toward consistency than to normalize complicated schemes to work around
>>>>>>>>> the issues caused by environmental/configuration inconsistencies.
>>>>>>>>>
>>>>>>>>> If the goal is to create clever ways to hack the late binding
>>>>>>>>> resolution to swap in different catalogs or make references contextual,
>>>>>>>>> I feel like that is something we should strongly discourage as it leads
>>>>>>>>> to confusion about what is resolved as part of the query.
>>>>>>>>>
>>>>>>>>> At this point, I don't see a good argument to add
>>>>>>>>> additional configuration or change the resolution behaviors.
>>>>>>>>>
>>>>>>>>> -Dan
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Apr 28, 2025 at 12:40 AM Jan Kaul
>>>>>>>>> <jank...@mailbox.org.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> I think the intention with the "default-catalog" was that every
>>>>>>>>>> query engine uses it to store its session default catalog at the time
>>>>>>>>>> of creating the view. This way the view could be reused in another
>>>>>>>>>> session. The idea was not to introduce an additional SQL syntax to set
>>>>>>>>>> the default-catalog.
>>>>>>>>>>
>>>>>>>>>> Generally we have different environments we want to support with
>>>>>>>>>> the view spec:
>>>>>>>>>>
>>>>>>>>>> 1. Consistent catalog naming
>>>>>>>>>>
>>>>>>>>>> When the environment supports it, using consistent catalog names
>>>>>>>>>> can have a great benefit for multi-catalog, multi-engine setups. With
>>>>>>>>>> consistent catalog names, using the "default-catalog" field works
>>>>>>>>>> without any issues.
>>>>>>>>>>
>>>>>>>>>> 2. Inconsistent catalog naming
>>>>>>>>>>
>>>>>>>>>> This can be the case when different query engines refer to the
>>>>>>>>>> same physical catalog by different names. This often happens because
>>>>>>>>>> different query engines use different strategies to set up the
>>>>>>>>>> catalogs. If catalogs have inconsistent naming, using the
>>>>>>>>>> "default-catalog" field does not work because it is not guaranteed
>>>>>>>>>> that the catalog name can be resolved with another engine. Using the
>>>>>>>>>> "view catalog" as a fallback is a better solution for this use case,
>>>>>>>>>> as it avoids catalog names altogether. It is, however, limited to
>>>>>>>>>> table references in the same catalog.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> What do you think of introducing a view property that specifies
>>>>>>>>>> whether the "default-catalog" or the "view catalog" should be used?
>>>>>>>>>> This way, you could use the "default-catalog" in environments where
>>>>>>>>>> you can guarantee consistent naming, but you would be able to fall
>>>>>>>>>> back directly to the "view catalog" when you don't have consistent
>>>>>>>>>> naming. The query engines could set the default for this view property
>>>>>>>>>> at creation time. Spark, for example, could set it to automatically
>>>>>>>>>> use the "view catalog".
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>> Jan
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 4/26/25 05:33, Walaa Eldin Moustafa wrote:
>>>>>>>>>>
>>>>>>>>>> To help folks catch up on the latest discussions and
>>>>>>>>>> interpretation of the spec, I have summarized everything we discussed
>>>>>>>>>> so far at the top of the proposal document (here
>>>>>>>>>> <https://docs.google.com/document/d/1-I2v_OqBgJi_8HVaeH1u2jowghmXoB8XaJLzPBa_Hg8/edit?tab=t.0>).
>>>>>>>>>> I have slightly updated the proposal to be in sync with the new
>>>>>>>>>> interpretation to avoid confusion. In summary:
>>>>>>>>>>
>>>>>>>>>> * Remove default-catalog and default-namespace fields from the
>>>>>>>>>> view spec completely.
>>>>>>>>>>
>>>>>>>>>> * Hence, we do not attempt to define separate view-level default
>>>>>>>>>> catalogs or namespaces.
>>>>>>>>>>
>>>>>>>>>> Instead:
>>>>>>>>>>
>>>>>>>>>> * If a table identifier inside a view lacks a catalog qualifier,
>>>>>>>>>> engines should resolve it using the current engine catalog at query
>>>>>>>>>> time.
>>>>>>>>>>
>>>>>>>>>> * Reference table identifiers in the metadata exactly as they
>>>>>>>>>> appear in the view SQL text.
>>>>>>>>>>
>>>>>>>>>> * If an identifier lacks the catalog part at creation, it should
>>>>>>>>>> still lack a catalog in the stored metadata.
>>>>>>>>>>
>>>>>>>>>> * Store UUIDs alongside table identifiers whenever possible.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Walaa.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Apr 25, 2025 at 5:18 PM Walaa Eldin Moustafa <
>>>>>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks for the contribution Benny! +1 to the confusion the
>>>>>>>>>>> fallback creates. Also just to be clear, at this point and after
>>>>>>>>>>> clarifying the current spec intentions, I am convinced that we should
>>>>>>>>>>> remove the default catalog and default namespace fields altogether.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Walaa.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Apr 25, 2025 at 5:13 PM Benny Chow <btc...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I'd like to contribute my opinions on this:
>>>>>>>>>>>>
>>>>>>>>>>>> - I don't particularly like the current behavior of "default to
>>>>>>>>>>>> the view's catalog when default-catalog is not set".  Fundamentally,
>>>>>>>>>>>> I believe the intent of default-catalog and default-namespace is to
>>>>>>>>>>>> help users write more concise SQL.
>>>>>>>>>>>> - The Spark session catalog is engine specific and I don't think we
>>>>>>>>>>>> should design something that says first use this catalog, then that
>>>>>>>>>>>> catalog.. or that catalog.  For example, resolving identifiers using
>>>>>>>>>>>> default-catalog -> view's catalog -> session catalog is not good.
>>>>>>>>>>>> - We gotta support non-Iceberg tables, otherwise I see no value
>>>>>>>>>>>> in putting views in the catalog to share with other engines.
>>>>>>>>>>>> - Interoperability between different engine types is very hard
>>>>>>>>>>>> due to dialect issues... so I think we should focus on supporting
>>>>>>>>>>>> different clusters of the same engine type on a shared catalog.  For
>>>>>>>>>>>> example, AI and BI clusters on Spark sharing the same views in a REST
>>>>>>>>>>>> catalog.
>>>>>>>>>>>>
>>>>>>>>>>>> Coincidentally, I think the ultimate solution is along the
>>>>>>>>>>>> lines of something Russell proposed last year:
>>>>>>>>>>>>
>>>>>>>>>>>> https://lists.apache.org/thread/hoskfx8y3kvrcww52l4w9dxghp3pnlm7
>>>>>>>>>>>>
>>>>>>>>>>>> We've been looking at this interoperable identifier problem
>>>>>>>>>>>> through the lens of catalog resolution but maybe the right approach
>>>>>>>>>>>> is really about templating.
>>>>>>>>>>>>
>>>>>>>>>>>> I would extend Russell's idea to allow identifiers in a view to
>>>>>>>>>>>> span catalogs to support non-Iceberg tables.   Also, the
>>>>>>>>>>>> default-catalog property could be templated as well.
>>>>>>>>>>>>
>>>>>>>>>>>> Thoughts?
>>>>>>>>>>>> Benny
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Apr 25, 2025 at 4:02 PM Walaa Eldin Moustafa <
>>>>>>>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks Steven! How do you recommend making the Spark
>>>>>>>>>>>>> implementation conform to the spec? Do we need Spark SQL extensions
>>>>>>>>>>>>> and/or Spark catalog APIs for that?
>>>>>>>>>>>>>
>>>>>>>>>>>>> How do you recommend reconciling the inconsistencies I shared
>>>>>>>>>>>>> regarding many resolution methods not consistently being followed in
>>>>>>>>>>>>> different scenarios (view vs child table resolution, query vs view
>>>>>>>>>>>>> resolution)? Note these occur when the default catalog is set to a
>>>>>>>>>>>>> non-null value. If it helps, I can share concrete examples.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Walaa.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Apr 25, 2025 at 3:52 PM Steven Wu <
>>>>>>>>>>>>> stevenz...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> The core issue is the fallback behavior when `default-catalog`
>>>>>>>>>>>>>> is not defined. The current view spec says the fallback should be the
>>>>>>>>>>>>>> catalog where the view is defined. It doesn't really matter what the
>>>>>>>>>>>>>> catalog is named (catalogX) by the read engine.
>>>>>>>>>>>>>> - If a view refers to the tables in the same catalog, this is an
>>>>>>>>>>>>>> unambiguous and reasonable fallback behavior.
>>>>>>>>>>>>>> - If a view refers to tables from another catalog, catalog names
>>>>>>>>>>>>>> should be included in the reference name already. So no ambiguity
>>>>>>>>>>>>>> there either.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Potential inconsistent naming of catalogs is a separate problem,
>>>>>>>>>>>>>> which the Iceberg view spec probably cannot solve. We can only
>>>>>>>>>>>>>> recommend that catalogs be named consistently across usage for better
>>>>>>>>>>>>>> interoperability on name references.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This proposal is to change the fallback behavior to the engine's
>>>>>>>>>>>>>> session default catalog. I am not sure it is better than the current
>>>>>>>>>>>>>> fallback behavior.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> > Today’s Spark behavior explicitly differs from this idea.
>>>>>>>>>>>>>> > Spark resolves table identifiers during view creation using the
>>>>>>>>>>>>>> > session’s default catalog, not a supplied `default-catalog`.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would argue that is a Spark implementation issue for not
>>>>>>>>>>>>>> conforming to the spec.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Apr 25, 2025 at 1:17 PM Walaa Eldin Moustafa
>>>>>>>>>>>>>> <wa.moust...@gmail.com> wrote:
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Hi Jan,
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Thanks again for continuing the discussion. I want to
>>>>>>>>>>>>>> > highlight a few fundamental issues around the interpretation of
>>>>>>>>>>>>>> > default-catalog:
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Here is the real catch:
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > * default-catalog cannot logically be defined at view
>>>>>>>>>>>>>> > creation time. It would be circular: the view needs to exist before
>>>>>>>>>>>>>> > its metadata (and hence default-catalog) can exist. This is visible
>>>>>>>>>>>>>> > in Spark’s implementation, where `default-catalog` is not used.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > * Introducing a creation-time default-catalog setting would
>>>>>>>>>>>>>> > require extending SQL syntax and engine APIs to promote it to a
>>>>>>>>>>>>>> > first-class view concept. This would be intrusive, non-intuitive,
>>>>>>>>>>>>>> > and realistically very difficult to standardize across engines.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > * Today’s Spark behavior explicitly differs from this idea.
>>>>>>>>>>>>>> > Spark resolves table identifiers during view creation using the
>>>>>>>>>>>>>> > session’s default catalog, not a supplied `default-catalog`.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > * Hypothetically, even if we patched in a creation-time
>>>>>>>>>>>>>> > default-catalog, it would create an inconsistent binding model
>>>>>>>>>>>>>> > between tables vs views (early vs late), and between tables in views
>>>>>>>>>>>>>> > and in queries (again early vs late). For example, views and tables
>>>>>>>>>>>>>> > in queries can withstand default catalog renames, but tables cannot
>>>>>>>>>>>>>> > when they are used inside views -- it even applies to views inside
>>>>>>>>>>>>>> > views, which makes this very hard to reason about considering
>>>>>>>>>>>>>> > nesting.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Thanks,
>>>>>>>>>>>>>> > Walaa
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > On Fri, Apr 25, 2025 at 7:00 AM Jan Kaul
>>>>>>>>>>>>>> > <jank...@mailbox.org.invalid> wrote:
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> @Walaa:
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> I would argue that when you run a CREATE VIEW statement
>>>>>>>>>>>>>> >> the query engine knows which catalog the view is being created in.
>>>>>>>>>>>>>> >> So even though we typically use late binding to resolve the view
>>>>>>>>>>>>>> >> catalog at query time, it can also be used at creation time.
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> The query engine would need to keep track of the "view
>>>>>>>>>>>>>> >> catalog" in which the view is going to be created. It can use that
>>>>>>>>>>>>>> >> catalog to resolve partial table identifiers if "default-catalog" is
>>>>>>>>>>>>>> >> not set.
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> It can lead to some unintuitive behavior, where partial
>>>>>>>>>>>>>> >> identifiers in the view query resolve to a different catalog
>>>>>>>>>>>>>> >> compared to using them outside of a view.
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> CREATE VIEW catalogA.sales.monthly_orders AS SELECT * FROM sales.orders;
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> If the session default catalog is not "catalogA", the
>>>>>>>>>>>>>> >> "sales.orders" in the view query would not be the same as just
>>>>>>>>>>>>>> >> referencing "sales.orders" in a normal SQL statement. This is
>>>>>>>>>>>>>> >> because without a "default-catalog", the catalog name of
>>>>>>>>>>>>>> >> "sales.orders" would default to "catalogA", which is the view's
>>>>>>>>>>>>>> >> catalog.
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> Thanks,
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> Jan
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> On 4/25/25 04:05, Manu Zhang wrote:
>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>> >>> For example, if we want to validate that the tables
>>>>>>>>>>>>>> >>> referenced in the view exist, how can we do that when
>>>>>>>>>>>>>> >>> default-catalog isn't defined, since the view hasn't been created
>>>>>>>>>>>>>> >>> or loaded yet?
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> I don't think this is related to the view spec. How do we
>>>>>>>>>>>>>> >> validate that a table exists without a default catalog, or do we
>>>>>>>>>>>>>> >> always use the current session catalog?
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> Thanks,
>>>>>>>>>>>>>> >> Manu
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> On Fri, Apr 25, 2025 at 5:59 AM Walaa Eldin Moustafa <
>>>>>>>>>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>> >>> Hi Jan,
>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>> >>> I think we still share the same understanding. Just to
>>>>>>>>>>>>>> >>> clarify: when I referred to late binding as “similar” to the
>>>>>>>>>>>>>> >>> proposal, I was acknowledging the distinction between view-level
>>>>>>>>>>>>>> >>> and table-level resolution. But as you noted, both follow a late
>>>>>>>>>>>>>> >>> binding model.
>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>> >>> That said, this still raises an interesting question and
>>>>>>>>>>>>>> >>> a potential gap: if default-catalog is only defined at query time,
>>>>>>>>>>>>>> >>> how should resolution work during view creation? For example, if we
>>>>>>>>>>>>>> >>> want to validate that the tables referenced in the view exist, how
>>>>>>>>>>>>>> >>> can we do that when default-catalog isn't defined, since the view
>>>>>>>>>>>>>> >>> hasn't been created or loaded yet?
>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>> >>> Thanks,
>>>>>>>>>>>>>> >>> Walaa.
>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>> >>> On Thu, Apr 24, 2025 at 7:02 AM Jan Kaul
>>>>>>>>>>>>>> >>> <jank...@mailbox.org.invalid> wrote:
>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>> >>>> Yes, I have the same understanding. The view catalog is
>>>>>>>>>>>>>> >>>> resolved at query time.
>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>> >>>> As you mentioned before, it's good to distinguish
>>>>>>>>>>>>>> >>>> between the physical catalog and its reference used in SQL
>>>>>>>>>>>>>> >>>> statements. The important part is that the physical catalog of the
>>>>>>>>>>>>>> >>>> view and the tables referenced in its definition stay consistent.
>>>>>>>>>>>>>> >>>> You could create a view in a given physical catalog by referring to
>>>>>>>>>>>>>> >>>> it as "catalogA", as in your first point. If you then, given a
>>>>>>>>>>>>>> >>>> different setup, refer to the same physical catalog as "catalogB"
>>>>>>>>>>>>>> >>>> in another session/environment, the view should still work.
>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>> >>>> I would however rephrase your last point. Late binding
>>>>>>>>>>>>>> >>>> applies to the view catalog name and by extension to all partial
>>>>>>>>>>>>>> >>>> table references when no "default-catalog" is present. Resolving
>>>>>>>>>>>>>> >>>> the view catalog name at query time is not opposed to storing the
>>>>>>>>>>>>>> >>>> view metadata in a catalog.
>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>> >>>> Or maybe I don't entirely understand what you mean.
>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>> >>>> Thanks
>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>> >>>> Jan
>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>> >>>> On 4/24/25 00:32, Walaa Eldin Moustafa wrote:
>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>> >>>> Hi Jan,
>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>> >>>> > The view is executed when it's being referenced in a
>>>>>>>>>>>>>> >>>> > SQL statement. That statement contains the information for the
>>>>>>>>>>>>>> >>>> > query engine to resolve the catalog of the view.
>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>> >>>> If I’m understanding correctly, that means:
>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>> >>>> * If the view is queried as SELECT * FROM
>>>>>>>>>>>>>> >>>> catalogA.namespace.view, then catalogA is considered the view’s
>>>>>>>>>>>>>> >>>> catalog.
>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>> >>>> * If the same view is later queried as SELECT * FROM
>>>>>>>>>>>>>> >>>> catalogB.namespace.view (after renaming catalogA to catalogB, and
>>>>>>>>>>>>>> >>>> keeping everything else the same), then catalogB becomes the view’s
>>>>>>>>>>>>>> >>>> catalog.
>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>> >>>> Is that interpretation correct? If so, it sounds to me
>>>>>>>>>>>>>> >>>> like the catalog is resolved at query time, based on how the view
>>>>>>>>>>>>>> >>>> is referenced, not from any stored metadata. That would imply some
>>>>>>>>>>>>>> >>>> sort of late binding behavior (similar to the proposal), as opposed
>>>>>>>>>>>>>> >>>> to using some catalog that "stores" the view definition.
>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>> >>>> Thanks,
>>>>>>>>>>>>>> >>>> Walaa
>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>> >>>> On Tue, Apr 22, 2025 at 11:01 AM Jan Kaul
>>>>>>>>>>>>>> >>>> <jank...@mailbox.org.invalid> wrote:
>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>> >>>>> Hi Walaa,
>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>> >>>>> Thanks for clarifying the aspects of non-determinism.
>>>>>>>>>>>>>> >>>>> Let me try to address your questions.
>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>> >>>>> 1. This is my interpretation of the current spec: The
>>>>>>>>>>>>>> >>>>> view is executed when it's being referenced in a SQL statement.
>>>>>>>>>>>>>> >>>>> That statement contains the information for the query engine to
>>>>>>>>>>>>>> >>>>> resolve the catalog of the view. The query engine then uses that
>>>>>>>>>>>>>> >>>>> information to fetch the view metadata from the catalog. It also
>>>>>>>>>>>>>> >>>>> needs to temporarily keep track of which catalog it used to fetch
>>>>>>>>>>>>>> >>>>> the view metadata. It can then use that information to resolve the
>>>>>>>>>>>>>> >>>>> table references in the view's SQL definition in case no default
>>>>>>>>>>>>>> >>>>> catalog is specified.
>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>> >>>>> 2. The important part is that the catalog can be
>>>>>>>>>>>>>> >>>>> referenced at execution time. As long as that's the case I would
>>>>>>>>>>>>>> >>>>> assume the view can be created in any catalog.
>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>> >>>>> I think your point is really valuable because the
>>>>>>>>>>>>>> >>>>> current specification can lead to some unintuitive behavior. For
>>>>>>>>>>>>>> >>>>> example, for the following statement:
>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>> >>>>> CREATE VIEW catalogA.sales.monthly_orders AS SELECT * FROM sales.orders;
>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>> >>>>> If the session default catalog is not "catalogA", the
>>>>>>>>>>>>>> >>>>> "sales.orders" in the view query would not be the same as just
>>>>>>>>>>>>>> >>>>> referencing "sales.orders" in a normal SQL statement. This is
>>>>>>>>>>>>>> >>>>> because without a "default-catalog", the catalog name of
>>>>>>>>>>>>>> >>>>> "sales.orders" would default to "catalogA".
>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>> >>>>> However, I like the current design of the view spec,
>>>>>>>>>>>>>> >>>>> because it has the "closure" property. Because the "view catalog"
>>>>>>>>>>>>>> >>>>> has to be known when executing a view, all the information required
>>>>>>>>>>>>>> >>>>> to resolve the table identifiers is contained in the view metadata
>>>>>>>>>>>>>> >>>>> (and the "view catalog"). I think that if you make the identifier
>>>>>>>>>>>>>> >>>>> resolution dependent on external parameters, it hinders
>>>>>>>>>>>>>> >>>>> portability.
>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>> >>>>> Thanks,
>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>> >>>>> Jan
>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>> >>>>> On 4/22/25 18:36, Walaa Eldin Moustafa wrote:
>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>> >>>>> Hi Jan,
>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>> >>>>> Thanks for the thoughtful feedback.
>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>> >>>>> I think it’s important we clarify a key point before
>>>>>>>>>>>>>> going deeper:
>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>> >>>>> Non-determinism is not caused by session fallback
>>>>>>>>>>>>>> behavior—it’s a fundamental limitation of using table 
>>>>>>>>>>>>>> identifiers alone,
>>>>>>>>>>>>>> regardless of whether we use the current rule, the proposed 
>>>>>>>>>>>>>> fallback to the
>>>>>>>>>>>>>> session’s default catalog, or even early vs. late binding.
>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>> >>>>> The same fully qualified identifier (e.g.,
>>>>>>>>>>>>>> catalogA.namespace.table) can resolve to different objects 
>>>>>>>>>>>>>> depending solely
>>>>>>>>>>>>>> on engine-specific routing logic or catalog aliases. So 
>>>>>>>>>>>>>> determinism isn’t
>>>>>>>>>>>>>> guaranteed just because an identifier is "fully qualified." The 
>>>>>>>>>>>>>> only
>>>>>>>>>>>>>> reliable anchor for identity is the UUID. That’s why the 
>>>>>>>>>>>>>> proposed use of
>>>>>>>>>>>>>> UUIDs is not just a hardening strategy. It’s the actual fix for 
>>>>>>>>>>>>>> correctness.
>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>> >>>>> To move the conversation forward, could you help
>>>>>>>>>>>>>> clarify two things in the context of the current spec:
>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>> >>>>> * Where in the metadata is the “view catalog” stored,
>>>>>>>>>>>>>> so that an engine knows to fall back to it if default-catalog is 
>>>>>>>>>>>>>> null?
>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>> >>>>> * Are we even allowed to create views in the session's
>>>>>>>>>>>>>> default catalog (i.e., without specifying a catalog) in the 
>>>>>>>>>>>>>> current Iceberg
>>>>>>>>>>>>>> spec?
>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>> >>>>> These questions are important because if we can’t
>>>>>>>>>>>>>> unambiguously recover the "view catalog" from metadata, then 
>>>>>>>>>>>>>> defaulting to
>>>>>>>>>>>>>> it is problematic. And if views can't be created in the default 
>>>>>>>>>>>>>> catalog,
>>>>>>>>>>>>>> then the fallback rule doesn’t generalize.
>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>> >>>>> Thanks,
>>>>>>>>>>>>>> >>>>> Walaa.
>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>> >>>>> On Tue, Apr 22, 2025 at 3:14 AM Jan Kaul
>>>>>>>>>>>>>> <jank...@mailbox.org.invalid>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>> >>>>>> Hi Walaa,
>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>> >>>>>> thank you for your proposal. If I understood
>>>>>>>>>>>>>> correctly, your proposal is composed of three parts:
>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>> >>>>>> - session default catalog as fallback for
>>>>>>>>>>>>>> "default-catalog"
>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>> >>>>>> - session default namespace as fallback for
>>>>>>>>>>>>>> "default-namespace"
>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>> >>>>>> - Late binding + UUID validation
>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>> >>>>>> I have some comments regarding these points.
>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>> >>>>>> 1. Session default catalog as fallback for
>>>>>>>>>>>>>> "default-catalog"
>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>> >>>>>> Introducing a behavior that depends on the current
>>>>>>>>>>>>>> session setup is in my opinion the definition of 
>>>>>>>>>>>>>> "non-determinism". You
>>>>>>>>>>>>>> could be running the same query-engine and catalog-setup on 
>>>>>>>>>>>>>> different days,
>>>>>>>>>>>>>> with different default session catalogs (which is rather 
>>>>>>>>>>>>>> common), and would
>>>>>>>>>>>>>> be getting different results.
>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>> >>>>>> Whereas with the current behavior, the view always
>>>>>>>>>>>>>> produces the same results. The current behavior has some rough 
>>>>>>>>>>>>>> edges in
>>>>>>>>>>>>>> very niche use cases, but I think it is solid for most use cases.
>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>> >>>>>> 2. Session default namespace as fallback for
>>>>>>>>>>>>>> "default-namespace"
>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>> >>>>>> Similar to the above.
>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>> >>>>>> 3. Late binding + UUID validation
>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>> >>>>>> If I understand it correctly, the current
>>>>>>>>>>>>>> implementation already uses late binding.
>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>> >>>>>> Generally, having UUID validation makes the setup more
>>>>>>>>>>>>>> robust. Which is great. However, having UUID validation still 
>>>>>>>>>>>>>> requires us
>>>>>>>>>>>>>> to have a portable table identifier specification. Even if we 
>>>>>>>>>>>>>> have the
>>>>>>>>>>>>>> UUIDs of the referenced tables from the view, there simply isn't 
>>>>>>>>>>>>>> an
>>>>>>>>>>>>>> interface that lets us use those UUIDs. The catalog interface 
>>>>>>>>>>>>>> is defined
>>>>>>>>>>>>>> in terms of table identifiers.
>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>> >>>>>> So we always require a working catalog setup and
>>>>>>>>>>>>>> suitable table identifiers to obtain the table metadata. We can 
>>>>>>>>>>>>>> use the
>>>>>>>>>>>>>> UUIDs to verify if we loaded the correct table. But this can 
>>>>>>>>>>>>>> only be done
>>>>>>>>>>>>>> after we used some identifier. Which means there is no way of 
>>>>>>>>>>>>>> using UUIDs
>>>>>>>>>>>>>> without a functioning catalog/identifier setup.
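
A minimal sketch of the ordering described above, assuming a catalog that can only be queried by identifier. `load_and_validate`, `load_table`, `TableUuidMismatch`, and the toy `TABLES` mapping are hypothetical and exist only for illustration; `table-uuid` is the field name used in Iceberg table metadata.

```python
from typing import Callable, Dict


class TableUuidMismatch(Exception):
    """Raised when the loaded table is not the one the view was defined against."""


def load_and_validate(
    load_table: Callable[[str], Dict[str, str]],
    identifier: str,
    expected_uuid: str,
) -> Dict[str, str]:
    # The lookup itself still has to go through the identifier ...
    table = load_table(identifier)
    # ... and the UUID can only be compared after the table has been loaded.
    if table["table-uuid"] != expected_uuid:
        raise TableUuidMismatch(
            f"{identifier}: expected {expected_uuid}, got {table['table-uuid']}"
        )
    return table


# Toy stand-in for a catalog: identifier -> table metadata.
TABLES = {"catalogA.sales.orders": {"table-uuid": "uuid-1"}}


def load_table(identifier: str) -> Dict[str, str]:
    return TABLES[identifier]


table = load_and_validate(load_table, "catalogA.sales.orders", "uuid-1")
print(table["table-uuid"])  # uuid-1
```

The UUID acts as a check on the result of the lookup; it does not replace the identifier as the lookup key.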
>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>> >>>>>> In conclusion, I prefer the current behavior for
>>>>>>>>>>>>>> "default-catalog" because it is more deterministic in my 
>>>>>>>>>>>>>> opinion. And I
>>>>>>>>>>>>>> think the current spec does a good job for multi-engine table 
>>>>>>>>>>>>>> identifier
>>>>>>>>>>>>>> resolution. I see the UUID validation more as an additional 
>>>>>>>>>>>>>> hardening
>>>>>>>>>>>>>> strategy.
>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>> >>>>>> Thanks
>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>> >>>>>> Jan
>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>> >>>>>> On 4/21/25 17:38, Walaa Eldin Moustafa wrote:
>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>> >>>>>> Thanks Renjie!
>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>> >>>>>> The existing spec has some guidance on resolving
>>>>>>>>>>>>>> catalogs on the fly already (to address the case of view text 
>>>>>>>>>>>>>> with table
>>>>>>>>>>>>>> identifiers missing the catalog part). The guidance is to use 
>>>>>>>>>>>>>> the catalog
>>>>>>>>>>>>>> where the view is stored. But I find this rule hard to interpret 
>>>>>>>>>>>>>> or use.
>>>>>>>>>>>>>> The catalog itself is a logical construct—such as a federated 
>>>>>>>>>>>>>> catalog that
>>>>>>>>>>>>>> delegates to multiple physical backends (e.g., HMS and REST). In 
>>>>>>>>>>>>>> such
>>>>>>>>>>>>>> cases, the catalog (e.g., `my_catalog` in 
>>>>>>>>>>>>>> `my_catalog.namespace1.table1`)
>>>>>>>>>>>>>> doesn’t physically store the tables; it only routes requests to 
>>>>>>>>>>>>>> underlying
>>>>>>>>>>>>>> stores. Therefore, defaulting identifier resolution based on the 
>>>>>>>>>>>>>> catalog
>>>>>>>>>>>>>> where the view is "stored" doesn’t align with how catalogs 
>>>>>>>>>>>>>> actually behave
>>>>>>>>>>>>>> in practice.
>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>> >>>>>> Thanks,
>>>>>>>>>>>>>> >>>>>> Walaa.
>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>> >>>>>> On Sun, Apr 20, 2025 at 11:17 PM Renjie Liu <
>>>>>>>>>>>>>> liurenjie2...@gmail.com> wrote:
>>>>>>>>>>>>>> >>>>>>>
>>>>>>>>>>>>>> >>>>>>> Hi, Walaa:
>>>>>>>>>>>>>> >>>>>>>
>>>>>>>>>>>>>> >>>>>>> Thanks for the proposal.
>>>>>>>>>>>>>> >>>>>>>
>>>>>>>>>>>>>> >>>>>>> I've reviewed the doc, but in general I have some
>>>>>>>>>>>>>> concerns with resolving catalog names on the fly with query 
>>>>>>>>>>>>>> engine defined
>>>>>>>>>>>>>> catalog names. This introduces some flexibility at first glance, 
>>>>>>>>>>>>>> but also
>>>>>>>>>>>>>> makes misconfiguration difficult to explain.
>>>>>>>>>>>>>> >>>>>>>
>>>>>>>>>>>>>> >>>>>>> But I agree with one part that we should store
>>>>>>>>>>>>>> resolved table uuid in view metadata, as table/view renaming may 
>>>>>>>>>>>>>> introduce
>>>>>>>>>>>>>> errors that are difficult for users to understand.
>>>>>>>>>>>>>> >>>>>>>
>>>>>>>>>>>>>> >>>>>>> On Sat, Apr 19, 2025 at 3:02 AM Walaa Eldin Moustafa <
>>>>>>>>>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>> >>>>>>>> Hi Everyone,
>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>> >>>>>>>> Looking forward to keeping up the momentum and
>>>>>>>>>>>>>> closing out the MV spec as well. I’m hoping we can proceed to a 
>>>>>>>>>>>>>> vote next
>>>>>>>>>>>>>> week.
>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>> >>>>>>>> Here is a summary in case that helps. The proposal
>>>>>>>>>>>>>> outlines a strategy for handling table identifiers in Iceberg 
>>>>>>>>>>>>>> view
>>>>>>>>>>>>>> metadata, with the goal of ensuring correctness, portability, 
>>>>>>>>>>>>>> and engine
>>>>>>>>>>>>>> compatibility. It recommends resolving table identifiers at read 
>>>>>>>>>>>>>> time (late
>>>>>>>>>>>>>> binding) rather than creation time, and introduces UUID-based 
>>>>>>>>>>>>>> validation to
>>>>>>>>>>>>>> maintain identity guarantees across engines or sessions. It 
>>>>>>>>>>>>>> also revises
>>>>>>>>>>>>>> how default-catalog and default-namespace are handled 
>>>>>>>>>>>>>> (defaulting both to
>>>>>>>>>>>>>> the session context if not explicitly set) to better align with 
>>>>>>>>>>>>>> engine
>>>>>>>>>>>>>> behavior and improve cross-engine interoperability.
>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>> >>>>>>>> Please let me know your thoughts.
>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>> >>>>>>>> Thanks,
>>>>>>>>>>>>>> >>>>>>>> Walaa.
>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>> >>>>>>>> On Wed, Apr 16, 2025 at 2:03 PM Walaa Eldin Moustafa
>>>>>>>>>>>>>> <wa.moust...@gmail.com> wrote:
>>>>>>>>>>>>>> >>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>> Thanks Eduard and Sung! I have addressed the
>>>>>>>>>>>>>> comments.
>>>>>>>>>>>>>> >>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>> One key point to keep in mind is that catalog names
>>>>>>>>>>>>>> in the spec refer to logical catalogs—i.e., the first part of a 
>>>>>>>>>>>>>> three-part
>>>>>>>>>>>>>> identifier. These correspond to Spark's DataSourceV2 catalogs, 
>>>>>>>>>>>>>> Trino
>>>>>>>>>>>>>> connectors, and similar constructs. This is a level of 
>>>>>>>>>>>>>> abstraction above
>>>>>>>>>>>>>> physical catalogs, which are not referenced or used in the view 
>>>>>>>>>>>>>> spec. The
>>>>>>>>>>>>>> reason is that table identifiers in the view definition/text 
>>>>>>>>>>>>>> itself refer
>>>>>>>>>>>>>> to logical catalogs, not physical ones (since they interface 
>>>>>>>>>>>>>> directly with
>>>>>>>>>>>>>> the engine and not a specific metastore).
>>>>>>>>>>>>>> >>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>> Thanks,
>>>>>>>>>>>>>> >>>>>>>>> Walaa.
>>>>>>>>>>>>>> >>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>> On Wed, Apr 16, 2025 at 6:15 AM Sung Yun <
>>>>>>>>>>>>>> sungwy...@gmail.com> wrote:
>>>>>>>>>>>>>> >>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>> Thank you Walaa for the proposal. I think view
>>>>>>>>>>>>>> portability is a very important topic for us to continue 
>>>>>>>>>>>>>> discussing as it
>>>>>>>>>>>>>> relies on many assumptions within the data ecosystem for it to 
>>>>>>>>>>>>>> function
>>>>>>>>>>>>>> like you've highlighted well in the document.
>>>>>>>>>>>>>> >>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>> I've added a few comments around how this may
>>>>>>>>>>>>>> impact the permission questions the engines will be asking, and 
>>>>>>>>>>>>>> whether
>>>>>>>>>>>>>> that is the desired behavior.
>>>>>>>>>>>>>> >>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>> Sung
>>>>>>>>>>>>>> >>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>> On Wed, Apr 16, 2025 at 7:32 AM Eduard Tudenhöfner
>>>>>>>>>>>>>> <etudenhoef...@apache.org> wrote:
>>>>>>>>>>>>>> >>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>> Thanks Walaa for tackling this problem. I've
>>>>>>>>>>>>>> added a few comments to get a better understanding of what this 
>>>>>>>>>>>>>> will look
>>>>>>>>>>>>>> like in the actual implementation.
>>>>>>>>>>>>>> >>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>> Eduard
>>>>>>>>>>>>>> >>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>> On Tue, Apr 15, 2025 at 7:09 PM Walaa Eldin
>>>>>>>>>>>>>> Moustafa <wa.moust...@gmail.com> wrote:
>>>>>>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>> Hi Everyone,
>>>>>>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>> Starting this thread to resume our discussion on
>>>>>>>>>>>>>> how to reference table identifiers from Iceberg metadata, a key 
>>>>>>>>>>>>>> aspect of
>>>>>>>>>>>>>> the view specification, particularly in relation to the MV 
>>>>>>>>>>>>>> (materialized
>>>>>>>>>>>>>> view) extensions.
>>>>>>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>> I had the chance to speak offline with a few
>>>>>>>>>>>>>> community members to better understand how the current spec is 
>>>>>>>>>>>>>> being
>>>>>>>>>>>>>> interpreted. Those conversations served as inputs to a new 
>>>>>>>>>>>>>> proposal on how
>>>>>>>>>>>>>> table identifier references could be represented in metadata.
>>>>>>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>> You can find the proposal here [1]. I look
>>>>>>>>>>>>>> forward to your feedback and working together to move this 
>>>>>>>>>>>>>> forward so we
>>>>>>>>>>>>>> can finalize the MV spec as well.
>>>>>>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>> [1]
>>>>>>>>>>>>>> https://docs.google.com/document/d/1-I2v_OqBgJi_8HVaeH1u2jowghmXoB8XaJLzPBa_Hg8/edit?tab=t.0
>>>>>>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>>>>>>> >>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> >>>>>>>>>>>> Walaa.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
