Re: [DISCUSS] Table Identifiers in Iceberg View Spec

Manu Zhang Thu, 24 Apr 2025 19:07:05 -0700

>
> For example, if we want to validate that the tables referenced in the view
> exist, how can we do that when default-catalog isn't defined, since the
> view hasn't been created or loaded yet?


I don't think this is related to the view spec. How do we validate that a table
exists without a default catalog, or do we always use the current session
catalog?

Thanks,
Manu

On Fri, Apr 25, 2025 at 5:59 AM Walaa Eldin Moustafa <wa.moust...@gmail.com>
wrote:

> Hi Jan,
>
> I think we still share the same understanding. Just to clarify: when I
> referred to late binding as “similar” to the proposal, I was acknowledging
> the distinction between view-level and table-level resolution. But as you
> noted, both follow a late binding model.
>
> That said, this still raises an interesting question and a potential gap:
> if default-catalog is only defined at query time, how should resolution
> work during view creation? For example, if we want to validate that the
> tables referenced in the view exist, how can we do that when
> default-catalog isn't defined, since the view hasn't been created or loaded
> yet?
>
> Thanks,
> Walaa.
>
> On Thu, Apr 24, 2025 at 7:02 AM Jan Kaul <jank...@mailbox.org.invalid>
> wrote:
>
>> Yes, I have the same understanding. The view catalog is resolved at query
>> time.
>>
>> As you mentioned before, it's good to distinguish between the physical
>> catalog and its reference used in SQL statements. The important part is
>> that the physical catalog of the view and the tables referenced in its
>> definition stay consistent. You could create a view in a given physical
>> catalog by referring to it as "catalogA", as in your first point. If you
>> then, given a different setup, refer to the same physical catalog as
>> "catalogB" in another session/environment, the view should still work.
>>
>> I would however rephrase your last point. Late binding applies to the
>> view catalog name and by extension to all partial table references when no
>> "default-catalog" is present. Resolving the view catalog name at query time
>> is not opposed to storing the view metadata in a catalog.
>>
>> Or maybe I don't entirely understand what you mean.
>>
>> Thanks
>>
>> Jan
>> On 4/24/25 00:32, Walaa Eldin Moustafa wrote:
>>
>> Hi Jan,
>>
>> > The view is executed when it's being referenced in a SQL statement.
>> > That statement contains the information for the query engine to resolve
>> > the catalog of the view.
>>
>> If I’m understanding correctly, that means:
>>
>> * If the view is queried as SELECT * FROM catalogA.namespace.view, then
>> catalogA is considered the view’s catalog.
>>
>> * If the same view is later queried as SELECT * FROM
>> catalogB.namespace.view (after renaming catalogA to catalogB, and keeping
>> everything else the same), then catalogB becomes the view’s catalog.
>>
>> Is that interpretation correct? If so, it sounds to me like the catalog
>> is resolved at query time, based on how the view is referenced, not from
>> any stored metadata. That would imply some sort of a late binding behavior
>> (similar to the proposal), as opposed to using some catalog that "stores"
>> the view definition.
>>
>> Thanks,
>> Walaa
>>
>> On Tue, Apr 22, 2025 at 11:01 AM Jan Kaul <jank...@mailbox.org.invalid>
>> wrote:
>>
>>> Hi Walaa,
>>>
>>> Thanks for clarifying the aspects of non-determinism. Let me try to
>>> address your questions.
>>>
>>> 1. This is my interpretation of the current spec: The view is executed
>>> when it's being referenced in a SQL statement. That statement contains the
>>> information for the query engine to resolve the catalog of the view. The
>>> query engine then uses that information to fetch the view metadata from the
>>> catalog. It also needs to temporarily keep track of which catalog it used
>>> to fetch the view metadata. It can then use that information to resolve the
>>> table references in the view's SQL definition in case no default catalog is
>>> specified.
>>>
>>> 2. The important part is that the catalog can be referenced at execution
>>> time. As long as that's the case I would assume the view can be created in
>>> any catalog.
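
The resolution steps in point 1 can be sketched roughly as follows. This is only an illustration under stated assumptions, not Iceberg library code: `TableRef`, `ViewVersion`, and `resolve` are hypothetical names, references are assumed to be already parsed into catalog/namespace/name parts, and the `default-namespace` value is chosen purely for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TableRef:
    """A parsed table reference (hypothetical); catalog/namespace may be missing."""
    catalog: Optional[str]
    namespace: Tuple[str, ...]
    name: str


@dataclass(frozen=True)
class ViewVersion:
    """Hypothetical subset of the view metadata that matters for resolution."""
    default_catalog: Optional[str]      # "default-catalog"; may be unset
    default_namespace: Tuple[str, ...]  # "default-namespace"


def resolve(ref: TableRef, view_catalog: str, version: ViewVersion) -> TableRef:
    """Fill in the missing parts of a reference found in the view's SQL.

    view_catalog is the name the engine used to fetch the view for this
    query. The engine tracks it temporarily; it is not read from metadata.
    """
    # Explicit catalog wins, then "default-catalog", then the view catalog.
    catalog = ref.catalog or version.default_catalog or view_catalog
    namespace = ref.namespace or version.default_namespace
    return TableRef(catalog, namespace, ref.name)


# CREATE VIEW catalogA.sales.monthly_orders AS SELECT * FROM sales.orders;
version = ViewVersion(default_catalog=None, default_namespace=("sales",))
orders = TableRef(catalog=None, namespace=("sales",), name="orders")

# View queried as catalogA.sales.monthly_orders
print(resolve(orders, "catalogA", version))
# Same physical catalog registered as catalogB in another session
print(resolve(orders, "catalogB", version))
```

In this sketch the partial reference follows whatever name the view was queried under, which is the late binding behavior discussed in this thread; setting `default_catalog` pins the catalog name instead.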
>>>
>>>
>>> I think your point is really valuable because the current specification
>>> can lead to some unintuitive behavior. For example, consider the following
>>> statement:
>>>
>>> CREATE VIEW catalogA.sales.monthly_orders AS SELECT * from sales.orders;
>>>
>>> If the session default catalog is not "catalogA", the "sales.orders" in
>>> the view query would not be the same as just referencing "sales.orders" in
>>> a normal SQL statement. This is because without a "default-catalog", the
>>> catalog name of "sales.orders" would default to "catalogA".
>>>
>>>
>>> However, I like the current design of the view spec, because it has the
>>> "closure" property. Because the "view catalog" has to be known when
>>> executing a view, all the information required to resolve the
>>> table identifiers is contained in the view metadata (and the "view
>>> catalog"). I think that if you make the identifier resolution dependent on
>>> external parameters, it hinders portability.
>>>
>>> Thanks,
>>>
>>> Jan
>>> On 4/22/25 18:36, Walaa Eldin Moustafa wrote:
>>>
>>> Hi Jan,
>>>
>>> Thanks for the thoughtful feedback.
>>>
>>> I think it’s important we clarify a key point before going deeper:
>>>
>>> Non-determinism is not caused by session fallback behavior; it’s a
>>> *fundamental limitation of using table identifiers* alone, regardless of
>>> whether we use the current rule, the proposed fallback to the session’s
>>> default catalog, or even early vs. late binding.
>>>
>>> The same fully qualified identifier (e.g., catalogA.namespace.table) can
>>> resolve to different objects depending solely on engine-specific routing
>>> logic or catalog aliases. So determinism isn’t guaranteed just because an
>>> identifier is "fully qualified." The only reliable anchor for identity is
>>> the UUID. That’s why the proposed use of UUIDs is not just a hardening
>>> strategy. It’s the actual fix for correctness.
>>>
>>> To move the conversation forward, could you help clarify two things in
>>> the context of the current spec:
>>>
>>> * Where in the metadata is the “view catalog” stored, so that an engine
>>> knows to fall back to it if default-catalog is null?
>>>
>>> * Are we even allowed to create views in the session's default catalog
>>> (i.e., without specifying a catalog) in the current Iceberg spec?
>>>
>>> These questions are important because if we can’t unambiguously recover
>>> the "view catalog" from metadata, then defaulting to it is problematic. And
>>> if views can't be created in the default catalog, then the fallback rule
>>> doesn’t generalize.
>>>
>>> Thanks,
>>> Walaa.
>>>
>>> On Tue, Apr 22, 2025 at 3:14 AM Jan Kaul <jank...@mailbox.org.invalid>
>>> wrote:
>>>
>>>> Hi Walaa,
>>>>
>>>> Thank you for your proposal. If I understood correctly, your proposal is
>>>> composed of three parts:
>>>>
>>>> - session default catalog as fallback for "default-catalog"
>>>>
>>>> - session default namespace as fallback for "default-namespace"
>>>>
>>>> - Late binding + UUID validation
>>>>
>>>> I have some comments regarding these points.
>>>>
>>>>
>>>> 1. Session default catalog as fallback for "default-catalog"
>>>>
>>>> Introducing a behavior that depends on the current session setup is in
>>>> my opinion the definition of "non-determinism". You could be running the
>>>> same query-engine and catalog-setup on different days, with different
>>>> default session catalogs (which is rather common), and would be getting
>>>> different results.
>>>>
>>>> With the current behavior, by contrast, the view always produces the same
>>>> results. The current behavior has some rough edges in very niche use cases,
>>>> but I think it is solid for most use cases.
>>>>
>>>> 2. Session default namespace as fallback for "default-namespace"
>>>>
>>>> Similar to the above.
>>>>
>>>> 3. Late binding + UUID validation
>>>>
>>>> If I understand it correctly, the current implementation already uses
>>>> late binding.
>>>>
>>>> Generally, having UUID validation makes the setup more robust, which is
>>>> great. However, having UUID validation still requires us to have a portable
>>>> table identifier specification. Even if we have the UUIDs of the referenced
>>>> tables from the view, there simply isn't an interface that lets us use
>>>> those UUIDs. The catalog interface is defined in terms of table
>>>> identifiers.
>>>>
>>>> So we always require a working catalog setup and suitable table
>>>> identifiers to obtain the table metadata. We can use the UUIDs to verify
>>>> that we loaded the correct table, but only after we have used some
>>>> identifier. This means there is no way of using UUIDs without a
>>>> functioning catalog/identifier setup.
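
The ordering constraint described above (load by identifier first, check the UUID second) can be sketched as below. This is a toy model under stated assumptions: `InMemoryCatalog`, `Table`, `TableIdentityMismatch`, and `load_and_verify` are hypothetical names, not part of any Iceberg API, and the UUID strings are placeholders.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Identifier = Tuple[str, ...]


@dataclass(frozen=True)
class Table:
    identifier: Identifier
    uuid: str


class TableIdentityMismatch(Exception):
    """Raised when the loaded table is not the one the view was built on."""


class InMemoryCatalog:
    """Hypothetical catalog; lookups are keyed by identifier only."""

    def __init__(self, tables: Dict[Identifier, Table]) -> None:
        self._tables = dict(tables)

    def load_table(self, identifier: Identifier) -> Table:
        return self._tables[identifier]


def load_and_verify(catalog: InMemoryCatalog, identifier: Identifier,
                    expected_uuid: str) -> Table:
    # Step 1: an identifier is needed to reach the table at all.
    table = catalog.load_table(identifier)
    # Step 2: the UUID can only confirm or reject what was loaded.
    if table.uuid != expected_uuid:
        raise TableIdentityMismatch(
            f"{'.'.join(identifier)}: expected {expected_uuid}, got {table.uuid}"
        )
    return table


orders_id = ("sales", "orders")
catalog = InMemoryCatalog({orders_id: Table(orders_id, "uuid-1")})
load_and_verify(catalog, orders_id, "uuid-1")  # identifier and UUID agree
```

If the table behind `sales.orders` were dropped and recreated, the stored UUID would no longer match and the check would raise, but only after the identifier had been resolved successfully.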
>>>>
>>>>
>>>> In conclusion, I prefer the current behavior for "default-catalog"
>>>> because it is more deterministic in my opinion. And I think the current
>>>> spec does a good job for multi-engine table identifier resolution. I see
>>>> the UUID validation more as an additional hardening strategy.
>>>>
>>>> Thanks
>>>>
>>>> Jan
>>>> On 4/21/25 17:38, Walaa Eldin Moustafa wrote:
>>>>
>>>> Thanks Renjie!
>>>>
>>>> The existing spec has some guidance on resolving catalogs on the fly
>>>> already (to address the case of view text with table identifiers missing
>>>> the catalog part). The guidance is to use the catalog where the view is
>>>> stored. But I find this rule hard to interpret or use. The catalog itself
>>>> is a logical construct—such as a federated catalog that delegates to
>>>> multiple physical backends (e.g., HMS and REST). In such cases, the catalog
>>>> (e.g., `my_catalog` in `my_catalog.namespace1.table1`) doesn’t physically
>>>> store the tables; it only routes requests to underlying stores. Therefore,
>>>> defaulting identifier resolution based on the catalog where the view is
>>>> "stored" doesn’t align with how catalogs actually behave in practice.
>>>>
>>>> Thanks,
>>>> Walaa.
>>>>
>>>> On Sun, Apr 20, 2025 at 11:17 PM Renjie Liu <liurenjie2...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi, Walaa:
>>>>>
>>>>> Thanks for the proposal.
>>>>>
>>>>> I've reviewed the doc, but in general I have some concerns with
>>>>> resolving catalog names on the fly with query-engine-defined catalog
>>>>> names. This introduces some flexibility at first glance, but also makes
>>>>> misconfiguration difficult to explain.
>>>>>
>>>>> But I agree with one part: we should store the resolved table UUIDs in
>>>>> view metadata, as table/view renaming may introduce errors that are
>>>>> difficult for users to understand.
>>>>>
>>>>> On Sat, Apr 19, 2025 at 3:02 AM Walaa Eldin Moustafa <
>>>>> wa.moust...@gmail.com> wrote:
>>>>>
>>>>>> Hi Everyone,
>>>>>>
>>>>>> Looking forward to keeping up the momentum and closing out the MV
>>>>>> spec as well. I’m hoping we can proceed to a vote next week.
>>>>>>
>>>>>> Here is a summary in case that helps. The proposal outlines a
>>>>>> strategy for handling table identifiers in Iceberg view metadata, with
>>>>>> the goal of ensuring correctness, portability, and engine
>>>>>> compatibility. It recommends resolving table identifiers at read time
>>>>>> (late binding) rather than creation time, and introduces UUID-based
>>>>>> validation to maintain identity guarantees across engines or sessions.
>>>>>> It also revises how default-catalog and default-namespace are handled
>>>>>> (defaulting both to the session context if not explicitly set) to
>>>>>> better align with engine behavior and improve cross-engine
>>>>>> interoperability.
>>>>>>
>>>>>> Please let me know your thoughts.
>>>>>>
>>>>>> Thanks,
>>>>>> Walaa.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Apr 16, 2025 at 2:03 PM Walaa Eldin Moustafa <
>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks Eduard and Sung! I have addressed the comments.
>>>>>>>
>>>>>>> One key point to keep in mind is that catalog names in the spec
>>>>>>> refer to logical catalogs, i.e., the first part of a three-part
>>>>>>> identifier. These correspond to Spark's DataSourceV2 catalogs, Trino
>>>>>>> connectors, and similar constructs. This is a level of abstraction
>>>>>>> above physical catalogs, which are not referenced or used in the view
>>>>>>> spec. The reason is that table identifiers in the view definition/text
>>>>>>> itself refer to logical catalogs, not physical ones (since they
>>>>>>> interface directly with the engine and not a specific metastore).
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Walaa.
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Apr 16, 2025 at 6:15 AM Sung Yun <sungwy...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thank you Walaa for the proposal. I think view portability is a
>>>>>>>> very important topic for us to continue discussing as it relies on many
>>>>>>>> assumptions within the data ecosystem for it to function like you've
>>>>>>>> highlighted well in the document.
>>>>>>>>
>>>>>>>> I've added a few comments around how this may impact the permission
>>>>>>>> questions the engines will be asking, and whether that is the desired
>>>>>>>> behavior.
>>>>>>>>
>>>>>>>> Sung
>>>>>>>>
>>>>>>>> On Wed, Apr 16, 2025 at 7:32 AM Eduard Tudenhöfner <
>>>>>>>> etudenhoef...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> Thanks Walaa for tackling this problem. I've added a few comments
>>>>>>>>> to get a better understanding of what this will look like in the actual
>>>>>>>>> implementation.
>>>>>>>>>
>>>>>>>>> Eduard
>>>>>>>>>
>>>>>>>>> On Tue, Apr 15, 2025 at 7:09 PM Walaa Eldin Moustafa <
>>>>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Everyone,
>>>>>>>>>>
>>>>>>>>>> Starting this thread to resume our discussion on how to reference
>>>>>>>>>> table identifiers from Iceberg metadata, a key aspect of the view
>>>>>>>>>> specification, particularly in relation to the MV (materialized view)
>>>>>>>>>> extensions.
>>>>>>>>>>
>>>>>>>>>> I had the chance to speak offline with a few community members to
>>>>>>>>>> better understand how the current spec is being interpreted. Those
>>>>>>>>>> conversations served as inputs to a new proposal on how table
>>>>>>>>>> identifier references could be represented in metadata.
>>>>>>>>>>
>>>>>>>>>> You can find the proposal here [1]. I look forward to your
>>>>>>>>>> feedback and working together to move this forward so we can
>>>>>>>>>> finalize the MV spec as well.
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>> https://docs.google.com/document/d/1-I2v_OqBgJi_8HVaeH1u2jowghmXoB8XaJLzPBa_Hg8/edit?tab=t.0
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Walaa.
>>>>>>>>>>
>>>>>>>>>
