Re: [DISCUSS] Table Identifiers in Iceberg View Spec

Jan Kaul Fri, 25 Apr 2025 07:02:16 -0700

@Walaa:

I would argue that when you run a CREATE VIEW statement, the query engine knows which catalog the view is being created in. So even though we typically use late binding to resolve the view catalog at query time, that catalog can also be used at creation time.

The query engine would need to keep track of the "view catalog" in which the view is going to be created. It can use that catalog to resolve partial table identifiers if "default-catalog" is not set.

It can lead to some unintuitive behavior, where partial identifiers in the view query resolve to a different catalog compared to using them outside of a view.


CREATE VIEW catalogA.sales.monthly_orders AS SELECT * from sales.orders;

If the session default catalog is not "catalogA", the "sales.orders" in the view query would not be the same as just referencing "sales.orders" in a normal SQL statement. This is because without a "default-catalog", the catalog name of "sales.orders" would default to "catalogA", which is the view's catalog.
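
To make the difference concrete, here is a minimal sketch of the two resolution paths. It is not taken from any engine or from the Iceberg libraries: the function names, the rule that a two-part identifier is missing its catalog, and the session default "catalogB" are all assumptions made for illustration. Handling of "default-namespace" is left out.

```python
from typing import Optional


def qualify_in_view(identifier: str, view_catalog: str,
                    default_catalog: Optional[str] = None) -> str:
    """Qualify an identifier that appears inside a view's SQL text.

    Assumption: a two-part identifier ("namespace.table") is missing its
    catalog; anything with three or more parts is already fully qualified.
    """
    if len(identifier.split(".")) >= 3:
        return identifier
    # "default-catalog" wins when set; otherwise fall back to the view's catalog.
    catalog = default_catalog if default_catalog is not None else view_catalog
    return f"{catalog}.{identifier}"


def qualify_in_session(identifier: str, session_catalog: str) -> str:
    """Qualify an identifier used directly in a statement, outside any view."""
    if len(identifier.split(".")) >= 3:
        return identifier
    return f"{session_catalog}.{identifier}"


# The view lives in catalogA; the session default is (hypothetically) catalogB.
in_view = qualify_in_view("sales.orders", view_catalog="catalogA")
in_session = qualify_in_session("sales.orders", session_catalog="catalogB")
print(in_view)     # catalogA.sales.orders
print(in_session)  # catalogB.sales.orders
```

The same text "sales.orders" ends up pointing at two different catalogs depending on whether it is read inside the view or typed directly in the session.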


Thanks,

Jan

On 4/25/25 04:05, Manu Zhang wrote:


    For example, if we want to validate that the tables referenced in
    the view exist, how can we do that when default-catalog isn't
    defined, since the view hasn't been created or loaded yet?

I don't think this is related to the view spec. How do we validate that a table exists without a default catalog, or do we always use the current session catalog?


Thanks,
Manu

On Fri, Apr 25, 2025 at 5:59 AM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:


    Hi Jan,

    I think we still share the same understanding. Just to clarify:
    when I referred to late binding as “similar” to the proposal, I
    was acknowledging the distinction between view-level and
    table-level resolution. But as you noted, both follow a late
    binding model.

    That said, this still raises an interesting question and a
    potential gap: if default-catalog is only defined at query time,
    how should resolution work during view creation? For example, if
    we want to validate that the tables referenced in the view exist,
    how can we do that when default-catalog isn't defined, since the
    view hasn't been created or loaded yet?

    Thanks,
    Walaa.

    On Thu, Apr 24, 2025 at 7:02 AM Jan Kaul
    <jank...@mailbox.org.invalid> wrote:

        Yes, I have the same understanding. The view catalog is
        resolved at query time.

        As you mentioned before, it's good to distinguish between the
        physical catalog and its reference used in SQL statements.
        The important part is that the physical catalog of the view
        and the tables referenced in its definition stay consistent.
        You could create a view in a given physical catalog by
        referring to it as "catalogA", as in your first point. If you
        then, given a different setup, refer to the same physical
        catalog as "catalogB" in another session/environment, the
        behavior should still work.

        I would however rephrase your last point. Late binding applies
        to the view catalog name and by extension to all partial table
        references when no "default-catalog" is present. Resolving the
        view catalog name at query time is not opposed to storing the
        view metadata in a catalog.

        Or maybe I don't entirely understand what you mean.

        Thanks

        Jan

        On 4/24/25 00:32, Walaa Eldin Moustafa wrote:

        Hi Jan,

        > The view is executed when it's being referenced in a SQL
        statement. That statement contains the information for the
        query engine to resolve the catalog of the view.

        If I’m understanding correctly, that means:

        * If the view is queried as SELECT * FROM
        catalogA.namespace.view, then catalogA is considered the
        view’s catalog.

        * If the same view is later queried as SELECT * FROM
        catalogB.namespace.view (after renaming catalogA to catalogB,
        and keeping everything else the same), then catalogB becomes
        the view’s catalog.

        Is that interpretation correct? If so, it sounds to me like
        the catalog is resolved at query time, based on how the view
        is referenced, not from any stored metadata. That would imply
        some sort of a late binding behavior (similar to the
        proposal), as opposed to using some catalog that "stores" the
        view definition.

        Thanks,
        Walaa

        On Tue, Apr 22, 2025 at 11:01 AM Jan Kaul
        <jank...@mailbox.org.invalid> wrote:

            Hi Walaa,

            Thanks for clarifying the aspects of non-determinism. Let
            me try to address your questions.

            1. This is my interpretation of the current spec: The
            view is executed when it's being referenced in a SQL
            statement. That statement contains the information for
            the query engine to resolve the catalog of the view. The
            query engine then uses that information to fetch the view
            metadata from the catalog. It also needs to temporarily
            keep track of which catalog it used to fetch the view
            metadata. It can then use that information to resolve the
            table references in the view's SQL definition in case no
            default catalog is specified.

            2. The important part is that the catalog can be
            referenced at execution time. As long as that's the case
            I would assume the view can be created in any catalog.


            I think your point is really valuable because the current
            specification can lead to some unintuitive behavior.
            Take, for example, the following statement:

            CREATE VIEW catalogA.sales.monthly_orders AS SELECT *
            from sales.orders;

            If the session default catalog is not "catalogA", the
            "sales.orders" in the view query would not be the same as
            just referencing "sales.orders" in a normal SQL
            statement. This is because without a "default-catalog",
            the catalog name of "sales.orders" would default to
            "catalogA".


            However, I like the current design of the view spec,
            because it has the "closure" property. Because of the
            fact that the "view catalog" has to be known when
            executing a view, all the information required to resolve
            the table identifiers is contained in the view metadata
            (and the "view catalog"). I think that if you make the
            identifier resolution dependent on external parameters,
            it hinders portability.

            Thanks,

            Jan

            On 4/22/25 18:36, Walaa Eldin Moustafa wrote:

            Hi Jan,

            Thanks for the thoughtful feedback.

            I think it’s important we clarify a key point before
            going deeper:

            Non-determinism is not caused by session fallback
            behavior—it’s a *fundamental limitation of using table
            identifiers* alone, regardless of whether we use the
            current rule, the proposed fallback to the session’s
            default catalog, or even early vs. late binding.

            The same fully qualified identifier (e.g.,
            catalogA.namespace.table) can resolve to different
            objects depending solely on engine-specific routing
            logic or catalog aliases. So determinism isn’t
            guaranteed just because an identifier is "fully
            qualified." The only reliable anchor for identity is the
            UUID. That’s why the proposed use of UUIDs is not just a
            hardening strategy. It’s the actual fix for correctness.

            To move the conversation forward, could you help clarify
            two things in the context of the current spec:

            * Where in the metadata is the “view catalog” stored, so
            that an engine knows to fall back to it if
            default-catalog is null?

            * Are we even allowed to create views in the session's
            default catalog (i.e., without specifying a catalog) in
            the current Iceberg spec?

            These questions are important because if we can’t
            unambiguously recover the "view catalog" from metadata,
            then defaulting to it is problematic. And if views can't
            be created in the default catalog, then the fallback
            rule doesn’t generalize.

            Thanks,
            Walaa.


            On Tue, Apr 22, 2025 at 3:14 AM Jan Kaul
            <jank...@mailbox.org.invalid> wrote:

                Hi Walaa,

                Thank you for your proposal. If I understood
                correctly, your proposal is composed of three parts:

                - session default catalog as fallback for
                "default-catalog"

                - session default namespace as fallback for
                "default-namespace"

                - Late binding + UUID validation

                I have some comments regarding these points.


                        1. Session default catalog as fallback for
                        "default-catalog"

                Introducing a behavior that depends on the current
                session setup is in my opinion the definition of
                "non-determinism". You could be running the same
                query-engine and catalog-setup on different days,
                with different default session catalogs (which is
                rather common), and would be getting different results.

                Whereas with the current behavior, the view always
                produces the same results. The current behavior has
                some rough edges in very niche use cases but I think
                is solid for most use cases.


                        2. Session default namespace as fallback for
                        "default-namespace"

                Similar to the above.


                        3. Late binding + UUID validation

                If I understand it correctly, the current
                implementation already uses late binding.

                Generally, having UUID validation makes the setup
                more robust. Which is great. However, having UUID
                validation still requires us to have a portable
                table identifier specification. Even if we have the
                UUIDs of the referenced tables from the view, there
                simply isn't an interface that lets us use those
                UUIDs. The catalog interface is defined in terms of
                table identifiers.

                So we always require a working catalog setup and
                suitable table identifiers to obtain the table
                metadata. We can use the UUIDs to verify if we
                loaded the correct table. But this can only be done
                after we used some identifier. Which means there is
                no way of using UUIDs without a functioning
                catalog/identifier setup.


                In conclusion, I prefer the current behavior for
                "default-catalog" because it is more deterministic
                in my opinion. And I think the current spec does a
                good job for multi-engine table identifier
                resolution. I see the UUID validation more as an
                additional hardening strategy.

                Thanks

                Jan

                On 4/21/25 17:38, Walaa Eldin Moustafa wrote:

                Thanks Renjie!

                The existing spec has some guidance on resolving
                catalogs on the fly already (to address the case of
                view text with table identifiers missing the
                catalog part). The guidance is to use the catalog
                where the view is stored. But I find this rule hard
                to interpret or use. The catalog itself is a
                logical construct—such as a federated catalog that
                delegates to multiple physical backends (e.g., HMS
                and REST). In such cases, the catalog (e.g.,
                `my_catalog` in `my_catalog.namespace1.table1`)
                doesn’t physically store the tables; it only routes
                requests to underlying stores. Therefore,
                defaulting identifier resolution based on the
                catalog where the view is "stored" doesn’t align
                with how catalogs actually behave in practice.

                Thanks,
                Walaa.

                On Sun, Apr 20, 2025 at 11:17 PM Renjie Liu
                <liurenjie2...@gmail.com> wrote:

                    Hi, Walaa:

                    Thanks for the proposal.

                    I've reviewed the doc, but in general I have
                    some concerns with resolving catalog names on
                    the fly with query engine defined catalog
                    names. This introduces some flexibility at
                    first glance, but also makes misconfiguration
                    difficult to explain.

                    But I agree with one part: we should store the
                    resolved table UUID in the view metadata, as
                    table/view renaming may introduce errors that are
                    difficult for users to understand.

                    On Sat, Apr 19, 2025 at 3:02 AM Walaa Eldin
                    Moustafa <wa.moust...@gmail.com> wrote:

                        Hi Everyone,

                        Looking forward to keeping up the momentum
                        and closing out the MV spec as well. I’m
                        hoping we can proceed to a vote next week.

                        Here is a summary in case that helps. The
                        proposal outlines a strategy for handling
                        table identifiers in Iceberg view metadata,
                        with the goal of ensuring correctness,
                        portability, and engine compatibility. It
                        recommends resolving table identifiers at
                        read time (late binding) rather than
                        creation time, and introduces UUID-based
                        validation to maintain identity guarantees
                        across engines or sessions. It also
                        revises how default-catalog and
                        default-namespace are handled (defaulting
                        both to the session context if not
                        explicitly set) to better align with engine
                        behavior and improve cross-engine
                        interoperability.

                        Please let me know your thoughts.

                        Thanks,
                        Walaa.



                        On Wed, Apr 16, 2025 at 2:03 PM Walaa Eldin
                        Moustafa <wa.moust...@gmail.com> wrote:

                            Thanks Eduard and Sung! I have
                            addressed the comments.

                            One key point to keep in mind is that
                            catalog names in the spec refer to
                            logical catalogs—i.e., the first part
                            of a three-part identifier. These
                            correspond to Spark's DataSourceV2
                            catalogs, Trino connectors, and similar
                            constructs. This is a level of
                            abstraction above physical catalogs,
                            which are not referenced or used in the
                            view spec. The reason is that table
                            identifiers in the view definition/text
                            itself refer to logical catalogs, not
                            physical ones (since they interface
                            directly with the engine and not a
                            specific metastore).

                            Thanks,
                            Walaa.


                            On Wed, Apr 16, 2025 at 6:15 AM Sung
                            Yun <sungwy...@gmail.com> wrote:

                                Thank you Walaa for the proposal. I
                                think view portability is a very
                                important topic for us to continue
                                discussing as it relies on many
                                assumptions within the data
                                ecosystem for it to function like
                                you've highlighted well in the
                                document.

                                I've added a few comments around
                                how this may impact the permission
                                questions the engines will be
                                asking, and whether that is the
                                desired behavior.

                                Sung

                                On Wed, Apr 16, 2025 at 7:32 AM
                                Eduard Tudenhöfner
                                <etudenhoef...@apache.org> wrote:

                                    Thanks Walaa for tackling this
                                    problem. I've added a few
                                    comments to get a better
                                    understanding of what this will
                                    look like in the actual
                                    implementation.

                                    Eduard

                                    On Tue, Apr 15, 2025 at 7:09 PM
                                    Walaa Eldin Moustafa
                                    <wa.moust...@gmail.com> wrote:

                                        Hi Everyone,

                                        Starting this thread to
                                        resume our discussion on
                                        how to reference table
                                        identifiers from Iceberg
                                        metadata, a key aspect of
                                        the view specification,
                                        particularly in relation to
                                        the MV (materialized view)
                                        extensions.

                                        I had the chance to speak
                                        offline with a few
                                        community members to better
                                        understand how the current
                                        spec is being interpreted.
                                        Those conversations served
                                        as inputs to a new proposal
                                        on how table identifier
                                        references could be
                                        represented in metadata.

                                        You can find the proposal
                                        here [1]. I look forward to
                                        your feedback and working
                                        together to move this
                                        forward so we can finalize
                                        the MV spec as well.

                                        [1] https://docs.google.com/document/d/1-I2v_OqBgJi_8HVaeH1u2jowghmXoB8XaJLzPBa_Hg8/edit?tab=t.0

                                        Thanks,
                                        Walaa.
