Hi Jan,
Thanks for the thoughtful feedback.
I think it’s important we clarify a key point before
going deeper:
Non-determinism is not caused by session fallback
behavior—it’s a *fundamental limitation of using table
identifiers* alone, regardless of whether we use the
current rule, the proposed fallback to the session’s
default catalog, or even early vs. late binding.
The same fully qualified identifier (e.g.,
catalogA.namespace.table) can resolve to different
objects depending solely on engine-specific routing
logic or catalog aliases. So determinism isn’t
guaranteed just because an identifier is "fully
qualified." The only reliable anchor for identity is the
UUID. That’s why the proposed use of UUIDs is not just a
hardening strategy. It’s the actual fix for correctness.
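To make this concrete, here is a minimal sketch. The two engine configurations, the `resolve` helper, and the UUID strings are all made up for illustration; they are not part of the spec or of any engine API. It only shows that the same fully qualified identifier can map to different tables, and that comparing UUIDs is what exposes the difference.

```python
# Hypothetical engine configurations: each engine maps the logical catalog
# name "catalogA" to whatever backend its operator happened to configure.
ENGINE_1 = {"catalogA": {("namespace", "table"): "uuid-1111"}}
ENGINE_2 = {"catalogA": {("namespace", "table"): "uuid-2222"}}


def resolve(engine_catalogs, identifier):
    """Return the UUID of the table that `identifier` resolves to in this engine."""
    catalog, namespace, table = identifier.split(".")
    return engine_catalogs[catalog][(namespace, table)]


ident = "catalogA.namespace.table"
uuid_1 = resolve(ENGINE_1, ident)
uuid_2 = resolve(ENGINE_2, ident)

# Same identifier text, different underlying objects: only the UUID
# comparison reveals that the two engines are looking at different tables.
same_object = uuid_1 == uuid_2
```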
To move the conversation forward, could you help clarify
two things in the context of the current spec:
* Where in the metadata is the “view catalog” stored, so
that an engine knows to fall back to it if
default-catalog is null?
* Are we even allowed to create views in the session's
default catalog (i.e., without specifying a catalog) in
the current Iceberg spec?
These questions are important because if we can’t
unambiguously recover the "view catalog" from metadata,
then defaulting to it is problematic. And if views can't
be created in the default catalog, then the fallback
rule doesn’t generalize.
Thanks,
Walaa.
On Tue, Apr 22, 2025 at 3:14 AM Jan Kaul
<jank...@mailbox.org.invalid> wrote:
Hi Walaa,
Thank you for your proposal. If I understood
correctly, your proposal is composed of three parts:
- session default catalog as fallback for
"default-catalog"
- session default namespace as fallback for
"default-namespace"
- Late binding + UUID validation
I have some comments regarding these points.
1. Session default catalog as fallback for
"default-catalog"
Introducing a behavior that depends on the current
session setup is in my opinion the definition of
"non-determinism". You could be running the same
query-engine and catalog-setup on different days,
with different default session catalogs (which is
rather common), and would be getting different results.
Whereas with the current behavior, the view always
produces the same results. The current behavior has
some rough edges in very niche use cases, but I think
it is solid for most use cases.
2. Session default namespace as fallback for
"default-namespace"
Similar to the above.
3. Late binding + UUID validation
If I understand it correctly, the current
implementation already uses late binding.
Generally, having UUID validation makes the setup
more robust. Which is great. However, having UUID
validation still requires us to have a portable
table identifier specification. Even if we have the
UUIDs of the referenced tables from the view, there
simply isn't an interface that lets us use those
UUIDs. The catalog interface is defined in terms of
table identifiers.
So we always require a working catalog setup and
suitable table identifiers to obtain the table
metadata. We can use the UUIDs to verify if we
loaded the correct table. But this can only be done
after we used some identifier. Which means there is
no way of using UUIDs without a functioning
catalog/identifier setup.
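As a rough sketch of that ordering (load by identifier first, verify the UUID second): `InMemoryCatalog`, `Table`, `TableIdentityMismatch`, and `load_and_validate` below are hypothetical names used only for illustration, not the Iceberg catalog API. The one assumption carried over from the discussion is that the catalog can only be queried by identifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Table:
    identifier: str
    uuid: str


class InMemoryCatalog:
    """Hypothetical stand-in for a catalog that can only load by identifier."""

    def __init__(self, tables):
        self._tables = {t.identifier: t for t in tables}

    def load_table(self, identifier):
        return self._tables[identifier]


class TableIdentityMismatch(Exception):
    """Raised when the loaded table is not the one the view recorded."""


def load_and_validate(catalog, identifier, expected_uuid):
    # Step 1: the identifier is the only way into the catalog.
    table = catalog.load_table(identifier)
    # Step 2: the UUID can only be checked after the load has succeeded.
    if table.uuid != expected_uuid:
        raise TableIdentityMismatch(
            f"{identifier} resolved to {table.uuid}, expected {expected_uuid}"
        )
    return table
```

In this sketch the UUID never helps to find the table; it only confirms or rejects whatever the identifier lookup returned.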
In conclusion, I prefer the current behavior for
"default-catalog" because it is more deterministic
in my opinion. And I think the current spec does a
good job for multi-engine table identifier
resolution. I see the UUID validation more as an
additional hardening strategy.
Thanks
Jan
On 4/21/25 17:38, Walaa Eldin Moustafa wrote:
Thanks Renjie!
The existing spec has some guidance on resolving
catalogs on the fly already (to address the case of
view text with table identifiers missing the
catalog part). The guidance is to use the catalog
where the view is stored. But I find this rule hard
to interpret or use. The catalog itself is a
logical construct—such as a federated catalog that
delegates to multiple physical backends (e.g., HMS
and REST). In such cases, the catalog (e.g.,
`my_catalog` in `my_catalog.namespace1.table1`)
doesn’t physically store the tables; it only routes
requests to underlying stores. Therefore,
defaulting identifier resolution based on the
catalog where the view is "stored" doesn’t align
with how catalogs actually behave in practice.
Thanks,
Walaa.
On Sun, Apr 20, 2025 at 11:17 PM Renjie Liu
<liurenjie2...@gmail.com> wrote:
Hi, Walaa:
Thanks for the proposal.
I've reviewed the doc, but in general I have
some concerns with resolving catalog names on
the fly with query engine defined catalog
names. This introduces some flexibility at
first glance, but also makes misconfiguration
difficult to explain.
But I agree with one part: we should store the
resolved table UUID in the view metadata, as
table/view renaming may introduce errors that are
difficult for users to understand.
On Sat, Apr 19, 2025 at 3:02 AM Walaa Eldin
Moustafa <wa.moust...@gmail.com> wrote:
Hi Everyone,
Looking forward to keeping up the momentum
and closing out the MV spec as well. I’m
hoping we can proceed to a vote next week.
Here is a summary in case that helps. The
proposal outlines a strategy for handling
table identifiers in Iceberg view metadata,
with the goal of ensuring correctness,
portability, and engine compatibility. It
recommends resolving table identifiers at
read time (late binding) rather than
creation time, and introduces UUID-based
validation to maintain identity guarantees
across engines or sessions. It also
revises how default-catalog and
default-namespace are handled (defaulting
both to the session context if not
explicitly set) to better align with engine
behavior and improve cross-engine
interoperability.
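A minimal sketch of the proposed defaulting rule, under simplifying assumptions: namespaces are treated as a single-level string (the spec allows multi-level namespaces), and the `qualify` helper and the `session` dictionary are hypothetical names for illustration only.

```python
def qualify(identifier, view_defaults, session):
    """Expand a possibly partial identifier to catalog.namespace.table.

    view_defaults: the view's "default-catalog" / "default-namespace", if set.
    session: the engine session's current catalog and namespace (hypothetical shape).
    """
    parts = identifier.split(".")
    if len(parts) == 3:
        # Already fully qualified: no defaults are consulted.
        return identifier
    # Proposed rule: fall back to the session context when the view
    # metadata does not set the default explicitly.
    catalog = view_defaults.get("default-catalog") or session["catalog"]
    if len(parts) == 2:
        return f"{catalog}.{identifier}"
    namespace = view_defaults.get("default-namespace") or session["namespace"]
    return f"{catalog}.{namespace}.{identifier}"
```

For example, with a session of `{"catalog": "c", "namespace": "n"}` and no defaults in the view metadata, `qualify("t", {}, session)` yields `c.n.t`; if the view sets `default-catalog` to `vc`, `qualify("n.t", ...)` yields `vc.n.t` regardless of the session catalog.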
Please let me know your thoughts.
Thanks,
Walaa.
On Wed, Apr 16, 2025 at 2:03 PM Walaa Eldin
Moustafa <wa.moust...@gmail.com> wrote:
Thanks Eduard and Sung! I have
addressed the comments.
One key point to keep in mind is that
catalog names in the spec refer to
logical catalogs—i.e., the first part
of a three-part identifier. These
correspond to Spark's DataSourceV2
catalogs, Trino connectors, and similar
constructs. This is a level of
abstraction above physical catalogs,
which are not referenced or used in the
view spec. The reason is that table
identifiers in the view definition/text
itself refer to logical catalogs, not
physical ones (since they interface
directly with the engine and not a
specific metastore).
Thanks,
Walaa.
On Wed, Apr 16, 2025 at 6:15 AM Sung
Yun <sungwy...@gmail.com> wrote:
Thank you Walaa for the proposal. I
think view portability is a very
important topic for us to continue
discussing, as it relies on many
assumptions within the data
ecosystem in order to function, as
you've highlighted well in the
document.
I've added a few comments around
how this may impact the permission
questions the engines will be
asking, and whether that is the
desired behavior.
Sung
On Wed, Apr 16, 2025 at 7:32 AM
Eduard Tudenhöfner
<etudenhoef...@apache.org> wrote:
Thanks Walaa for tackling this
problem. I've added a few
comments to get a better
understanding of what this will
look like in the actual
implementation.
Eduard
On Tue, Apr 15, 2025 at 7:09 PM
Walaa Eldin Moustafa
<wa.moust...@gmail.com> wrote:
Hi Everyone,
Starting this thread to
resume our discussion on
how to reference table
identifiers from Iceberg
metadata, a key aspect of
the view specification,
particularly in relation to
the MV (materialized view)
extensions.
I had the chance to speak
offline with a few
community members to better
understand how the current
spec is being interpreted.
Those conversations served
as inputs to a new proposal
on how table identifier
references could be
represented in metadata.
You can find the proposal
here [1]. I look forward to
your feedback and working
together to move this
forward so we can finalize
the MV spec as well.
[1]
https://docs.google.com/document/d/1-I2v_OqBgJi_8HVaeH1u2jowghmXoB8XaJLzPBa_Hg8/edit?tab=t.0
Thanks,
Walaa.