Thanks Steven! So would you agree that resolution using default-catalog and default-namespace does not provide full determinism, and requires a supporting safety mechanism?
Thanks, Walaa. On Wed, May 7, 2025 at 10:30 PM Steven Wu <stevenz...@gmail.com> wrote: > > If the current model is considered deterministic, do you think > `default-catalog` and `default-namespace` fields provide enough determinism > to eliminate the need for UUIDs when storing table identifiers? > > I am fine with storing UUIDs for table identifiers in the view. Basically, > view creation resolves all referenced tables/views with UUIDs. View > consumers can validate resolved tables/views with the stored UUIDs and fail > the query if mismatch. > > The UUID change doesn't really change the table identifier resolution rule > though. It is more of a safety protection. > > On Wed, May 7, 2025 at 10:02 PM Walaa Eldin Moustafa < > wa.moust...@gmail.com> wrote: > >> Hi Steven, >> >> Thanks for the reply. >> >> > I agree with Dan that we shouldn't solve catalog naming in the Iceberg >> view spec. >> >> To clarify, I don't believe the proposal is trying to solve catalog >> naming. What it’s doing is simply this: >> >> * Proposing that table names inside views resolve the same way as they do >> elsewhere (e.g., queries). >> * Adopting a model that is already widely used and supported in the >> existing ecosystem, which allows for: >> -- Renaming catalog aliases >> -- Swapping catalog implementations behind consistent names >> -- Having different default catalog names across engines that still >> point to the same underlying tables >> >> These are common patterns in production data lakes. Saying Iceberg views >> cannot operate in those environments feels unrealistic. In practice, it >> means the spec breaks down in situations that users encounter regularly. >> >> > The recommendation of using engines’ current catalog and database can >> cause context-dependent resolution results. >> >> * As noted in the doc and earlier replies, fixing a catalog name doesn’t >> actually guarantee determinism either. 
All the failure scenarios above >> still apply even when a default-catalog is stored. >> * The current spec also allows default-catalog to be null, in which case >> it falls back to the view’s catalog, yet that catalog is determined based >> on how the view is referenced in the query, which would be considered >> non-deterministic based on the same criteria you shared. >> * The only true form of determinism here is UUID-based validation, which >> protects against silent drift in any resolution model. >> >> If the current model is considered deterministic, do you think >> `default-catalog` and `default-namespace` fields provide enough determinism >> to eliminate the need for UUIDs when storing table identifiers? >> Or put another way: Would you be comfortable relying solely on >> default-catalog + default-namespace + table name to re-identify the correct >> table, without UUID validation? >> >> +1 on involving other communities. I’m happy to help facilitate a >> cross-community discussion if we aren’t able to reach a resolution here. >> >> Thanks, >> Walaa. >> >> >> >> On Wed, May 7, 2025 at 9:20 PM Steven Wu <stevenz...@gmail.com> wrote: >> >>> I agree with Dan that we shouldn't solve catalog naming in the Iceberg >>> view spec. I am not convinced that the proposed change will make the table >>> identifier resolution more clear and portable. The recommendation of using >>> engines' current catalog and database can cause context dependent >>> resolution results, which seems non-deterministic to me. >>> >>> Walaa, you raised a point in the doc that the current catalog resolution >>> logic (default-catalog field, then view catalog) is challenging and >>> unrealistic for engines (like Spark and Trino). It will be great to get >>> more inputs from the broader community on this part. >>> >>> >>> On Tue, May 6, 2025 at 9:21 AM Benny Chow <btc...@gmail.com> wrote: >>> >>>> In Spark, I believe that the USE commands sets the current catalog and >>>> namespace. 
This affects both where the view is created and how unqualified >>>> table identifiers are resolved. I also don't see an issue with saving the >>>> current catalog and namespace into the view metadata's default-catalog and >>>> default-namespace fields. >>>> >>>> On Wed, Apr 30, 2025 at 5:12 PM Walaa Eldin Moustafa < >>>> wa.moust...@gmail.com> wrote: >>>> >>>>> > I think that's the lesser evil compared to Iceberg specifying how >>>>> engines should resolve identifiers >>>>> >>>>> I think this is also similar to the previous point. It is the other >>>>> way around. Right now the spec dictates how to resolve (through employing >>>>> a >>>>> view-specific `default-catalog` field). The proposal is suggesting to get >>>>> out of this space and let engines handle it similar to how they handle all >>>>> identifiers. >>>>> >>>>> On Wed, Apr 30, 2025 at 5:07 PM Walaa Eldin Moustafa < >>>>> wa.moust...@gmail.com> wrote: >>>>> >>>>>> > I thought "default-catalog" could be set via the USE command. >>>>>> >>>>>> Benny, I think this is a misconception or miscommunication. The USE >>>>>> command has no impact on the `default-catalog` field. In fact, the >>>>>> proposal's direction is exactly to establish that USE command should >>>>>> influence how tables are resolved, same like everywhere else. Right now >>>>>> it >>>>>> is not the case under the current spec. >>>>>> >>>>>> >>>>>> On Wed, Apr 30, 2025 at 3:17 PM Benny Chow <btc...@gmail.com> wrote: >>>>>> >>>>>>> > there is no SQL construct today to explicitly set default-catalog >>>>>>> >>>>>>> I thought "default-catalog" could be set via the USE command. >>>>>>> >>>>>>> I generally agree with Dan about requiring consistent catalog >>>>>>> names. I think that's the lesser evil compared to Iceberg specifying >>>>>>> how >>>>>>> engines should resolve identifiers. 
Another thing to consider is that >>>>>>> identifier resolution can be very expensive at query validation time if >>>>>>> identifiers need to be looked up from a bunch of places. Hopefully, it >>>>>>> should be possible to define a view in such a way that identifiers can >>>>>>> be >>>>>>> resolved on the first try. >>>>>>> >>>>>>> Benny >>>>>>> >>>>>>> On Tue, Apr 29, 2025 at 10:29 PM Walaa Eldin Moustafa < >>>>>>> wa.moust...@gmail.com> wrote: >>>>>>> >>>>>>>> Hi Rishabh, >>>>>>>> >>>>>>>> You're right that the proposal touches on two aspects, and >>>>>>>> resolution rules are one of them. The other aspect is the proposal's >>>>>>>> position that table identifiers should be stored in metadata exactly as >>>>>>>> they appear in the view text (e.g., even if they're two-part or >>>>>>>> partially >>>>>>>> qualified), along with their corresponding UUIDs for validation. This >>>>>>>> applies both to referenced input tables and the storage table >>>>>>>> identifier in >>>>>>>> materialized views. >>>>>>>> >>>>>>>> We may be able to converge on this storage format even if we >>>>>>>> haven't yet converged on the resolution fallback rules. I believe both >>>>>>>> resolution strategies currently being discussed would still lead to >>>>>>>> storing >>>>>>>> identifiers in this way. >>>>>>>> >>>>>>>> I'm supportive of moving forward with consensus on the identifier >>>>>>>> storage format. That said, we may continue to run into questions >>>>>>>> related to >>>>>>>> resolution during implementation. For example: Should the storage table >>>>>>>> identifier follow the same default-catalog and default-namespace >>>>>>>> resolution >>>>>>>> behavior as other table references? >>>>>>>> >>>>>>>> Thanks, >>>>>>>> Walaa. >>>>>>>> >>>>>>>> On Tue, Apr 29, 2025 at 10:07 PM Rishabh Bhatia < >>>>>>>> bhatiarishab...@gmail.com> wrote: >>>>>>>> >>>>>>>>> Hello Walaa, >>>>>>>>> >>>>>>>>> Thanks for starting this discussion. 
>>>>>>>>> >>>>>>>>> I think we should decouple at least the MV Spec from the proposal >>>>>>>>> to change the current behavior of view resolution. >>>>>>>>> >>>>>>>>> We can continue having the discussion if the current view spec >>>>>>>>> needs to be changed or not. Based on the decision at a later point if >>>>>>>>> required we can update the view resolution rule. >>>>>>>>> >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> Rishabh >>>>>>>>> >>>>>>>>> On Mon, Apr 28, 2025 at 3:22 PM Walaa Eldin Moustafa < >>>>>>>>> wa.moust...@gmail.com> wrote: >>>>>>>>> >>>>>>>>>> Correction of typo: both engines seem to set default-catalog to >>>>>>>>>> the view catalog if it is defined, or to null if the view catalog is >>>>>>>>>> not >>>>>>>>>> defined. >>>>>>>>>> >>>>>>>>>> On Mon, Apr 28, 2025 at 3:06 PM Walaa Eldin Moustafa < >>>>>>>>>> wa.moust...@gmail.com> wrote: >>>>>>>>>> >>>>>>>>>>> Hi Dan, >>>>>>>>>>> >>>>>>>>>>> Thanks again for your response. >>>>>>>>>>> >>>>>>>>>>> I agree that catalog renaming is an environmental event, but >>>>>>>>>>> it's a real one that happens frequently in practice. >>>>>>>>>>> Saying that the Iceberg spec cannot accommodate something as >>>>>>>>>>> common as catalog renaming feels very restrictive, and could make >>>>>>>>>>> the spec >>>>>>>>>>> less practical, even unusable, for real-world deployments. >>>>>>>>>>> I’m sharing this from the perspective of a large data lake >>>>>>>>>>> environment where views are heavily deployed and operationalized. >>>>>>>>>>> >>>>>>>>>>> Further, it's worth noting that the table spec is resilient to >>>>>>>>>>> catalog renaming, but the view spec is not. If we have an >>>>>>>>>>> opportunity to >>>>>>>>>>> make the view spec similarly resilient, I wonder why not? >>>>>>>>>>> Both specifications are deterministic in their definition, but >>>>>>>>>>> one is more fragile to environmental changes than the other. >>>>>>>>>>> Improving >>>>>>>>>>> resilience does not sacrifice determinism. 
It simply makes views >>>>>>>>>>> safer and >>>>>>>>>>> more portable over time. >>>>>>>>>>> >>>>>>>>>>> Separately, given that there is no SQL construct today to >>>>>>>>>>> explicitly set default-catalog at creation time, what is the >>>>>>>>>>> intuition >>>>>>>>>>> behind how engines like Spark and Trino currently assign >>>>>>>>>>> default-catalog? >>>>>>>>>>> Today, both engines seem to set default-catalog to null if the >>>>>>>>>>> view catalog is defined, or to the view catalog if not. >>>>>>>>>>> What was the intended thought process behind this behavior? >>>>>>>>>>> >>>>>>>>>>> Thanks, >>>>>>>>>>> Walaa >>>>>>>>>>> >>>>>>>>>>> On Mon, Apr 28, 2025 at 1:33 PM Daniel Weeks <dwe...@apache.org> >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> Walaa, >>>>>>>>>>>> >>>>>>>>>>>> > tables inside views remain reachable after a catalog rename >>>>>>>>>>>> >>>>>>>>>>>> This problem stems from the exact environmental/configuration >>>>>>>>>>>> issue that we should not be trying to address. I don't think we >>>>>>>>>>>> would >>>>>>>>>>>> expect references to survive a catalog rename. That's not >>>>>>>>>>>> something >>>>>>>>>>>> covered by the spec and needs to be handled separately as a >>>>>>>>>>>> platform-level >>>>>>>>>>>> migration specific to the affected environment. >>>>>>>>>>>> >>>>>>>>>>>> The identifier resolution logic is clear and deterministic. It >>>>>>>>>>>> should not matter whether an engine resolves and encodes the >>>>>>>>>>>> default-catalog or leaves it to the resolution rules. >>>>>>>>>>>> >>>>>>>>>>>> The issue isn't with how the spec is defined, but rather view >>>>>>>>>>>> behavior when you start altering the environment around it, which >>>>>>>>>>>> isn't >>>>>>>>>>>> something we should be trying to define here. 
>>>>>>>>>>>> >>>>>>>>>>>> -Dan >>>>>>>>>>>> >>>>>>>>>>>> On Mon, Apr 28, 2025 at 12:17 PM Walaa Eldin Moustafa < >>>>>>>>>>>> wa.moust...@gmail.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hi Dan, >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks for chiming in. >>>>>>>>>>>>> >>>>>>>>>>>>> I believe the issues we’re seeing now go beyond just catalog >>>>>>>>>>>>> naming consistency. The behavior around default-catalog itself >>>>>>>>>>>>> introduces >>>>>>>>>>>>> resolution inconsistencies even when catalog names are consistent. >>>>>>>>>>>>> For example: >>>>>>>>>>>>> >>>>>>>>>>>>> * When default-catalog is set to null, tables inside views >>>>>>>>>>>>> remain reachable after a catalog rename. But if it is set to a >>>>>>>>>>>>> non-null >>>>>>>>>>>>> value, table references will break. >>>>>>>>>>>>> >>>>>>>>>>>>> * default-catalog causes table references inside views to be >>>>>>>>>>>>> early bound (i.e., bound at view creation time, especially when >>>>>>>>>>>>> using a >>>>>>>>>>>>> non-null value), while table references inside standalone queries >>>>>>>>>>>>> are late >>>>>>>>>>>>> bound (bound at query time). This creates inconsistencies when >>>>>>>>>>>>> resolving >>>>>>>>>>>>> the same table name inside and outside views, even within the >>>>>>>>>>>>> same job. >>>>>>>>>>>>> >>>>>>>>>>>>> * It causes Spark's and Trino's behavior to drift from the spec. >>>>>>>>>>>>> There is no way to fully align Spark's behavior without making >>>>>>>>>>>>> invasive >>>>>>>>>>>>> changes to the Spark SQL grammar and the View DataSource API >>>>>>>>>>>>> (specifically >>>>>>>>>>>>> on the CREATE side). This challenge would extend to other engines >>>>>>>>>>>>> too. Both >>>>>>>>>>>>> Spark and Trino set this field based on a heuristic in today's >>>>>>>>>>>>> implementation. 
>>>>>>>>>>>>> >>>>>>>>>>>>> * With view nesting (views depending on views), these >>>>>>>>>>>>> inconsistencies amplify further, forcing users and engines to >>>>>>>>>>>>> reason about >>>>>>>>>>>>> catalog resolution at every level in the view tree. >>>>>>>>>>>>> >>>>>>>>>>>>> * It will be difficult to migrate Hive views to Iceberg with >>>>>>>>>>>>> that model. Migrated Hive views will have to unfollow that spec. >>>>>>>>>>>>> >>>>>>>>>>>>> How would you suggest approaching the engine-level changes >>>>>>>>>>>>> required to support the current default-catalog field? >>>>>>>>>>>>> Also, do you believe the Spark and Trino communities would >>>>>>>>>>>>> align around having table resolution behave inconsistently >>>>>>>>>>>>> between queries >>>>>>>>>>>>> and views, or inconsistency between Iceberg and other types of >>>>>>>>>>>>> views? >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks, >>>>>>>>>>>>> Walaa >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Mon, Apr 28, 2025 at 11:34 AM Daniel Weeks < >>>>>>>>>>>>> dwe...@apache.org> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> I would agree with Jan's summary of why 'default-catalog' was >>>>>>>>>>>>>> introduced, but I think we need to step back and align on what >>>>>>>>>>>>>> we are >>>>>>>>>>>>>> really attempting to support in the spec. >>>>>>>>>>>>>> >>>>>>>>>>>>>> The issues we're discussing largely stem from using multiple >>>>>>>>>>>>>> engines with cross catalog references and configurations where >>>>>>>>>>>>>> catalog >>>>>>>>>>>>>> names are not aligned. If we have multiple engines that all >>>>>>>>>>>>>> have the same >>>>>>>>>>>>>> catalog names/configurations, the current spec implementation is >>>>>>>>>>>>>> well >>>>>>>>>>>>>> defined for table resolution even across catalogs. The >>>>>>>>>>>>>> 'default-catalog' >>>>>>>>>>>>>> (and namespace equivalent) was intended to address the >>>>>>>>>>>>>> resolution within >>>>>>>>>>>>>> the context of the sql text, not to address catalog/naming >>>>>>>>>>>>>> inconsistencies. 
>>>>>>>>>>>>>> >>>>>>>>>>>>>> I feel like we're trying to adapt the original intent to >>>>>>>>>>>>>> address the catalog naming/configuration and would argue that we >>>>>>>>>>>>>> shouldn't >>>>>>>>>>>>>> attempt to do that as part of the spec. Inconsistently named >>>>>>>>>>>>>> catalogs are >>>>>>>>>>>>>> a reality, but we should consider that a >>>>>>>>>>>>>> configuration/environmental issue, >>>>>>>>>>>>>> not something to solve for in the spec. >>>>>>>>>>>>>> >>>>>>>>>>>>>> We should support and advocate for consistency in catalog >>>>>>>>>>>>>> naming and define the spec along those lines. The fact is that >>>>>>>>>>>>>> with all of >>>>>>>>>>>>>> the recent work that's gone into making catalogs pluggable, it >>>>>>>>>>>>>> makes more >>>>>>>>>>>>>> sense to just register catalog configuration with consistent >>>>>>>>>>>>>> names (even if >>>>>>>>>>>>>> you have to duplicate the configuration for supporting existing >>>>>>>>>>>>>> readers/writers). I think it's better to provide a path toward >>>>>>>>>>>>>> consistency >>>>>>>>>>>>>> than to normalize complicated schemes to workaround the issues >>>>>>>>>>>>>> caused by >>>>>>>>>>>>>> environmental/configuration inconsistencies. >>>>>>>>>>>>>> >>>>>>>>>>>>>> If the goal is to create clever ways to hack the late binding >>>>>>>>>>>>>> resolution to swap in different catalogs or make references >>>>>>>>>>>>>> contextual, I >>>>>>>>>>>>>> feel like that is something we should strongly discourage as it >>>>>>>>>>>>>> leads to >>>>>>>>>>>>>> confusion about what is resolved as part of the query. >>>>>>>>>>>>>> >>>>>>>>>>>>>> At this point, I don't see a good argument to add >>>>>>>>>>>>>> additional configuration or change the resolution behaviors. 
>>>>>>>>>>>>>> >>>>>>>>>>>>>> -Dan >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Mon, Apr 28, 2025 at 12:40 AM Jan Kaul >>>>>>>>>>>>>> <jank...@mailbox.org.invalid> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> I think the intention with the "default-catalog" was that >>>>>>>>>>>>>>> every query engine uses it to store its session default catalog >>>>>>>>>>>>>>> at the time >>>>>>>>>>>>>>> of creating the view. This way the view could be reused in >>>>>>>>>>>>>>> another session. >>>>>>>>>>>>>>> The idea was not to introduce an additional SQL syntax to set >>>>>>>>>>>>>>> the >>>>>>>>>>>>>>> default-catalog. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Generally we have different environments we want to support >>>>>>>>>>>>>>> with the view spec: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 1. Consistent catalog naming >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> When the environment supports it, using consistent catalog >>>>>>>>>>>>>>> names can have a great benefit for multi-catalog, multi-engine >>>>>>>>>>>>>>> setups. With >>>>>>>>>>>>>>> consistent catalog names, using the "default-catalog" field >>>>>>>>>>>>>>> works without >>>>>>>>>>>>>>> any issues. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2. Inconsistent catalog naming >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> This can be the case when different query engines refer to >>>>>>>>>>>>>>> the same physical catalog by different names. This often >>>>>>>>>>>>>>> happens because >>>>>>>>>>>>>>> different query engines use different strategies to setup the >>>>>>>>>>>>>>> catalogs. If >>>>>>>>>>>>>>> catalogs have inconsistent naming, using the "default-catalog" >>>>>>>>>>>>>>> field does >>>>>>>>>>>>>>> not work because it is not guaranteed that the catalog name can >>>>>>>>>>>>>>> be resolved >>>>>>>>>>>>>>> with another engine. Using the "view catalog" as a fallback is >>>>>>>>>>>>>>> a better >>>>>>>>>>>>>>> solution for this use case, as it avoids catalog names >>>>>>>>>>>>>>> altogether. 
It is >>>>>>>>>>>>>>> however limited to table references in the same catalog. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> What do you think of introducing a view property that >>>>>>>>>>>>>>> specifies if the "default-catalog" or the "view catalog" should >>>>>>>>>>>>>>> be used? >>>>>>>>>>>>>>> This way, you could use the "default-catalog" in environments >>>>>>>>>>>>>>> where you can >>>>>>>>>>>>>>> guarantee consistent naming, but you would be able to directly >>>>>>>>>>>>>>> fallback to >>>>>>>>>>>>>>> the "view-catalog" when you don't have consistent naming. The >>>>>>>>>>>>>>> query engines >>>>>>>>>>>>>>> could set the default for this view property at creation time. >>>>>>>>>>>>>>> Spark for >>>>>>>>>>>>>>> example could set it to automatically use the "view catalog". >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Jan >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On 4/26/25 05:33, Walaa Eldin Moustafa wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> To help folks catch up on the latest discussions and >>>>>>>>>>>>>>> interpretation of the spec, I have summarized everything we >>>>>>>>>>>>>>> discussed so >>>>>>>>>>>>>>> far at the top of the proposal document (here >>>>>>>>>>>>>>> <https://docs.google.com/document/d/1-I2v_OqBgJi_8HVaeH1u2jowghmXoB8XaJLzPBa_Hg8/edit?tab=t.0>). >>>>>>>>>>>>>>> I have slightly updated the proposal to be in sync with the new >>>>>>>>>>>>>>> interpretation to avoid confusion. In summary: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> * Remove default-catalog and default-namespace fields from >>>>>>>>>>>>>>> the view spec completely. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> * Hence, we do not attempt to define separate view-level >>>>>>>>>>>>>>> default catalogs or namespaces. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Instead: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> * If a table identifier inside a view lacks a catalog >>>>>>>>>>>>>>> qualifier, engines should resolve it using the current engine >>>>>>>>>>>>>>> catalog at >>>>>>>>>>>>>>> query time. 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> * Reference table identifiers in the metadata exactly as >>>>>>>>>>>>>>> they appear in the view SQL text. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> * If an identifier lacks the catalog part at creation, it >>>>>>>>>>>>>>> should still lack a catalog in the stored metadata. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> * Store UUIDs alongside table identifiers whenever possible. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>> Walaa. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Fri, Apr 25, 2025 at 5:18 PM Walaa Eldin Moustafa < >>>>>>>>>>>>>>> wa.moust...@gmail.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thanks for the contribution Benny! +1 to the confusion the >>>>>>>>>>>>>>>> fallback creates. Also just to be clear, at this point and >>>>>>>>>>>>>>>> after clarifying >>>>>>>>>>>>>>>> the current spec intentions, I am convinced that we should >>>>>>>>>>>>>>>> remove the >>>>>>>>>>>>>>>> default catalog and default namespace fields altogether. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>> Walaa. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Fri, Apr 25, 2025 at 5:13 PM Benny Chow < >>>>>>>>>>>>>>>> btc...@gmail.com> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I'd like to contribute my opinions on this: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> - I don't particularly like the current behavior of >>>>>>>>>>>>>>>>> "default to the view's catalog when default-catalog is not >>>>>>>>>>>>>>>>> set". >>>>>>>>>>>>>>>>> Fundamentally, I believe the intent of default-catalog and >>>>>>>>>>>>>>>>> default-namespace is there to help users write more concise >>>>>>>>>>>>>>>>> SQL. >>>>>>>>>>>>>>>>> - spark session catalog is engine specific and I don't >>>>>>>>>>>>>>>>> think we should design something that says first use this >>>>>>>>>>>>>>>>> catalog, then >>>>>>>>>>>>>>>>> that catalog.. or that catalog. 
For example, resolving >>>>>>>>>>>>>>>>> identifiers using >>>>>>>>>>>>>>>>> default-catalog -> view's catalog -> session catalog is not >>>>>>>>>>>>>>>>> good. >>>>>>>>>>>>>>>>> - We gotta support non-Iceberg tables otherwise I see no >>>>>>>>>>>>>>>>> value in putting views in the catalog to share with other >>>>>>>>>>>>>>>>> engines >>>>>>>>>>>>>>>>> - Interoperability between different engine types is very >>>>>>>>>>>>>>>>> hard due to dialect issues... so I think we should focus on >>>>>>>>>>>>>>>>> supporting >>>>>>>>>>>>>>>>> different clusters of the same engine type on a shared >>>>>>>>>>>>>>>>> catalog. For >>>>>>>>>>>>>>>>> example, AI and BI clusters on Spark sharing the same views >>>>>>>>>>>>>>>>> in a REST >>>>>>>>>>>>>>>>> catalog. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Coincidentally, I think the ultimate solution is along the >>>>>>>>>>>>>>>>> lines of something Russell proposed last year: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> https://lists.apache.org/thread/hoskfx8y3kvrcww52l4w9dxghp3pnlm7 >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> We've been looking at this interoperable identifier >>>>>>>>>>>>>>>>> problem through the lens of catalog resolution but maybe the >>>>>>>>>>>>>>>>> right approach >>>>>>>>>>>>>>>>> is really about templating. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I would extend Russell's idea to allow identifiers in a >>>>>>>>>>>>>>>>> view to span catalogs to support non-Iceberg tables. Also, >>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>> default-catalog property could be templated as well. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Thoughts? >>>>>>>>>>>>>>>>> Benny >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Fri, Apr 25, 2025 at 4:02 PM Walaa Eldin Moustafa < >>>>>>>>>>>>>>>>> wa.moust...@gmail.com> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Thanks Steven! How do you recommend making Spark >>>>>>>>>>>>>>>>>> implementation conform to the spec? 
Do we need Spark SQL >>>>>>>>>>>>>>>>>> extensions and/or >>>>>>>>>>>>>>>>>> Spark catalog APIs for that? >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> How do you recommend reconciling the inconsistencies I >>>>>>>>>>>>>>>>>> shared regarding many resolution methods not consistently >>>>>>>>>>>>>>>>>> being followed in >>>>>>>>>>>>>>>>>> different scenarios (view vs child table resolution, query >>>>>>>>>>>>>>>>>> vs view >>>>>>>>>>>>>>>>>> resolution)? Note these occur when the default catalog is >>>>>>>>>>>>>>>>>> set to a non-null >>>>>>>>>>>>>>>>>> value. If it helps, I can share concrete examples. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>> Walaa. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Fri, Apr 25, 2025 at 3:52 PM Steven Wu < >>>>>>>>>>>>>>>>>> stevenz...@gmail.com> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> The core issue is the fallback behavior when >>>>>>>>>>>>>>>>>>> `default-catalog` is >>>>>>>>>>>>>>>>>>> not defined. Current view spec says the fallback should >>>>>>>>>>>>>>>>>>> be the catalog >>>>>>>>>>>>>>>>>>> where the view is defined. It doesn't really matter what >>>>>>>>>>>>>>>>>>> the catalog >>>>>>>>>>>>>>>>>>> is named (catalogX) by the read engine. >>>>>>>>>>>>>>>>>>> - If a view refers to the tables in the same catalog, >>>>>>>>>>>>>>>>>>> this is a >>>>>>>>>>>>>>>>>>> non-ambiguous and reasonable fallback behavior. >>>>>>>>>>>>>>>>>>> - If a view refers to tables from another catalog, >>>>>>>>>>>>>>>>>>> catalog names >>>>>>>>>>>>>>>>>>> should be included in the reference name already. So no >>>>>>>>>>>>>>>>>>> ambiguity >>>>>>>>>>>>>>>>>>> there either. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Potential inconsistent naming of catalog is a separate >>>>>>>>>>>>>>>>>>> problem, which >>>>>>>>>>>>>>>>>>> Iceberg view spec probably cannot solve. 
We can only >>>>>>>>>>>>>>>>>>> recommend that >>>>>>>>>>>>>>>>>>> catalog should be named consistently across usage for >>>>>>>>>>>>>>>>>>> better >>>>>>>>>>>>>>>>>>> interoperability on name references. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> This proposal is to change the fallback behavior to >>>>>>>>>>>>>>>>>>> engine's session >>>>>>>>>>>>>>>>>>> default catalog. I am not sure it is better than the >>>>>>>>>>>>>>>>>>> current fallback >>>>>>>>>>>>>>>>>>> behavior. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> > Today’s Spark behavior explicitly differs from this >>>>>>>>>>>>>>>>>>> idea. Spark resolves table identifiers during view creation >>>>>>>>>>>>>>>>>>> using the >>>>>>>>>>>>>>>>>>> session’s default catalog, not a supplied `default-catalog`. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> I would argue that is a Spark implementation issue for >>>>>>>>>>>>>>>>>>> not conforming >>>>>>>>>>>>>>>>>>> to the spec. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On Fri, Apr 25, 2025 at 1:17 PM Walaa Eldin Moustafa >>>>>>>>>>>>>>>>>>> <wa.moust...@gmail.com> wrote: >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> > Hi Jan, >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> > Thanks again for continuing the discussion. I want to >>>>>>>>>>>>>>>>>>> highlight a few fundamental issues around the >>>>>>>>>>>>>>>>>>> interpretation of >>>>>>>>>>>>>>>>>>> default-catalog: >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> > Here is the real catch: >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> > * default-catalog cannot logically be defined at view >>>>>>>>>>>>>>>>>>> creation time. It would be circular: the view needs to >>>>>>>>>>>>>>>>>>> exist before its >>>>>>>>>>>>>>>>>>> metadata (and hence default-catalog) can exist. This is >>>>>>>>>>>>>>>>>>> visible in Spark’s >>>>>>>>>>>>>>>>>>> implementation, where `default-catalog` is not used. 
>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> > * Introducing a creation-time default-catalog setting >>>>>>>>>>>>>>>>>>> would require extending SQL syntax and engine APIs to >>>>>>>>>>>>>>>>>>> promote it to a >>>>>>>>>>>>>>>>>>> first-class view concept. This would be intrusive, >>>>>>>>>>>>>>>>>>> non-intuitive, and >>>>>>>>>>>>>>>>>>> realistically very difficult to standardize across engines. >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> > * Today’s Spark behavior explicitly differs from this >>>>>>>>>>>>>>>>>>> idea. Spark resolves table identifiers during view creation >>>>>>>>>>>>>>>>>>> using the >>>>>>>>>>>>>>>>>>> session’s default catalog, not a supplied `default-catalog`. >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> > * Hypothetically even if we patched in a creation-time >>>>>>>>>>>>>>>>>>> default-catalog, it would create an inconsistent binding >>>>>>>>>>>>>>>>>>> model between >>>>>>>>>>>>>>>>>>> tables vs views (early vs late), and between tables in >>>>>>>>>>>>>>>>>>> views and in queries >>>>>>>>>>>>>>>>>>> (again early vs late). For example, views and tables in >>>>>>>>>>>>>>>>>>> queries can >>>>>>>>>>>>>>>>>>> withstand default catalog renames, but tables cannot when >>>>>>>>>>>>>>>>>>> they are used >>>>>>>>>>>>>>>>>>> inside views -- it even applies to views inside views, >>>>>>>>>>>>>>>>>>> which makes this >>>>>>>>>>>>>>>>>>> very hard to reason about considering nesting. >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> > Thanks, >>>>>>>>>>>>>>>>>>> > Walaa >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> > On Fri, Apr 25, 2025 at 7:00 AM Jan Kaul >>>>>>>>>>>>>>>>>>> <jank...@mailbox.org.invalid> >>>>>>>>>>>>>>>>>>> <jank...@mailbox.org.invalid> wrote: >>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>> >> @Walaa: >>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>> >> I would argue that when you run a CREATE VIEW >>>>>>>>>>>>>>>>>>> statement the query engine knows which catalog the view is >>>>>>>>>>>>>>>>>>> being created >>>>>>>>>>>>>>>>>>> in. 
So even though we typically use late binding to resolve >>>>>>>>>>>>>>>>>>> the view >>>>>>>>>>>>>>>>>>> catalog at query time, it can also be used at creation time. >>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>> >> The query engine would need to keep track of the >>>>>>>>>>>>>>>>>>> "view catalog" where the view is going to be created in. It >>>>>>>>>>>>>>>>>>> can use that >>>>>>>>>>>>>>>>>>> catalog to resolve partial table identifiers if >>>>>>>>>>>>>>>>>>> "default-catalog" is not >>>>>>>>>>>>>>>>>>> set. >>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>> >> It can lead to some unintuitive behavior, where >>>>>>>>>>>>>>>>>>> partial identifiers in the view query resolve to a >>>>>>>>>>>>>>>>>>> different catalog >>>>>>>>>>>>>>>>>>> compared to using them outside of a view. >>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>> >> CREATE VIEW catalogA.sales.monthly_orders AS SELECT * >>>>>>>>>>>>>>>>>>> from sales.orders; >>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>> >> If the session default catalog is not "catalogA", the >>>>>>>>>>>>>>>>>>> "sales.orders" in the view query would not be the same as >>>>>>>>>>>>>>>>>>> just referencing >>>>>>>>>>>>>>>>>>> "sales.orders" in a normal SQL statement. This is because >>>>>>>>>>>>>>>>>>> without a >>>>>>>>>>>>>>>>>>> "default-catalog", the catalog name of "sales.orders" would >>>>>>>>>>>>>>>>>>> default to >>>>>>>>>>>>>>>>>>> "catalogA", which is the view's catalog. >>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>> >> Thanks, >>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>> >> Jan >>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>> >> On 4/25/25 04:05, Manu Zhang wrote: >>>>>>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>>>>>> >>> For example, if we want to validate that the tables >>>>>>>>>>>>>>>>>>> referenced in the view exist, how can we do that when >>>>>>>>>>>>>>>>>>> default-catalog isn't >>>>>>>>>>>>>>>>>>> defined, since the view hasn't been created or loaded yet? >>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>> >> I don't think this is related to view spec. 
How do we >>>>>>>>>>>>>>>>>>> validate that a table exists without a default catalog, or >>>>>>>>>>>>>>>>>>> do we always use >>>>>>>>>>>>>>>>>>> the current session catalog? >>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>> >> Thanks, >>>>>>>>>>>>>>>>>>> >> Manu >>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>> >> On Fri, Apr 25, 2025 at 5:59 AM Walaa Eldin Moustafa < >>>>>>>>>>>>>>>>>>> wa.moust...@gmail.com> wrote: >>>>>>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>>>>>> >>> Hi Jan, >>>>>>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>>>>>> >>> I think we still share the same understanding. Just >>>>>>>>>>>>>>>>>>> to clarify: when I referred to late binding as “similar” to >>>>>>>>>>>>>>>>>>> the proposal, I >>>>>>>>>>>>>>>>>>> was acknowledging the distinction between view-level and >>>>>>>>>>>>>>>>>>> table-level >>>>>>>>>>>>>>>>>>> resolution. But as you noted, both follow a late binding >>>>>>>>>>>>>>>>>>> model. >>>>>>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>>>>>> >>> That said, this still raises an interesting question >>>>>>>>>>>>>>>>>>> and a potential gap: if default-catalog is only defined at >>>>>>>>>>>>>>>>>>> query time, how >>>>>>>>>>>>>>>>>>> should resolution work during view creation? For example, >>>>>>>>>>>>>>>>>>> if we want to >>>>>>>>>>>>>>>>>>> validate that the tables referenced in the view exist, how >>>>>>>>>>>>>>>>>>> can we do that >>>>>>>>>>>>>>>>>>> when default-catalog isn't defined, since the view hasn't >>>>>>>>>>>>>>>>>>> been created or >>>>>>>>>>>>>>>>>>> loaded yet? >>>>>>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>>>>>> >>> Thanks, >>>>>>>>>>>>>>>>>>> >>> Walaa. >>>>>>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>>>>>> >>> On Thu, Apr 24, 2025 at 7:02 AM Jan Kaul >>>>>>>>>>>>>>>>>>> <jank...@mailbox.org.invalid> >>>>>>>>>>>>>>>>>>> <jank...@mailbox.org.invalid> wrote: >>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>> >>>> Yes, I have the same understanding. The view >>>>>>>>>>>>>>>>>>> catalog is resolved at query time. 
>>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>> >>>> As you mentioned before, it's good to distinguish >>>>>>>>>>>>>>>>>>> between the physical catalog and its reference used in SQL >>>>>>>>>>>>>>>>>>> statements. The >>>>>>>>>>>>>>>>>>> important part is that the physical catalog of the view and >>>>>>>>>>>>>>>>>>> the tables >>>>>>>>>>>>>>>>>>> referenced in its definition stay consistent. You could >>>>>>>>>>>>>>>>>>> create a view in a >>>>>>>>>>>>>>>>>>> given physical catalog by referring to it as "catalogA", as >>>>>>>>>>>>>>>>>>> in your first >>>>>>>>>>>>>>>>>>> point. If you then, given a different setup, refer to the >>>>>>>>>>>>>>>>>>> same physical >>>>>>>>>>>>>>>>>>> catalog as "catalogB" in another session/environment, the >>>>>>>>>>>>>>>>>>> behavior should >>>>>>>>>>>>>>>>>>> still work. >>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>> >>>> I would however rephrase your last point. Late >>>>>>>>>>>>>>>>>>> binding applies to the view catalog name and by extension >>>>>>>>>>>>>>>>>>> to all partial >>>>>>>>>>>>>>>>>>> table references when no "default-catalog" is present. >>>>>>>>>>>>>>>>>>> Resolving the view >>>>>>>>>>>>>>>>>>> catalog name at query time is not opposed to storing the >>>>>>>>>>>>>>>>>>> view metadata in a >>>>>>>>>>>>>>>>>>> catalog. >>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>> >>>> Or maybe I don't entirely understand what you mean. >>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>> >>>> Thanks >>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>> >>>> Jan >>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>> >>>> On 4/24/25 00:32, Walaa Eldin Moustafa wrote: >>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>> >>>> Hi Jan, >>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>> >>>> > The view is executed when it's being referenced >>>>>>>>>>>>>>>>>>> in a SQL statement. That statement contains the information >>>>>>>>>>>>>>>>>>> for the query >>>>>>>>>>>>>>>>>>> engine to resolve the catalog of the view. 
>>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>> >>>> If I’m understanding correctly, that means: >>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>> >>>> * If the view is queried as SELECT * FROM >>>>>>>>>>>>>>>>>>> catalogA.namespace.view, then catalogA is considered the >>>>>>>>>>>>>>>>>>> view’s catalog. >>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>> >>>> * If the same view is later queried as SELECT * >>>>>>>>>>>>>>>>>>> FROM catalogB.namespace.view (after renaming catalogA to >>>>>>>>>>>>>>>>>>> catalogB, and >>>>>>>>>>>>>>>>>>> keeping everything else the same), then catalogB becomes >>>>>>>>>>>>>>>>>>> the view’s catalog. >>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>> >>>> Is that interpretation correct? If so, it sounds to >>>>>>>>>>>>>>>>>>> me like the catalog is resolved at query time, based on how >>>>>>>>>>>>>>>>>>> the view is >>>>>>>>>>>>>>>>>>> referenced, not from any stored metadata. That would imply >>>>>>>>>>>>>>>>>>> some sort of a >>>>>>>>>>>>>>>>>>> late binding behavior (similar to the proposal), as opposed >>>>>>>>>>>>>>>>>>> to using some >>>>>>>>>>>>>>>>>>> catalog that "stores" the view definition. >>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>> >>>> Thanks, >>>>>>>>>>>>>>>>>>> >>>> Walaa >>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>> >>>> On Tue, Apr 22, 2025 at 11:01 AM Jan Kaul >>>>>>>>>>>>>>>>>>> <jank...@mailbox.org.invalid> >>>>>>>>>>>>>>>>>>> <jank...@mailbox.org.invalid> wrote: >>>>>>>>>>>>>>>>>>> >>>>> >>>>>>>>>>>>>>>>>>> >>>>> Hi Walaa, >>>>>>>>>>>>>>>>>>> >>>>> >>>>>>>>>>>>>>>>>>> >>>>> Thanks for clarifying the aspects of >>>>>>>>>>>>>>>>>>> non-determinism. Let me try to address your questions. >>>>>>>>>>>>>>>>>>> >>>>> >>>>>>>>>>>>>>>>>>> >>>>> 1. This is my interpretation of the current spec: >>>>>>>>>>>>>>>>>>> The view is executed when it's being referenced in a SQL >>>>>>>>>>>>>>>>>>> statement. 
That >>>>>>>>>>>>>>>>>>> statement contains the information for the query engine to >>>>>>>>>>>>>>>>>>> resolve the >>>>>>>>>>>>>>>>>>> catalog of the view. The query engine then uses that >>>>>>>>>>>>>>>>>>> information to fetch >>>>>>>>>>>>>>>>>>> the view metadata from the catalog. It also needs to >>>>>>>>>>>>>>>>>>> temporarily keep track >>>>>>>>>>>>>>>>>>> of which catalog it used to fetch the view metadata. It can >>>>>>>>>>>>>>>>>>> then use that >>>>>>>>>>>>>>>>>>> information to resolve the table references in the view's >>>>>>>>>>>>>>>>>>> SQL definition in >>>>>>>>>>>>>>>>>>> case no default catalog is specified. >>>>>>>>>>>>>>>>>>> >>>>> >>>>>>>>>>>>>>>>>>> >>>>> 2. The important part is that the catalog can be >>>>>>>>>>>>>>>>>>> referenced at execution time. As long as that's the case I >>>>>>>>>>>>>>>>>>> would assume the >>>>>>>>>>>>>>>>>>> view can be created in any catalog. >>>>>>>>>>>>>>>>>>> >>>>> >>>>>>>>>>>>>>>>>>> >>>>> >>>>>>>>>>>>>>>>>>> >>>>> I think your point is really valuable because the >>>>>>>>>>>>>>>>>>> current specification can lead to some unintuitive >>>>>>>>>>>>>>>>>>> behavior. For example >>>>>>>>>>>>>>>>>>> for the following statement: >>>>>>>>>>>>>>>>>>> >>>>> >>>>>>>>>>>>>>>>>>> >>>>> CREATE VIEW catalogA.sales.monthly_orders AS >>>>>>>>>>>>>>>>>>> SELECT * from sales.orders; >>>>>>>>>>>>>>>>>>> >>>>> >>>>>>>>>>>>>>>>>>> >>>>> If the session default catalog is not "catalogA", >>>>>>>>>>>>>>>>>>> the "sales.orders" in the view query would not be the same >>>>>>>>>>>>>>>>>>> as just >>>>>>>>>>>>>>>>>>> referencing "sales.orders" in a normal SQL statement. This >>>>>>>>>>>>>>>>>>> is because >>>>>>>>>>>>>>>>>>> without a "default-catalog", the catalog name of >>>>>>>>>>>>>>>>>>> "sales.orders" would >>>>>>>>>>>>>>>>>>> default to "catalogA". 
>>>>>>>>>>>>>>>>>>> >>>>> >>>>>>>>>>>>>>>>>>> >>>>> >>>>>>>>>>>>>>>>>>> >>>>> However, I like the current design of the view >>>>>>>>>>>>>>>>>>> spec, because it has the "closure" property. Because of the >>>>>>>>>>>>>>>>>>> fact that the >>>>>>>>>>>>>>>>>>> "view catalog" has to be known when executing a view, all >>>>>>>>>>>>>>>>>>> the information >>>>>>>>>>>>>>>>>>> required to resolve the table identifiers is contained in >>>>>>>>>>>>>>>>>>> the view metadata >>>>>>>>>>>>>>>>>>> (and the "view catalog"). I think that if you make the >>>>>>>>>>>>>>>>>>> identifier >>>>>>>>>>>>>>>>>>> resolution dependent on external parameters, it hinders >>>>>>>>>>>>>>>>>>> portability. >>>>>>>>>>>>>>>>>>> >>>>> >>>>>>>>>>>>>>>>>>> >>>>> Thanks, >>>>>>>>>>>>>>>>>>> >>>>> >>>>>>>>>>>>>>>>>>> >>>>> Jan >>>>>>>>>>>>>>>>>>> >>>>> >>>>>>>>>>>>>>>>>>> >>>>> On 4/22/25 18:36, Walaa Eldin Moustafa wrote: >>>>>>>>>>>>>>>>>>> >>>>> >>>>>>>>>>>>>>>>>>> >>>>> Hi Jan, >>>>>>>>>>>>>>>>>>> >>>>> >>>>>>>>>>>>>>>>>>> >>>>> Thanks for the thoughtful feedback. >>>>>>>>>>>>>>>>>>> >>>>> >>>>>>>>>>>>>>>>>>> >>>>> I think it’s important we clarify a key point >>>>>>>>>>>>>>>>>>> before going deeper: >>>>>>>>>>>>>>>>>>> >>>>> >>>>>>>>>>>>>>>>>>> >>>>> Non-determinism is not caused by session fallback >>>>>>>>>>>>>>>>>>> behavior—it’s a fundamental limitation of using table >>>>>>>>>>>>>>>>>>> identifiers alone, >>>>>>>>>>>>>>>>>>> regardless of whether we use the current rule, the proposed >>>>>>>>>>>>>>>>>>> fallback to the >>>>>>>>>>>>>>>>>>> session’s default catalog, or even early vs. late binding. >>>>>>>>>>>>>>>>>>> >>>>> >>>>>>>>>>>>>>>>>>> >>>>> The same fully qualified identifier (e.g., >>>>>>>>>>>>>>>>>>> catalogA.namespace.table) can resolve to different objects >>>>>>>>>>>>>>>>>>> depending solely >>>>>>>>>>>>>>>>>>> on engine-specific routing logic or catalog aliases. So >>>>>>>>>>>>>>>>>>> determinism isn’t >>>>>>>>>>>>>>>>>>> guaranteed just because an identifier is "fully qualified." 
>>>>>>>>>>>>>>>>>>> The only >>>>>>>>>>>>>>>>>>> reliable anchor for identity is the UUID. That’s why the >>>>>>>>>>>>>>>>>>> proposed use of >>>>>>>>>>>>>>>>>>> UUIDs is not just a hardening strategy. It’s the actual fix >>>>>>>>>>>>>>>>>>> for correctness. >>>>>>>>>>>>>>>>>>> >>>>> >>>>>>>>>>>>>>>>>>> >>>>> To move the conversation forward, could you help >>>>>>>>>>>>>>>>>>> clarify two things in the context of the current spec: >>>>>>>>>>>>>>>>>>> >>>>> >>>>>>>>>>>>>>>>>>> >>>>> * Where in the metadata is the “view catalog” >>>>>>>>>>>>>>>>>>> stored, so that an engine knows to fall back to it if >>>>>>>>>>>>>>>>>>> default-catalog is >>>>>>>>>>>>>>>>>>> null? >>>>>>>>>>>>>>>>>>> >>>>> >>>>>>>>>>>>>>>>>>> >>>>> * Are we even allowed to create views in the >>>>>>>>>>>>>>>>>>> session's default catalog (i.e., without specifying a >>>>>>>>>>>>>>>>>>> catalog) in the >>>>>>>>>>>>>>>>>>> current Iceberg spec? >>>>>>>>>>>>>>>>>>> >>>>> >>>>>>>>>>>>>>>>>>> >>>>> These questions are important because if we can’t >>>>>>>>>>>>>>>>>>> unambiguously recover the "view catalog" from metadata, >>>>>>>>>>>>>>>>>>> then defaulting to >>>>>>>>>>>>>>>>>>> it is problematic. And if views can't be created in the >>>>>>>>>>>>>>>>>>> default catalog, >>>>>>>>>>>>>>>>>>> then the fallback rule doesn’t generalize. >>>>>>>>>>>>>>>>>>> >>>>> >>>>>>>>>>>>>>>>>>> >>>>> Thanks, >>>>>>>>>>>>>>>>>>> >>>>> Walaa. >>>>>>>>>>>>>>>>>>> >>>>> >>>>>>>>>>>>>>>>>>> >>>>> >>>>>>>>>>>>>>>>>>> >>>>> On Tue, Apr 22, 2025 at 3:14 AM Jan Kaul >>>>>>>>>>>>>>>>>>> <jank...@mailbox.org.invalid> >>>>>>>>>>>>>>>>>>> <jank...@mailbox.org.invalid> wrote: >>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>> >>>>>> Hi Walaa, >>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>> >>>>>> thank you for your proposal. 
If I understood >>>>>>>>>>>>>>>>>>> correctly, your proposal is composed of three parts: >>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>> >>>>>> - session default catalog as fallback for >>>>>>>>>>>>>>>>>>> "default-catalog" >>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>> >>>>>> - session default namespace as fallback for >>>>>>>>>>>>>>>>>>> "default-namespace" >>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>> >>>>>> - Late binding + UUID validation >>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>> >>>>>> I have some comments regarding these points. >>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>> >>>>>> 1. Session default catalog as fallback for >>>>>>>>>>>>>>>>>>> "default-catalog" >>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>> >>>>>> Introducing a behavior that depends on the >>>>>>>>>>>>>>>>>>> current session setup is in my opinion the definition of >>>>>>>>>>>>>>>>>>> "non-determinism". >>>>>>>>>>>>>>>>>>> You could be running the same query-engine and >>>>>>>>>>>>>>>>>>> catalog-setup on different >>>>>>>>>>>>>>>>>>> days, with different default session catalogs (which is >>>>>>>>>>>>>>>>>>> rather common), and >>>>>>>>>>>>>>>>>>> would be getting different results. >>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>> >>>>>> Whereas with the current behavior, the view >>>>>>>>>>>>>>>>>>> always produces the same results. The current behavior has >>>>>>>>>>>>>>>>>>> some rough edges >>>>>>>>>>>>>>>>>>> in very niche use cases but I think is solid for most use >>>>>>>>>>>>>>>>>>> cases. >>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>> >>>>>> 2. Session default namespace as fallback for >>>>>>>>>>>>>>>>>>> "default-namespace" >>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>> >>>>>> Similar to the above. >>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>> >>>>>> 3. Late binding + UUID validation >>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>> >>>>>> If I understand it correctly, the current >>>>>>>>>>>>>>>>>>> implementation already uses late binding. 
>>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>> >>>>>> Generally, having UUID validation makes the setup >>>>>>>>>>>>>>>>>>> more robust. Which is great. However, having UUID >>>>>>>>>>>>>>>>>>> validation still requires >>>>>>>>>>>>>>>>>>> us to have a portable table identifier specification. Even >>>>>>>>>>>>>>>>>>> if we have the >>>>>>>>>>>>>>>>>>> UUIDs of the referenced tables from the view, there simply >>>>>>>>>>>>>>>>>>> isn't an >>>>>>>>>>>>>>>>>>> interface that lets us use those UUIDs. The catalog >>>>>>>>>>>>>>>>>>> interface is defined >>>>>>>>>>>>>>>>>>> in terms of table identifiers. >>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>> >>>>>> So we always require a working catalog setup and >>>>>>>>>>>>>>>>>>> suitable table identifiers to obtain the table metadata. We >>>>>>>>>>>>>>>>>>> can use the >>>>>>>>>>>>>>>>>>> UUIDs to verify if we loaded the correct table. But this >>>>>>>>>>>>>>>>>>> can only be done >>>>>>>>>>>>>>>>>>> after we used some identifier. Which means there is no way >>>>>>>>>>>>>>>>>>> of using UUIDs >>>>>>>>>>>>>>>>>>> without a functioning catalog/identifier setup. >>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>> >>>>>> In conclusion, I prefer the current behavior for >>>>>>>>>>>>>>>>>>> "default-catalog" because it is more deterministic in my >>>>>>>>>>>>>>>>>>> opinion. And I >>>>>>>>>>>>>>>>>>> think the current spec does a good job for multi-engine >>>>>>>>>>>>>>>>>>> table identifier >>>>>>>>>>>>>>>>>>> resolution. I see the UUID validation more of an additional >>>>>>>>>>>>>>>>>>> hardening >>>>>>>>>>>>>>>>>>> strategy. >>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>> >>>>>> Thanks >>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>> >>>>>> Jan >>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>> >>>>>> On 4/21/25 17:38, Walaa Eldin Moustafa wrote: >>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>> >>>>>> Thanks Renjie! 
>>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>> >>>>>> The existing spec has some guidance on resolving >>>>>>>>>>>>>>>>>>> catalogs on the fly already (to address the case of view >>>>>>>>>>>>>>>>>>> text with table >>>>>>>>>>>>>>>>>>> identifiers missing the catalog part). The guidance is to >>>>>>>>>>>>>>>>>>> use the catalog >>>>>>>>>>>>>>>>>>> where the view is stored. But I find this rule hard to >>>>>>>>>>>>>>>>>>> interpret or use. >>>>>>>>>>>>>>>>>>> The catalog itself is a logical construct—such as a >>>>>>>>>>>>>>>>>>> federated catalog that >>>>>>>>>>>>>>>>>>> delegates to multiple physical backends (e.g., HMS and >>>>>>>>>>>>>>>>>>> REST). In such >>>>>>>>>>>>>>>>>>> cases, the catalog (e.g., `my_catalog` in >>>>>>>>>>>>>>>>>>> `my_catalog.namespace1.table1`) >>>>>>>>>>>>>>>>>>> doesn’t physically store the tables; it only routes >>>>>>>>>>>>>>>>>>> requests to underlying >>>>>>>>>>>>>>>>>>> stores. Therefore, defaulting identifier resolution based >>>>>>>>>>>>>>>>>>> on the catalog >>>>>>>>>>>>>>>>>>> where the view is "stored" doesn’t align with how catalogs >>>>>>>>>>>>>>>>>>> actually behave >>>>>>>>>>>>>>>>>>> in practice. >>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>> >>>>>> Thanks, >>>>>>>>>>>>>>>>>>> >>>>>> Walaa. >>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>> >>>>>> On Sun, Apr 20, 2025 at 11:17 PM Renjie Liu < >>>>>>>>>>>>>>>>>>> liurenjie2...@gmail.com> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>> Hi, Walaa: >>>>>>>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>> Thanks for the proposal. >>>>>>>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>> I've reviewed the doc, but in general I have >>>>>>>>>>>>>>>>>>> some concerns with resolving catalog names on the fly with >>>>>>>>>>>>>>>>>>> query engine >>>>>>>>>>>>>>>>>>> defined catalog names. This introduces some flexibility at >>>>>>>>>>>>>>>>>>> first glance, >>>>>>>>>>>>>>>>>>> but also makes misconfiguration difficult to explain. 
>>>>>>>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>> But I agree with one part that we should store >>>>>>>>>>>>>>>>>>> resolved table uuid in view metadata, as table/view >>>>>>>>>>>>>>>>>>> renaming may introduce >>>>>>>>>>>>>>>>>>> errors that are difficult for users to understand. >>>>>>>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>> On Sat, Apr 19, 2025 at 3:02 AM Walaa Eldin >>>>>>>>>>>>>>>>>>> Moustafa <wa.moust...@gmail.com> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>> Hi Everyone, >>>>>>>>>>>>>>>>>>> >>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>> Looking forward to keeping up the momentum and >>>>>>>>>>>>>>>>>>> closing out the MV spec as well. I’m hoping we can proceed >>>>>>>>>>>>>>>>>>> to a vote next >>>>>>>>>>>>>>>>>>> week. >>>>>>>>>>>>>>>>>>> >>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>> Here is a summary in case that helps. The >>>>>>>>>>>>>>>>>>> proposal outlines a strategy for handling table identifiers >>>>>>>>>>>>>>>>>>> in Iceberg view >>>>>>>>>>>>>>>>>>> metadata, with the goal of ensuring correctness, >>>>>>>>>>>>>>>>>>> portability, and engine >>>>>>>>>>>>>>>>>>> compatibility. It recommends resolving table identifiers at >>>>>>>>>>>>>>>>>>> read time (late >>>>>>>>>>>>>>>>>>> binding) rather than creation time, and introduces >>>>>>>>>>>>>>>>>>> UUID-based validation to >>>>>>>>>>>>>>>>>>> maintain identity guarantees across engines, or sessions. >>>>>>>>>>>>>>>>>>> It also revises >>>>>>>>>>>>>>>>>>> how default-catalog and default-namespace are handled >>>>>>>>>>>>>>>>>>> (defaulting both to >>>>>>>>>>>>>>>>>>> the session context if not explicitly set) to better align >>>>>>>>>>>>>>>>>>> with engine >>>>>>>>>>>>>>>>>>> behavior and improve cross-engine interoperability. >>>>>>>>>>>>>>>>>>> >>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>> Please let me know your thoughts. >>>>>>>>>>>>>>>>>>> >>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>> >>>>>>>> Walaa. 
>>>>>>>>>>>>>>>>>>> >>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>> On Wed, Apr 16, 2025 at 2:03 PM Walaa Eldin >>>>>>>>>>>>>>>>>>> Moustafa <wa.moust...@gmail.com> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>> Thanks Eduard and Sung! I have addressed the >>>>>>>>>>>>>>>>>>> comments. >>>>>>>>>>>>>>>>>>> >>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>> One key point to keep in mind is that catalog >>>>>>>>>>>>>>>>>>> names in the spec refer to logical catalogs—i.e., the first >>>>>>>>>>>>>>>>>>> part of a >>>>>>>>>>>>>>>>>>> three-part identifier. These correspond to Spark's >>>>>>>>>>>>>>>>>>> DataSourceV2 catalogs, >>>>>>>>>>>>>>>>>>> Trino connectors, and similar constructs. This is a level >>>>>>>>>>>>>>>>>>> of abstraction >>>>>>>>>>>>>>>>>>> above physical catalogs, which are not referenced or used >>>>>>>>>>>>>>>>>>> in the view spec. >>>>>>>>>>>>>>>>>>> The reason is that table identifiers in the view >>>>>>>>>>>>>>>>>>> definition/text itself >>>>>>>>>>>>>>>>>>> refer to logical catalogs, not physical ones (since they >>>>>>>>>>>>>>>>>>> interface directly >>>>>>>>>>>>>>>>>>> with the engine and not a specific metastore). >>>>>>>>>>>>>>>>>>> >>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>> >>>>>>>>> Walaa. >>>>>>>>>>>>>>>>>>> >>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>> On Wed, Apr 16, 2025 at 6:15 AM Sung Yun < >>>>>>>>>>>>>>>>>>> sungwy...@gmail.com> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>> Thank you Walaa for the proposal. I think >>>>>>>>>>>>>>>>>>> view portability is a very important topic for us to >>>>>>>>>>>>>>>>>>> continue discussing as >>>>>>>>>>>>>>>>>>> it relies on many assumptions within the data ecosystem for >>>>>>>>>>>>>>>>>>> it to function >>>>>>>>>>>>>>>>>>> like you've highlighted well in the document. 
>>>>>>>>>>>>>>>>>>> >>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>> I've added a few comments around how this may >>>>>>>>>>>>>>>>>>> impact the permission questions the engines will be asking, >>>>>>>>>>>>>>>>>>> and whether >>>>>>>>>>>>>>>>>>> that is the desired behavior. >>>>>>>>>>>>>>>>>>> >>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>> Sung >>>>>>>>>>>>>>>>>>> >>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>> On Wed, Apr 16, 2025 at 7:32 AM Eduard >>>>>>>>>>>>>>>>>>> Tudenhöfner <etudenhoef...@apache.org> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>> Thanks Walaa for tackling this problem. I've >>>>>>>>>>>>>>>>>>> added a few comments to get a better understanding of what >>>>>>>>>>>>>>>>>>> this will look >>>>>>>>>>>>>>>>>>> like in the actual implementation. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>> Eduard >>>>>>>>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>> On Tue, Apr 15, 2025 at 7:09 PM Walaa Eldin >>>>>>>>>>>>>>>>>>> Moustafa <wa.moust...@gmail.com> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> Hi Everyone, >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> Starting this thread to resume our >>>>>>>>>>>>>>>>>>> discussion on how to reference table identifiers from >>>>>>>>>>>>>>>>>>> Iceberg metadata, a >>>>>>>>>>>>>>>>>>> key aspect of the view specification, particularly in >>>>>>>>>>>>>>>>>>> relation to the MV >>>>>>>>>>>>>>>>>>> (materialized view) extensions. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> I had the chance to speak offline with a >>>>>>>>>>>>>>>>>>> few community members to better understand how the current >>>>>>>>>>>>>>>>>>> spec is being >>>>>>>>>>>>>>>>>>> interpreted. Those conversations served as inputs to a new >>>>>>>>>>>>>>>>>>> proposal on how >>>>>>>>>>>>>>>>>>> table identifier references could be represented in >>>>>>>>>>>>>>>>>>> metadata. 
>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> You can find the proposal here [1]. I look >>>>>>>>>>>>>>>>>>> forward to your feedback and working together to move this >>>>>>>>>>>>>>>>>>> forward so we >>>>>>>>>>>>>>>>>>> can finalize the MV spec as well. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> [1] >>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1-I2v_OqBgJi_8HVaeH1u2jowghmXoB8XaJLzPBa_Hg8/edit?tab=t.0 >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> Walaa. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>
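For readers following the thread, here is a rough sketch of the two ideas under discussion: falling back to the view's catalog when "default-catalog" is unset (Jan's reading of the current spec), and checking a stored UUID after the table has been loaded by identifier (the safety check Steven and Walaa describe). Everything in it is an assumption made for illustration: the names Table, TableRef, qualify and load_and_validate are hypothetical, the dictionary stands in for a real catalog setup, namespaces are treated as single-level, and the expected_uuid field is not part of the current view spec.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Table:
    identifier: str  # fully qualified name, e.g. "catalogA.sales.orders"
    uuid: str


@dataclass(frozen=True)
class TableRef:
    name: str           # identifier as written in the view SQL
    expected_uuid: str  # hypothetical: UUID recorded when the view was created


def qualify(name: str, default_catalog: Optional[str],
            default_namespace: str, view_catalog: str) -> str:
    """Expand a partial identifier to catalog.namespace.table.

    Simplification: assumes single-level namespaces, so three parts
    means the identifier is already fully qualified.
    """
    parts = name.split(".")
    if len(parts) >= 3:
        return name
    # Fall back to the catalog the view was loaded from when
    # "default-catalog" is not set.
    catalog = default_catalog if default_catalog is not None else view_catalog
    if len(parts) == 2:
        return f"{catalog}.{name}"
    return f"{catalog}.{default_namespace}.{name}"


def load_and_validate(ref: TableRef, tables: Dict[str, Table],
                      default_catalog: Optional[str],
                      default_namespace: str, view_catalog: str) -> Table:
    """Load by identifier first, then compare UUIDs and fail on mismatch."""
    full_name = qualify(ref.name, default_catalog, default_namespace, view_catalog)
    table = tables.get(full_name)  # toy stand-in for a catalog lookup
    if table is None:
        raise LookupError(f"table not found: {full_name}")
    if table.uuid != ref.expected_uuid:
        raise ValueError(f"UUID mismatch for {full_name}")
    return table
```

Note that the UUID is only consulted after the identifier lookup succeeds, which matches Jan's point that UUIDs cannot replace a working catalog and identifier setup; they can only confirm that the right table was loaded.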