Thank you all for your valuable input on this topic. I apologize for not being able to keep up with the discussion last week, as I was traveling during the U.S. Thanksgiving holiday.
To summarize, I think we have the agreement on following points: - tableExists API shall indicate whether a table identifier exists in the catalog, regardless of metadata corruption - Currently, we attempt to load the table first to determine table existence, but this is not necessary - For Hive catalog, API can return true if the table identifier exists in HMS and it is an iceberg table. - There are no plans to change other existing behaviors. I've addressed the feedback from reviewers and also added explicit tests coverage, PR is ready for another look in https://github.com/apache/iceberg/pull/11597. Thanks, Steve Zhang > On Nov 27, 2024, at 10:15 PM, Péter Váry <peter.vary.apa...@gmail.com> wrote: > > +1 from my side too. > > I wanted to make sure that the community is aware of this change which will > bring behavioral difference compared to other catalogs. This is why I have > asked Steve to start this thread. > > > On Thu, Nov 28, 2024, 02:10 Szehon Ho <szehon.apa...@gmail.com > <mailto:szehon.apa...@gmail.com>> wrote: >> Yea, I think that part is definitely kept. >> >> Thanks >> Szehon >> >> On Wed, Nov 27, 2024 at 12:02 PM rdb...@gmail.com <mailto:rdb...@gmail.com> >> <rdb...@gmail.com <mailto:rdb...@gmail.com>> wrote: >>> I'd support changing the behavior if we still have a way to match the >>> intent, which is to return true if the table exists in Hive and is an >>> Iceberg table. >>> >>> On Wed, Nov 27, 2024 at 11:26 AM Szehon Ho <szehon.apa...@gmail.com >>> <mailto:szehon.apa...@gmail.com>> wrote: >>>> Hm I think the thread got a bit sidetracked by the other question. >>>> >>>> The initial proposal by Steve is a performance improvement for >>>> HiveCatalog's tableExists(). Currently it loads both Hive and Iceberg >>>> table metadata, and if successful returns true. The proposal is to load >>>> from Hive only, and return true if Hive metadata identifies that an >>>> Iceberg table exists with this name. >>>> >>>> Checking corruption of Iceberg's table metadata.json is a side-effect of >>>> the current behavior, but would not anymore with the proposed change. >>>> That's the question of the original thread, and so far there's agreement >>>> that it is not necessarily part of this scope of HiveCatalog's >>>> tableExists(). >>>> >>>> At least this is my understanding. >>>> Thanks, >>>> Szehon >>>> >>>> On Wed, Nov 27, 2024 at 10:56 AM rdb...@gmail.com >>>> <mailto:rdb...@gmail.com> <rdb...@gmail.com <mailto:rdb...@gmail.com>> >>>> wrote: >>>>> What kind of corruption are you referring to? I would expect corruption >>>>> to result in an exception when loading the table, but that the table >>>>> should still exist. The problem is likely that we determine if a table >>>>> exists by attempting to load it. We could fix that by not attempting to >>>>> load the table. I think that's a reasonable solution. >>>>> >>>>> On Wed, Nov 27, 2024 at 12:45 AM Manu Zhang <owenzhang1...@gmail.com >>>>> <mailto:owenzhang1...@gmail.com>> wrote: >>>>>>> The current behavior's intent is not to check whether the metadata is >>>>>>> valid, it is to detect whether the table is an Iceberg table. >>>>>> >>>>>> Is there a way to detect this from HiveCatalog without loading the >>>>>> table? >>>>>> >>>>>> >>>>>> On Wed, Nov 27, 2024 at 2:01 PM Péter Váry <peter.vary.apa...@gmail.com >>>>>> <mailto:peter.vary.apa...@gmail.com>> wrote: >>>>>>> I think we have an agreement, not to change the behavior wrt existing >>>>>>> non-Iceberg tables, and throw an exception. >>>>>>> >>>>>>> Are we also in agreement with the original proposal to return true when >>>>>>> the table exists but the metadata is somehow corrupted? Note: this is >>>>>>> the proposed change of behavior why the thread was originally started. >>>>>>> >>>>>>> On Tue, Nov 26, 2024, 21:30 rdb...@gmail.com <mailto:rdb...@gmail.com> >>>>>>> <rdb...@gmail.com <mailto:rdb...@gmail.com>> wrote: >>>>>>>> I'd argue against changing this. The current behavior's intent is not >>>>>>>> to check whether the metadata is valid, it is to detect whether the >>>>>>>> table is an Iceberg table. It ignores non-Iceberg tables. Changing >>>>>>>> that behavior would be surprising, especially if we started throwing >>>>>>>> exceptions. >>>>>>>> >>>>>>>> On Fri, Nov 22, 2024 at 2:01 PM Kevin Liu <kevinjq...@apache.org >>>>>>>> <mailto:kevinjq...@apache.org>> wrote: >>>>>>>>> > Should add, my personal preference is probably not to change the >>>>>>>>> > existing behavior for this part >>>>>>>>> >>>>>>>>> +1. I realized that this is not a new behavior. The `loadTable` >>>>>>>>> implementation has this problem too. >>>>>>>>> It would be good to have a test case specifically for this edge case >>>>>>>>> and maybe call this out in the documentation. >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> Kevin Liu >>>>>>>>> >>>>>>>>> On Fri, Nov 22, 2024 at 11:57 AM Szehon Ho <szehon.apa...@gmail.com >>>>>>>>> <mailto:szehon.apa...@gmail.com>> wrote: >>>>>>>>>> Should add, my personal preference is probably not to change the >>>>>>>>>> existing behavior for this part (false, if exists a Hive table with >>>>>>>>>> same name) at the moment, just adding another possibility for >>>>>>>>>> consideration. >>>>>>>>>> >>>>>>>>>> Thanks >>>>>>>>>> Szehon >>>>>>>>>> >>>>>>>>>> On Fri, Nov 22, 2024 at 2:00 AM Szehon Ho <szehon.apa...@gmail.com >>>>>>>>>> <mailto:szehon.apa...@gmail.com>> wrote: >>>>>>>>>>> Thanks Kevin and Gabor, this is an interesting discussion. I guess >>>>>>>>>>> a third option instead of returning true/false in this case, is to >>>>>>>>>>> change it to throw an NoSuchIcebergTableException if its a >>>>>>>>>>> non-Iceberg table, which I think is actually what this pr does? >>>>>>>>>>> >>>>>>>>>>> Thanks >>>>>>>>>>> Szehon >>>>>>>>>>> >>>>>>>>>>> On Fri, Nov 22, 2024 at 1:08 AM Gabor Kaszab >>>>>>>>>>> <gaborkas...@cloudera.com.invalid> wrote: >>>>>>>>>>>> Hey, >>>>>>>>>>>> >>>>>>>>>>>> I think what Kevin says makes sense. However, it would then >>>>>>>>>>>> confuse the opposite use case of this function. Let's assume that >>>>>>>>>>>> we change the implementation of tableExists() to not load the >>>>>>>>>>>> table internally: >>>>>>>>>>>> >>>>>>>>>>>> if (tableExists(table_name)) { >>>>>>>>>>>> table = loadTable(table_name); >>>>>>>>>>>> } >>>>>>>>>>>> >>>>>>>>>>>> Here, you find that the table exists but when you try to load it >>>>>>>>>>>> it fails because it's not an Iceberg table. I don't think that any >>>>>>>>>>>> of these 2 are intuitive. I think the question here is how much an >>>>>>>>>>>> API of the Iceberg table format should know about the existence of >>>>>>>>>>>> tables in other formats. >>>>>>>>>>>> >>>>>>>>>>>>> If `tableExists` is meant to check for conflicting entries in the >>>>>>>>>>>>> HMS >>>>>>>>>>>> Another interpretation of calling Catalog.tableExists() on an >>>>>>>>>>>> Iceberg API is instead "is there such an Iceberg table". TBH, not >>>>>>>>>>>> sure if any of the 2 approaches are better than the other, I just >>>>>>>>>>>> wanted to show that there is another side of the coin :) >>>>>>>>>>>> >>>>>>>>>>>> Regards, >>>>>>>>>>>> Gabor >>>>>>>>>>>> >>>>>>>>>>>> On Fri, Nov 22, 2024 at 3:13 AM Kevin Liu <kevin.jq....@gmail.com >>>>>>>>>>>> <mailto:kevin.jq....@gmail.com>> wrote: >>>>>>>>>>>>> Hi Steve, >>>>>>>>>>>>> >>>>>>>>>>>>> This makes sense to me. The semantics of `tableExists` focus on >>>>>>>>>>>>> whether a table's name exists in the catalog. For the Hive >>>>>>>>>>>>> catalog, checking the HMS entry should be sufficient. >>>>>>>>>>>>> >>>>>>>>>>>>> I do have a question about usage, though. Typically, I would use >>>>>>>>>>>>> `tableExists` like this: >>>>>>>>>>>>> >>>>>>>>>>>>> ``` >>>>>>>>>>>>> if (!tableExists(table_name)) { >>>>>>>>>>>>> table = createTable(table_name); >>>>>>>>>>>>> } >>>>>>>>>>>>> ``` >>>>>>>>>>>>> What happens when a Hive table with the same name already exists >>>>>>>>>>>>> in the catalog? In the current implementation, `tableExists` >>>>>>>>>>>>> would return `false` because >>>>>>>>>>>>> `HiveOperationsBase.validateTableIsIceberg` throws a >>>>>>>>>>>>> `NoSuchTableException`. >>>>>>>>>>>>> This would cause the code above to attempt to create the table, >>>>>>>>>>>>> only to fail since the name already exists in the HMS. >>>>>>>>>>>>> If `tableExists` is meant to check for conflicting entries in the >>>>>>>>>>>>> HMS, perhaps it should return true even when a Hive table with >>>>>>>>>>>>> the same name exists. >>>>>>>>>>>>> >>>>>>>>>>>>> I’d love to hear your thoughts on this. >>>>>>>>>>>>> >>>>>>>>>>>>> Best, >>>>>>>>>>>>> Kevin Liu >>>>>>>>>>>>> >>>>>>>>>>>>> On Thu, Nov 21, 2024 at 5:22 PM Szehon Ho >>>>>>>>>>>>> <szehon.apa...@gmail.com <mailto:szehon.apa...@gmail.com>> wrote: >>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>> >>>>>>>>>>>>>> It's a good performance find and improvement. Left some >>>>>>>>>>>>>> comment on the PR. >>>>>>>>>>>>>> >>>>>>>>>>>>>> IMO, the behavior actually more matches the API javadoc ("Check >>>>>>>>>>>>>> whether table exists"), not whether it is corrupted or not, so >>>>>>>>>>>>>> I'm supportive of it. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>> Szehon >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Thu, Nov 21, 2024 at 10:57 AM Steve Zhang >>>>>>>>>>>>>> <hongyue_zh...@apple.com.invalid> wrote: >>>>>>>>>>>>>>> Hi Iceberger, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I have a proposal to simplify the tableExists API in the Hive >>>>>>>>>>>>>>> catalog, which involves a behavior change, and I’d like to hear >>>>>>>>>>>>>>> your thoughts. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Currently, in our catalog interface[1], the tableExists >>>>>>>>>>>>>>> method is implemented as a default API by invoking the >>>>>>>>>>>>>>> loadTable method. It returns true if the table can be loaded >>>>>>>>>>>>>>> without exceptions. This behavior implies two checks: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> The table entry exists in the catalog. >>>>>>>>>>>>>>> The latest metadata.json for the table is not corrupted. >>>>>>>>>>>>>>> The behavior change I’m proposing focuses only on the first >>>>>>>>>>>>>>> condition—checking if the table entry exists in the catalog. >>>>>>>>>>>>>>> This separates the concerns of table existence and table health >>>>>>>>>>>>>>> (e.g., metadata not corrupted). Such a change could improve the >>>>>>>>>>>>>>> performance of existence checks, especially for RESTcatalog >>>>>>>>>>>>>>> where table existence is abstracted as an HTTP HEAD request [2]. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I also reviewed the current usage of the tableExists API in the >>>>>>>>>>>>>>> Iceberg codebase to ensure that this optimization would not >>>>>>>>>>>>>>> have any negative impact. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I’d love to hear everyone’s feedback on this! If there’s >>>>>>>>>>>>>>> consensus, I can follow up with a similar optimization for the >>>>>>>>>>>>>>> viewExists method in the Hive catalog. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> [1]: https://github.com/apache/iceberg/pull/11597 >>>>>>>>>>>>>>> [2]: >>>>>>>>>>>>>>> https://github.com/apache/iceberg/blob/3badfe0c1fcf0c0adfc7aa4a10f0b50365c48cf9/open-api/rest-catalog-open-api.yaml#L1129-L1133 >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Best regards, >>>>>>>>>>>>>>> Steve Zhang >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>