> Should add, my personal preference is probably not to change the existing behavior for this part
+1. I realized that this is not a new behavior. The `loadTable` implementation has this problem too. It would be good to have a test case specifically for this edge case and maybe call this out in the documentation. Thanks, Kevin Liu On Fri, Nov 22, 2024 at 11:57 AM Szehon Ho <szehon.apa...@gmail.com> wrote: > Should add, my personal preference is probably not to change the existing > behavior for this part (false, if exists a Hive table with same name) at > the moment, just adding another possibility for consideration. > > Thanks > Szehon > > On Fri, Nov 22, 2024 at 2:00 AM Szehon Ho <szehon.apa...@gmail.com> wrote: > >> Thanks Kevin and Gabor, this is an interesting discussion. I guess a >> third option instead of returning true/false in this case, is to change it >> to throw an NoSuchIcebergTableException if its a non-Iceberg table, which I >> think is actually what this pr does? >> >> Thanks >> Szehon >> >> On Fri, Nov 22, 2024 at 1:08 AM Gabor Kaszab >> <gaborkas...@cloudera.com.invalid> wrote: >> >>> Hey, >>> >>> I think what Kevin says makes sense. However, it would then confuse the >>> opposite use case of this function. Let's assume that we change the >>> implementation of tableExists() to not load the table internally: >>> >>> if (tableExists(table_name)) { >>> table = loadTable(table_name); >>> } >>> >>> Here, you find that the table exists but when you try to load it it >>> fails because it's not an Iceberg table. I don't think that any of these 2 >>> are intuitive. I think the question here is how much an API of the Iceberg >>> table format should know about the existence of tables in other formats. >>> >>> If `tableExists` is meant to check for conflicting entries in the HMS >>> >>> Another interpretation of calling Catalog.tableExists() on an Iceberg >>> API is instead "is there such an Iceberg table". TBH, not sure if any of >>> the 2 approaches are better than the other, I just wanted to show that >>> there is another side of the coin :) >>> >>> Regards, >>> Gabor >>> >>> On Fri, Nov 22, 2024 at 3:13 AM Kevin Liu <kevin.jq....@gmail.com> >>> wrote: >>> >>>> Hi Steve, >>>> >>>> This makes sense to me. The semantics of `tableExists` focus on whether >>>> a table's name exists in the catalog. For the Hive catalog, checking the >>>> HMS entry should be sufficient. >>>> >>>> I do have a question about usage, though. Typically, I would use ` >>>> tableExists` like this: >>>> >>>> ``` >>>> if (!tableExists(table_name)) { >>>> table = createTable(table_name); >>>> } >>>> ``` >>>> What happens when a Hive table with the same name already exists in the >>>> catalog? In the current implementation, `tableExists` would return `false` >>>> because `HiveOperationsBase.validateTableIsIceberg` throws a >>>> `NoSuchTableException`. >>>> This would cause the code above to attempt to create the table, only to >>>> fail since the name already exists in the HMS. >>>> If `tableExists` is meant to check for conflicting entries in the HMS, >>>> perhaps it should return true even when a Hive table with the same name >>>> exists. >>>> >>>> I’d love to hear your thoughts on this. >>>> >>>> Best, >>>> Kevin Liu >>>> >>>> On Thu, Nov 21, 2024 at 5:22 PM Szehon Ho <szehon.apa...@gmail.com> >>>> wrote: >>>> >>>>> Hi, >>>>> >>>>> It's a good performance find and improvement. Left some comment on >>>>> the PR. >>>>> >>>>> IMO, the behavior actually more matches the API javadoc ("Check >>>>> whether table exists"), not whether it is corrupted or not, so I'm >>>>> supportive of it. >>>>> >>>>> Thanks >>>>> Szehon >>>>> >>>>> On Thu, Nov 21, 2024 at 10:57 AM Steve Zhang >>>>> <hongyue_zh...@apple.com.invalid> wrote: >>>>> >>>>>> Hi Iceberger, >>>>>> >>>>>> I have a proposal to simplify the tableExists API in the Hive >>>>>> catalog, which involves a behavior change, and I’d like to hear your >>>>>> thoughts. >>>>>> >>>>>> Currently, in our catalog interface[1], the tableExists method is >>>>>> implemented as a default API by invoking the loadTable method. It >>>>>> returns true if the table can be loaded without exceptions. This >>>>>> behavior implies two checks: >>>>>> >>>>>> 1. The table entry exists in the catalog. >>>>>> 2. The latest metadata.json for the table is not corrupted. >>>>>> >>>>>> The behavior change I’m proposing focuses only on the first >>>>>> condition—checking if the table entry exists in the catalog. This >>>>>> separates >>>>>> the concerns of table existence and table health (e.g., metadata not >>>>>> corrupted). Such a change could improve the performance of existence >>>>>> checks, especially for RESTcatalog where table existence is abstracted as >>>>>> an HTTP HEAD request [2]. >>>>>> >>>>>> I also reviewed the current usage of the tableExists API in the >>>>>> Iceberg codebase to ensure that this optimization would not have any >>>>>> negative impact. >>>>>> >>>>>> I’d love to hear everyone’s feedback on this! If there’s consensus, I >>>>>> can follow up with a similar optimization for the viewExists method >>>>>> in the Hive catalog. >>>>>> >>>>>> [1]: https://github.com/apache/iceberg/pull/11597 >>>>>> >>>>>> [2]: >>>>>> https://github.com/apache/iceberg/blob/3badfe0c1fcf0c0adfc7aa4a10f0b50365c48cf9/open-api/rest-catalog-open-api.yaml#L1129-L1133 >>>>>> >>>>>> >>>>>> >>>>>> Best regards, >>>>>> Steve Zhang >>>>>> >>>>>> >>>>>> >>>>>>