Should add, my personal preference is probably not to change the existing behavior for this part (false, if exists a Hive table with same name) at the moment, just adding another possibility for consideration.
Thanks Szehon On Fri, Nov 22, 2024 at 2:00 AM Szehon Ho <szehon.apa...@gmail.com> wrote: > Thanks Kevin and Gabor, this is an interesting discussion. I guess a > third option instead of returning true/false in this case, is to change it > to throw an NoSuchIcebergTableException if its a non-Iceberg table, which I > think is actually what this pr does? > > Thanks > Szehon > > On Fri, Nov 22, 2024 at 1:08 AM Gabor Kaszab > <gaborkas...@cloudera.com.invalid> wrote: > >> Hey, >> >> I think what Kevin says makes sense. However, it would then confuse the >> opposite use case of this function. Let's assume that we change the >> implementation of tableExists() to not load the table internally: >> >> if (tableExists(table_name)) { >> table = loadTable(table_name); >> } >> >> Here, you find that the table exists but when you try to load it it fails >> because it's not an Iceberg table. I don't think that any of these 2 are >> intuitive. I think the question here is how much an API of the Iceberg >> table format should know about the existence of tables in other formats. >> >> If `tableExists` is meant to check for conflicting entries in the HMS >> >> Another interpretation of calling Catalog.tableExists() on an Iceberg API >> is instead "is there such an Iceberg table". TBH, not sure if any of the 2 >> approaches are better than the other, I just wanted to show that there is >> another side of the coin :) >> >> Regards, >> Gabor >> >> On Fri, Nov 22, 2024 at 3:13 AM Kevin Liu <kevin.jq....@gmail.com> wrote: >> >>> Hi Steve, >>> >>> This makes sense to me. The semantics of `tableExists` focus on whether >>> a table's name exists in the catalog. For the Hive catalog, checking the >>> HMS entry should be sufficient. >>> >>> I do have a question about usage, though. Typically, I would use ` >>> tableExists` like this: >>> >>> ``` >>> if (!tableExists(table_name)) { >>> table = createTable(table_name); >>> } >>> ``` >>> What happens when a Hive table with the same name already exists in the >>> catalog? In the current implementation, `tableExists` would return `false` >>> because `HiveOperationsBase.validateTableIsIceberg` throws a >>> `NoSuchTableException`. >>> This would cause the code above to attempt to create the table, only to >>> fail since the name already exists in the HMS. >>> If `tableExists` is meant to check for conflicting entries in the HMS, >>> perhaps it should return true even when a Hive table with the same name >>> exists. >>> >>> I’d love to hear your thoughts on this. >>> >>> Best, >>> Kevin Liu >>> >>> On Thu, Nov 21, 2024 at 5:22 PM Szehon Ho <szehon.apa...@gmail.com> >>> wrote: >>> >>>> Hi, >>>> >>>> It's a good performance find and improvement. Left some comment on >>>> the PR. >>>> >>>> IMO, the behavior actually more matches the API javadoc ("Check whether >>>> table exists"), not whether it is corrupted or not, so I'm supportive of >>>> it. >>>> >>>> Thanks >>>> Szehon >>>> >>>> On Thu, Nov 21, 2024 at 10:57 AM Steve Zhang >>>> <hongyue_zh...@apple.com.invalid> wrote: >>>> >>>>> Hi Iceberger, >>>>> >>>>> I have a proposal to simplify the tableExists API in the Hive >>>>> catalog, which involves a behavior change, and I’d like to hear your >>>>> thoughts. >>>>> >>>>> Currently, in our catalog interface[1], the tableExists method is >>>>> implemented as a default API by invoking the loadTable method. It >>>>> returns true if the table can be loaded without exceptions. This >>>>> behavior implies two checks: >>>>> >>>>> 1. The table entry exists in the catalog. >>>>> 2. The latest metadata.json for the table is not corrupted. >>>>> >>>>> The behavior change I’m proposing focuses only on the first >>>>> condition—checking if the table entry exists in the catalog. This >>>>> separates >>>>> the concerns of table existence and table health (e.g., metadata not >>>>> corrupted). Such a change could improve the performance of existence >>>>> checks, especially for RESTcatalog where table existence is abstracted as >>>>> an HTTP HEAD request [2]. >>>>> >>>>> I also reviewed the current usage of the tableExists API in the >>>>> Iceberg codebase to ensure that this optimization would not have any >>>>> negative impact. >>>>> >>>>> I’d love to hear everyone’s feedback on this! If there’s consensus, I >>>>> can follow up with a similar optimization for the viewExists method >>>>> in the Hive catalog. >>>>> >>>>> [1]: https://github.com/apache/iceberg/pull/11597 >>>>> >>>>> [2]: >>>>> https://github.com/apache/iceberg/blob/3badfe0c1fcf0c0adfc7aa4a10f0b50365c48cf9/open-api/rest-catalog-open-api.yaml#L1129-L1133 >>>>> >>>>> >>>>> >>>>> Best regards, >>>>> Steve Zhang >>>>> >>>>> >>>>> >>>>>