> > The current behavior's intent is not to check whether the metadata is > valid, it is to detect whether the table is an Iceberg table.
Is there a way to detect this from HiveCatalog without loading the table? On Wed, Nov 27, 2024 at 2:01 PM Péter Váry <peter.vary.apa...@gmail.com> wrote: > I think we have an agreement, not to change the behavior wrt existing > non-Iceberg tables, and throw an exception. > > Are we also in agreement with the original proposal to return true when > the table exists but the metadata is somehow corrupted? Note: this is the > proposed change of behavior why the thread was originally started. > > On Tue, Nov 26, 2024, 21:30 rdb...@gmail.com <rdb...@gmail.com> wrote: > >> I'd argue against changing this. The current behavior's intent is not to >> check whether the metadata is valid, it is to detect whether the table is >> an Iceberg table. It ignores non-Iceberg tables. Changing that behavior >> would be surprising, especially if we started throwing exceptions. >> >> On Fri, Nov 22, 2024 at 2:01 PM Kevin Liu <kevinjq...@apache.org> wrote: >> >>> > Should add, my personal preference is probably not to change the >>> existing behavior for this part >>> >>> +1. I realized that this is not a new behavior. The `loadTable` >>> implementation has this problem too. >>> It would be good to have a test case specifically for this edge case and >>> maybe call this out in the documentation. >>> >>> Thanks, >>> Kevin Liu >>> >>> On Fri, Nov 22, 2024 at 11:57 AM Szehon Ho <szehon.apa...@gmail.com> >>> wrote: >>> >>>> Should add, my personal preference is probably not to change the >>>> existing behavior for this part (false, if exists a Hive table with same >>>> name) at the moment, just adding another possibility for consideration. >>>> >>>> Thanks >>>> Szehon >>>> >>>> On Fri, Nov 22, 2024 at 2:00 AM Szehon Ho <szehon.apa...@gmail.com> >>>> wrote: >>>> >>>>> Thanks Kevin and Gabor, this is an interesting discussion. I guess a >>>>> third option instead of returning true/false in this case, is to change it >>>>> to throw an NoSuchIcebergTableException if its a non-Iceberg table, which >>>>> I >>>>> think is actually what this pr does? >>>>> >>>>> Thanks >>>>> Szehon >>>>> >>>>> On Fri, Nov 22, 2024 at 1:08 AM Gabor Kaszab >>>>> <gaborkas...@cloudera.com.invalid> wrote: >>>>> >>>>>> Hey, >>>>>> >>>>>> I think what Kevin says makes sense. However, it would then confuse >>>>>> the opposite use case of this function. Let's assume that we change the >>>>>> implementation of tableExists() to not load the table internally: >>>>>> >>>>>> if (tableExists(table_name)) { >>>>>> table = loadTable(table_name); >>>>>> } >>>>>> >>>>>> Here, you find that the table exists but when you try to load it it >>>>>> fails because it's not an Iceberg table. I don't think that any of these >>>>>> 2 >>>>>> are intuitive. I think the question here is how much an API of the >>>>>> Iceberg >>>>>> table format should know about the existence of tables in other formats. >>>>>> >>>>>> If `tableExists` is meant to check for conflicting entries in the HMS >>>>>> >>>>>> Another interpretation of calling Catalog.tableExists() on an Iceberg >>>>>> API is instead "is there such an Iceberg table". TBH, not sure if any of >>>>>> the 2 approaches are better than the other, I just wanted to show that >>>>>> there is another side of the coin :) >>>>>> >>>>>> Regards, >>>>>> Gabor >>>>>> >>>>>> On Fri, Nov 22, 2024 at 3:13 AM Kevin Liu <kevin.jq....@gmail.com> >>>>>> wrote: >>>>>> >>>>>>> Hi Steve, >>>>>>> >>>>>>> This makes sense to me. The semantics of `tableExists` focus on >>>>>>> whether a table's name exists in the catalog. For the Hive catalog, >>>>>>> checking the HMS entry should be sufficient. >>>>>>> >>>>>>> I do have a question about usage, though. Typically, I would use ` >>>>>>> tableExists` like this: >>>>>>> >>>>>>> ``` >>>>>>> if (!tableExists(table_name)) { >>>>>>> table = createTable(table_name); >>>>>>> } >>>>>>> ``` >>>>>>> What happens when a Hive table with the same name already exists in >>>>>>> the catalog? In the current implementation, `tableExists` would return >>>>>>> `false` because `HiveOperationsBase.validateTableIsIceberg` throws a >>>>>>> `NoSuchTableException`. >>>>>>> This would cause the code above to attempt to create the table, only >>>>>>> to fail since the name already exists in the HMS. >>>>>>> If `tableExists` is meant to check for conflicting entries in the >>>>>>> HMS, perhaps it should return true even when a Hive table with the same >>>>>>> name exists. >>>>>>> >>>>>>> I’d love to hear your thoughts on this. >>>>>>> >>>>>>> Best, >>>>>>> Kevin Liu >>>>>>> >>>>>>> On Thu, Nov 21, 2024 at 5:22 PM Szehon Ho <szehon.apa...@gmail.com> >>>>>>> wrote: >>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> It's a good performance find and improvement. Left some comment >>>>>>>> on the PR. >>>>>>>> >>>>>>>> IMO, the behavior actually more matches the API javadoc ("Check >>>>>>>> whether table exists"), not whether it is corrupted or not, so I'm >>>>>>>> supportive of it. >>>>>>>> >>>>>>>> Thanks >>>>>>>> Szehon >>>>>>>> >>>>>>>> On Thu, Nov 21, 2024 at 10:57 AM Steve Zhang >>>>>>>> <hongyue_zh...@apple.com.invalid> wrote: >>>>>>>> >>>>>>>>> Hi Iceberger, >>>>>>>>> >>>>>>>>> I have a proposal to simplify the tableExists API in the Hive >>>>>>>>> catalog, which involves a behavior change, and I’d like to hear your >>>>>>>>> thoughts. >>>>>>>>> >>>>>>>>> Currently, in our catalog interface[1], the tableExists method >>>>>>>>> is implemented as a default API by invoking the loadTable method. >>>>>>>>> It returns true if the table can be loaded without exceptions. >>>>>>>>> This behavior implies two checks: >>>>>>>>> >>>>>>>>> 1. The table entry exists in the catalog. >>>>>>>>> 2. The latest metadata.json for the table is not corrupted. >>>>>>>>> >>>>>>>>> The behavior change I’m proposing focuses only on the first >>>>>>>>> condition—checking if the table entry exists in the catalog. This >>>>>>>>> separates >>>>>>>>> the concerns of table existence and table health (e.g., metadata not >>>>>>>>> corrupted). Such a change could improve the performance of existence >>>>>>>>> checks, especially for RESTcatalog where table existence is >>>>>>>>> abstracted as >>>>>>>>> an HTTP HEAD request [2]. >>>>>>>>> >>>>>>>>> I also reviewed the current usage of the tableExists API in the >>>>>>>>> Iceberg codebase to ensure that this optimization would not have any >>>>>>>>> negative impact. >>>>>>>>> >>>>>>>>> I’d love to hear everyone’s feedback on this! If there’s >>>>>>>>> consensus, I can follow up with a similar optimization for the >>>>>>>>> viewExists method in the Hive catalog. >>>>>>>>> >>>>>>>>> [1]: https://github.com/apache/iceberg/pull/11597 >>>>>>>>> >>>>>>>>> [2]: >>>>>>>>> https://github.com/apache/iceberg/blob/3badfe0c1fcf0c0adfc7aa4a10f0b50365c48cf9/open-api/rest-catalog-open-api.yaml#L1129-L1133 >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Best regards, >>>>>>>>> Steve Zhang >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>>