+1 from my side too. I wanted to make sure that the community is aware of this change which will bring behavioral difference compared to other catalogs. This is why I have asked Steve to start this thread.
On Thu, Nov 28, 2024, 02:10 Szehon Ho <szehon.apa...@gmail.com> wrote: > Yea, I think that part is definitely kept. > > Thanks > Szehon > > On Wed, Nov 27, 2024 at 12:02 PM rdb...@gmail.com <rdb...@gmail.com> > wrote: > >> I'd support changing the behavior if we still have a way to match the >> intent, which is to return true if the table exists in Hive and is an >> Iceberg table. >> >> On Wed, Nov 27, 2024 at 11:26 AM Szehon Ho <szehon.apa...@gmail.com> >> wrote: >> >>> Hm I think the thread got a bit sidetracked by the other question. >>> >>> The initial proposal by Steve is a performance improvement for >>> HiveCatalog's tableExists(). Currently it loads both Hive and Iceberg >>> table metadata, and if successful returns true. The proposal is to load >>> from Hive only, and return true if Hive metadata identifies that an Iceberg >>> table exists with this name. >>> >>> Checking corruption of Iceberg's table metadata.json is a side-effect of >>> the current behavior, but would not anymore with the proposed change. >>> That's the question of the original thread, and so far there's agreement >>> that it is not necessarily part of this scope of HiveCatalog's >>> tableExists(). >>> >>> At least this is my understanding. >>> Thanks, >>> Szehon >>> >>> On Wed, Nov 27, 2024 at 10:56 AM rdb...@gmail.com <rdb...@gmail.com> >>> wrote: >>> >>>> What kind of corruption are you referring to? I would expect corruption >>>> to result in an exception when loading the table, but that the table should >>>> still exist. The problem is likely that we determine if a table exists by >>>> attempting to load it. We could fix that by not attempting to load the >>>> table. I think that's a reasonable solution. >>>> >>>> On Wed, Nov 27, 2024 at 12:45 AM Manu Zhang <owenzhang1...@gmail.com> >>>> wrote: >>>> >>>>> The current behavior's intent is not to check whether the metadata is >>>>>> valid, it is to detect whether the table is an Iceberg table. >>>>> >>>>> >>>>> Is there a way to detect this from HiveCatalog without loading the >>>>> table? >>>>> >>>>> >>>>> On Wed, Nov 27, 2024 at 2:01 PM Péter Váry < >>>>> peter.vary.apa...@gmail.com> wrote: >>>>> >>>>>> I think we have an agreement, not to change the behavior wrt existing >>>>>> non-Iceberg tables, and throw an exception. >>>>>> >>>>>> Are we also in agreement with the original proposal to return true >>>>>> when the table exists but the metadata is somehow corrupted? Note: this >>>>>> is >>>>>> the proposed change of behavior why the thread was originally started. >>>>>> >>>>>> On Tue, Nov 26, 2024, 21:30 rdb...@gmail.com <rdb...@gmail.com> >>>>>> wrote: >>>>>> >>>>>>> I'd argue against changing this. The current behavior's intent is >>>>>>> not to check whether the metadata is valid, it is to detect whether the >>>>>>> table is an Iceberg table. It ignores non-Iceberg tables. Changing that >>>>>>> behavior would be surprising, especially if we started throwing >>>>>>> exceptions. >>>>>>> >>>>>>> On Fri, Nov 22, 2024 at 2:01 PM Kevin Liu <kevinjq...@apache.org> >>>>>>> wrote: >>>>>>> >>>>>>>> > Should add, my personal preference is probably not to change the >>>>>>>> existing behavior for this part >>>>>>>> >>>>>>>> +1. I realized that this is not a new behavior. The `loadTable` >>>>>>>> implementation has this problem too. >>>>>>>> It would be good to have a test case specifically for this edge >>>>>>>> case and maybe call this out in the documentation. >>>>>>>> >>>>>>>> Thanks, >>>>>>>> Kevin Liu >>>>>>>> >>>>>>>> On Fri, Nov 22, 2024 at 11:57 AM Szehon Ho <szehon.apa...@gmail.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Should add, my personal preference is probably not to change the >>>>>>>>> existing behavior for this part (false, if exists a Hive table with >>>>>>>>> same >>>>>>>>> name) at the moment, just adding another possibility for >>>>>>>>> consideration. >>>>>>>>> >>>>>>>>> Thanks >>>>>>>>> Szehon >>>>>>>>> >>>>>>>>> On Fri, Nov 22, 2024 at 2:00 AM Szehon Ho <szehon.apa...@gmail.com> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Thanks Kevin and Gabor, this is an interesting discussion. I >>>>>>>>>> guess a third option instead of returning true/false in this case, >>>>>>>>>> is to >>>>>>>>>> change it to throw an NoSuchIcebergTableException if its a >>>>>>>>>> non-Iceberg >>>>>>>>>> table, which I think is actually what this pr does? >>>>>>>>>> >>>>>>>>>> Thanks >>>>>>>>>> Szehon >>>>>>>>>> >>>>>>>>>> On Fri, Nov 22, 2024 at 1:08 AM Gabor Kaszab >>>>>>>>>> <gaborkas...@cloudera.com.invalid> wrote: >>>>>>>>>> >>>>>>>>>>> Hey, >>>>>>>>>>> >>>>>>>>>>> I think what Kevin says makes sense. However, it would then >>>>>>>>>>> confuse the opposite use case of this function. Let's assume that >>>>>>>>>>> we change >>>>>>>>>>> the implementation of tableExists() to not load the table >>>>>>>>>>> internally: >>>>>>>>>>> >>>>>>>>>>> if (tableExists(table_name)) { >>>>>>>>>>> table = loadTable(table_name); >>>>>>>>>>> } >>>>>>>>>>> >>>>>>>>>>> Here, you find that the table exists but when you try to load it >>>>>>>>>>> it fails because it's not an Iceberg table. I don't think that any >>>>>>>>>>> of these >>>>>>>>>>> 2 are intuitive. I think the question here is how much an API of the >>>>>>>>>>> Iceberg table format should know about the existence of tables in >>>>>>>>>>> other >>>>>>>>>>> formats. >>>>>>>>>>> >>>>>>>>>>> If `tableExists` is meant to check for conflicting entries in >>>>>>>>>>>> the HMS >>>>>>>>>>> >>>>>>>>>>> Another interpretation of calling Catalog.tableExists() on an >>>>>>>>>>> Iceberg API is instead "is there such an Iceberg table". TBH, not >>>>>>>>>>> sure if >>>>>>>>>>> any of the 2 approaches are better than the other, I just wanted to >>>>>>>>>>> show >>>>>>>>>>> that there is another side of the coin :) >>>>>>>>>>> >>>>>>>>>>> Regards, >>>>>>>>>>> Gabor >>>>>>>>>>> >>>>>>>>>>> On Fri, Nov 22, 2024 at 3:13 AM Kevin Liu < >>>>>>>>>>> kevin.jq....@gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hi Steve, >>>>>>>>>>>> >>>>>>>>>>>> This makes sense to me. The semantics of `tableExists` focus on >>>>>>>>>>>> whether a table's name exists in the catalog. For the Hive catalog, >>>>>>>>>>>> checking the HMS entry should be sufficient. >>>>>>>>>>>> >>>>>>>>>>>> I do have a question about usage, though. Typically, I would >>>>>>>>>>>> use `tableExists` like this: >>>>>>>>>>>> >>>>>>>>>>>> ``` >>>>>>>>>>>> if (!tableExists(table_name)) { >>>>>>>>>>>> table = createTable(table_name); >>>>>>>>>>>> } >>>>>>>>>>>> ``` >>>>>>>>>>>> What happens when a Hive table with the same name already >>>>>>>>>>>> exists in the catalog? In the current implementation, >>>>>>>>>>>> `tableExists` would >>>>>>>>>>>> return `false` because `HiveOperationsBase.validateTableIsIceberg` >>>>>>>>>>>> throws a >>>>>>>>>>>> `NoSuchTableException`. >>>>>>>>>>>> This would cause the code above to attempt to create the table, >>>>>>>>>>>> only to fail since the name already exists in the HMS. >>>>>>>>>>>> If `tableExists` is meant to check for conflicting entries in >>>>>>>>>>>> the HMS, perhaps it should return true even when a Hive table with >>>>>>>>>>>> the same >>>>>>>>>>>> name exists. >>>>>>>>>>>> >>>>>>>>>>>> I’d love to hear your thoughts on this. >>>>>>>>>>>> >>>>>>>>>>>> Best, >>>>>>>>>>>> Kevin Liu >>>>>>>>>>>> >>>>>>>>>>>> On Thu, Nov 21, 2024 at 5:22 PM Szehon Ho < >>>>>>>>>>>> szehon.apa...@gmail.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hi, >>>>>>>>>>>>> >>>>>>>>>>>>> It's a good performance find and improvement. Left some >>>>>>>>>>>>> comment on the PR. >>>>>>>>>>>>> >>>>>>>>>>>>> IMO, the behavior actually more matches the API javadoc >>>>>>>>>>>>> ("Check whether table exists"), not whether it is corrupted or >>>>>>>>>>>>> not, so I'm >>>>>>>>>>>>> supportive of it. >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks >>>>>>>>>>>>> Szehon >>>>>>>>>>>>> >>>>>>>>>>>>> On Thu, Nov 21, 2024 at 10:57 AM Steve Zhang >>>>>>>>>>>>> <hongyue_zh...@apple.com.invalid> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Hi Iceberger, >>>>>>>>>>>>>> >>>>>>>>>>>>>> I have a proposal to simplify the tableExists API in the >>>>>>>>>>>>>> Hive catalog, which involves a behavior change, and I’d like to >>>>>>>>>>>>>> hear your >>>>>>>>>>>>>> thoughts. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Currently, in our catalog interface[1], the tableExists method >>>>>>>>>>>>>> is implemented as a default API by invoking the loadTable method. >>>>>>>>>>>>>> It returns true if the table can be loaded without >>>>>>>>>>>>>> exceptions. This behavior implies two checks: >>>>>>>>>>>>>> >>>>>>>>>>>>>> 1. The table entry exists in the catalog. >>>>>>>>>>>>>> 2. The latest metadata.json for the table is not >>>>>>>>>>>>>> corrupted. >>>>>>>>>>>>>> >>>>>>>>>>>>>> The behavior change I’m proposing focuses only on the first >>>>>>>>>>>>>> condition—checking if the table entry exists in the catalog. >>>>>>>>>>>>>> This separates >>>>>>>>>>>>>> the concerns of table existence and table health (e.g., metadata >>>>>>>>>>>>>> not >>>>>>>>>>>>>> corrupted). Such a change could improve the performance of >>>>>>>>>>>>>> existence >>>>>>>>>>>>>> checks, especially for RESTcatalog where table existence is >>>>>>>>>>>>>> abstracted as >>>>>>>>>>>>>> an HTTP HEAD request [2]. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I also reviewed the current usage of the tableExists API in >>>>>>>>>>>>>> the Iceberg codebase to ensure that this optimization would not >>>>>>>>>>>>>> have any >>>>>>>>>>>>>> negative impact. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I’d love to hear everyone’s feedback on this! If there’s >>>>>>>>>>>>>> consensus, I can follow up with a similar optimization for the >>>>>>>>>>>>>> viewExists method in the Hive catalog. >>>>>>>>>>>>>> >>>>>>>>>>>>>> [1]: https://github.com/apache/iceberg/pull/11597 >>>>>>>>>>>>>> >>>>>>>>>>>>>> [2]: >>>>>>>>>>>>>> https://github.com/apache/iceberg/blob/3badfe0c1fcf0c0adfc7aa4a10f0b50365c48cf9/open-api/rest-catalog-open-api.yaml#L1129-L1133 >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Best regards, >>>>>>>>>>>>>> Steve Zhang >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>