Thank you all for your valuable input on this topic. I apologize for not being 
able to keep up with the discussion last week, as I was traveling during the 
U.S. Thanksgiving holiday.

To summarize, I think we have the agreement on following points:
- tableExists API shall indicate whether a table identifier exists in the 
catalog, regardless of metadata corruption
- Currently, we attempt to load the table first to determine table existence, 
but this is not necessary
- For Hive catalog, API can return true if the table identifier exists in HMS 
and it is an iceberg table.
- There are no plans to change other existing behaviors.

I've addressed the feedback from reviewers and also added explicit tests 
coverage, PR is ready for another look in 
https://github.com/apache/iceberg/pull/11597.

Thanks,
Steve Zhang



> On Nov 27, 2024, at 10:15 PM, Péter Váry <peter.vary.apa...@gmail.com> wrote:
> 
> +1 from my side too.
> 
> I wanted to make sure that the community is aware of this change which will 
> bring behavioral difference compared to other catalogs. This is why I have 
> asked Steve to start this thread.
> 
> 
> On Thu, Nov 28, 2024, 02:10 Szehon Ho <szehon.apa...@gmail.com 
> <mailto:szehon.apa...@gmail.com>> wrote:
>> Yea, I think that part is definitely kept.
>> 
>> Thanks
>> Szehon
>> 
>> On Wed, Nov 27, 2024 at 12:02 PM rdb...@gmail.com <mailto:rdb...@gmail.com> 
>> <rdb...@gmail.com <mailto:rdb...@gmail.com>> wrote:
>>> I'd support changing the behavior if we still have a way to match the 
>>> intent, which is to return true if the table exists in Hive and is an 
>>> Iceberg table.
>>> 
>>> On Wed, Nov 27, 2024 at 11:26 AM Szehon Ho <szehon.apa...@gmail.com 
>>> <mailto:szehon.apa...@gmail.com>> wrote:
>>>> Hm I think the thread got a bit sidetracked by the other question.
>>>> 
>>>> The initial proposal by Steve is a performance improvement for 
>>>> HiveCatalog's tableExists().  Currently it loads both Hive and Iceberg 
>>>> table metadata, and if successful returns true.  The proposal is to load 
>>>> from Hive only, and return true if Hive metadata identifies that an 
>>>> Iceberg table exists with this name.  
>>>> 
>>>> Checking corruption of Iceberg's table metadata.json is a side-effect of 
>>>> the current behavior, but would not anymore with the proposed change.  
>>>> That's the question of the original thread, and so far there's agreement 
>>>> that it is not necessarily part of this scope of HiveCatalog's 
>>>> tableExists().  
>>>> 
>>>> At least this is my understanding.
>>>> Thanks,
>>>> Szehon
>>>> 
>>>> On Wed, Nov 27, 2024 at 10:56 AM rdb...@gmail.com 
>>>> <mailto:rdb...@gmail.com> <rdb...@gmail.com <mailto:rdb...@gmail.com>> 
>>>> wrote:
>>>>> What kind of corruption are you referring to? I would expect corruption 
>>>>> to result in an exception when loading the table, but that the table 
>>>>> should still exist. The problem is likely that we determine if a table 
>>>>> exists by attempting to load it. We could fix that by not attempting to 
>>>>> load the table. I think that's a reasonable solution.
>>>>> 
>>>>> On Wed, Nov 27, 2024 at 12:45 AM Manu Zhang <owenzhang1...@gmail.com 
>>>>> <mailto:owenzhang1...@gmail.com>> wrote:
>>>>>>> The current behavior's intent is not to check whether the metadata is 
>>>>>>> valid, it is to detect whether the table is an Iceberg table.
>>>>>> 
>>>>>> Is there a way to detect this from HiveCatalog without loading the 
>>>>>> table? 
>>>>>> 
>>>>>> 
>>>>>> On Wed, Nov 27, 2024 at 2:01 PM Péter Váry <peter.vary.apa...@gmail.com 
>>>>>> <mailto:peter.vary.apa...@gmail.com>> wrote:
>>>>>>> I think we have an agreement, not to change the behavior wrt existing 
>>>>>>> non-Iceberg tables, and throw an exception.
>>>>>>> 
>>>>>>> Are we also in agreement with the original proposal to return true when 
>>>>>>> the table exists but the metadata is somehow corrupted? Note: this is 
>>>>>>> the proposed change of behavior why the thread was originally started.
>>>>>>> 
>>>>>>> On Tue, Nov 26, 2024, 21:30 rdb...@gmail.com <mailto:rdb...@gmail.com> 
>>>>>>> <rdb...@gmail.com <mailto:rdb...@gmail.com>> wrote:
>>>>>>>> I'd argue against changing this. The current behavior's intent is not 
>>>>>>>> to check whether the metadata is valid, it is to detect whether the 
>>>>>>>> table is an Iceberg table. It ignores non-Iceberg tables. Changing 
>>>>>>>> that behavior would be surprising, especially if we started throwing 
>>>>>>>> exceptions.
>>>>>>>> 
>>>>>>>> On Fri, Nov 22, 2024 at 2:01 PM Kevin Liu <kevinjq...@apache.org 
>>>>>>>> <mailto:kevinjq...@apache.org>> wrote:
>>>>>>>>> > Should add, my personal preference is probably not to change the 
>>>>>>>>> > existing behavior for this part
>>>>>>>>> 
>>>>>>>>> +1. I realized that this is not a new behavior. The `loadTable` 
>>>>>>>>> implementation has this problem too. 
>>>>>>>>> It would be good to have a test case specifically for this edge case 
>>>>>>>>> and maybe call this out in the documentation. 
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Kevin Liu
>>>>>>>>> 
>>>>>>>>> On Fri, Nov 22, 2024 at 11:57 AM Szehon Ho <szehon.apa...@gmail.com 
>>>>>>>>> <mailto:szehon.apa...@gmail.com>> wrote:
>>>>>>>>>> Should add, my personal preference is probably not to change the 
>>>>>>>>>> existing behavior for this part (false, if exists a Hive table with 
>>>>>>>>>> same name) at the moment, just adding another possibility for 
>>>>>>>>>> consideration.
>>>>>>>>>> 
>>>>>>>>>> Thanks
>>>>>>>>>> Szehon
>>>>>>>>>> 
>>>>>>>>>> On Fri, Nov 22, 2024 at 2:00 AM Szehon Ho <szehon.apa...@gmail.com 
>>>>>>>>>> <mailto:szehon.apa...@gmail.com>> wrote:
>>>>>>>>>>> Thanks Kevin and Gabor, this is an interesting discussion.  I guess 
>>>>>>>>>>> a third option instead of returning true/false in this case, is to 
>>>>>>>>>>> change it to throw an NoSuchIcebergTableException if its a 
>>>>>>>>>>> non-Iceberg table, which I think is actually what this pr does?
>>>>>>>>>>> 
>>>>>>>>>>> Thanks
>>>>>>>>>>> Szehon
>>>>>>>>>>> 
>>>>>>>>>>> On Fri, Nov 22, 2024 at 1:08 AM Gabor Kaszab 
>>>>>>>>>>> <gaborkas...@cloudera.com.invalid> wrote:
>>>>>>>>>>>> Hey,
>>>>>>>>>>>> 
>>>>>>>>>>>> I think what Kevin says makes sense. However, it would then 
>>>>>>>>>>>> confuse the opposite use case of this function. Let's assume that 
>>>>>>>>>>>> we change the implementation of tableExists() to not load the 
>>>>>>>>>>>> table internally:
>>>>>>>>>>>> 
>>>>>>>>>>>> if (tableExists(table_name)) {
>>>>>>>>>>>>     table = loadTable(table_name);
>>>>>>>>>>>> }
>>>>>>>>>>>> 
>>>>>>>>>>>> Here, you find that the table exists but when you try to load it 
>>>>>>>>>>>> it fails because it's not an Iceberg table. I don't think that any 
>>>>>>>>>>>> of these 2 are intuitive. I think the question here is how much an 
>>>>>>>>>>>> API of the Iceberg table format should know about the existence of 
>>>>>>>>>>>> tables in other formats.
>>>>>>>>>>>> 
>>>>>>>>>>>>> If `tableExists` is meant to check for conflicting entries in the 
>>>>>>>>>>>>> HMS
>>>>>>>>>>>> Another interpretation of calling Catalog.tableExists() on an 
>>>>>>>>>>>> Iceberg API is instead "is there such an Iceberg table". TBH, not 
>>>>>>>>>>>> sure if any of the 2 approaches are better than the other, I just 
>>>>>>>>>>>> wanted to show that there is another side of the coin :)
>>>>>>>>>>>> 
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Gabor
>>>>>>>>>>>> 
>>>>>>>>>>>> On Fri, Nov 22, 2024 at 3:13 AM Kevin Liu <kevin.jq....@gmail.com 
>>>>>>>>>>>> <mailto:kevin.jq....@gmail.com>> wrote:
>>>>>>>>>>>>> Hi Steve,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> This makes sense to me. The semantics of `tableExists` focus on 
>>>>>>>>>>>>> whether a table's name exists in the catalog. For the Hive 
>>>>>>>>>>>>> catalog, checking the HMS entry should be sufficient.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I do have a question about usage, though. Typically, I would use 
>>>>>>>>>>>>> `tableExists` like this:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> ```
>>>>>>>>>>>>> if (!tableExists(table_name)) {
>>>>>>>>>>>>>     table = createTable(table_name);
>>>>>>>>>>>>> }
>>>>>>>>>>>>> ```
>>>>>>>>>>>>> What happens when a Hive table with the same name already exists 
>>>>>>>>>>>>> in the catalog? In the current implementation, `tableExists` 
>>>>>>>>>>>>> would return `false` because 
>>>>>>>>>>>>> `HiveOperationsBase.validateTableIsIceberg` throws a 
>>>>>>>>>>>>> `NoSuchTableException`. 
>>>>>>>>>>>>> This would cause the code above to attempt to create the table, 
>>>>>>>>>>>>> only to fail since the name already exists in the HMS.
>>>>>>>>>>>>> If `tableExists` is meant to check for conflicting entries in the 
>>>>>>>>>>>>> HMS, perhaps it should return true even when a Hive table with 
>>>>>>>>>>>>> the same name exists.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I’d love to hear your thoughts on this.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Kevin Liu
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Thu, Nov 21, 2024 at 5:22 PM Szehon Ho 
>>>>>>>>>>>>> <szehon.apa...@gmail.com <mailto:szehon.apa...@gmail.com>> wrote:
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> It's a good performance find and improvement.   Left some 
>>>>>>>>>>>>>> comment on the PR.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> IMO, the behavior actually more matches the API javadoc ("Check 
>>>>>>>>>>>>>> whether table exists"), not whether it is corrupted or not, so 
>>>>>>>>>>>>>> I'm supportive of it.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> Szehon
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Thu, Nov 21, 2024 at 10:57 AM Steve Zhang 
>>>>>>>>>>>>>> <hongyue_zh...@apple.com.invalid> wrote:
>>>>>>>>>>>>>>> Hi Iceberger,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>   I have a proposal to simplify the tableExists API in the Hive 
>>>>>>>>>>>>>>> catalog, which involves a behavior change, and I’d like to hear 
>>>>>>>>>>>>>>> your thoughts.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>   Currently, in our catalog interface[1], the tableExists 
>>>>>>>>>>>>>>> method is implemented as a default API by invoking the 
>>>>>>>>>>>>>>> loadTable method. It returns true if the table can be loaded 
>>>>>>>>>>>>>>> without exceptions. This behavior implies two checks:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> The table entry exists in the catalog.
>>>>>>>>>>>>>>> The latest metadata.json for the table is not corrupted.
>>>>>>>>>>>>>>>   The behavior change I’m proposing focuses only on the first 
>>>>>>>>>>>>>>> condition—checking if the table entry exists in the catalog. 
>>>>>>>>>>>>>>> This separates the concerns of table existence and table health 
>>>>>>>>>>>>>>> (e.g., metadata not corrupted). Such a change could improve the 
>>>>>>>>>>>>>>> performance of existence checks, especially for RESTcatalog 
>>>>>>>>>>>>>>> where table existence is abstracted as an HTTP HEAD request [2].
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I also reviewed the current usage of the tableExists API in the 
>>>>>>>>>>>>>>> Iceberg codebase to ensure that this optimization would not 
>>>>>>>>>>>>>>> have any negative impact.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I’d love to hear everyone’s feedback on this! If there’s 
>>>>>>>>>>>>>>> consensus, I can follow up with a similar optimization for the 
>>>>>>>>>>>>>>> viewExists method in the Hive catalog.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> [1]: https://github.com/apache/iceberg/pull/11597
>>>>>>>>>>>>>>> [2]: 
>>>>>>>>>>>>>>> https://github.com/apache/iceberg/blob/3badfe0c1fcf0c0adfc7aa4a10f0b50365c48cf9/open-api/rest-catalog-open-api.yaml#L1129-L1133
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>> Steve Zhang
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 

Reply via email to