> Should add, my personal preference is probably not to change the existing
behavior for this part

+1. I realized that this is not a new behavior. The `loadTable`
implementation has this problem too.
It would be good to have a test case specifically for this edge case and
maybe call this out in the documentation.

Thanks,
Kevin Liu

On Fri, Nov 22, 2024 at 11:57 AM Szehon Ho <szehon.apa...@gmail.com> wrote:

> Should add, my personal preference is probably not to change the existing
> behavior for this part (false, if exists a Hive table with same name) at
> the moment, just adding another possibility for consideration.
>
> Thanks
> Szehon
>
> On Fri, Nov 22, 2024 at 2:00 AM Szehon Ho <szehon.apa...@gmail.com> wrote:
>
>> Thanks Kevin and Gabor, this is an interesting discussion.  I guess a
>> third option instead of returning true/false in this case, is to change it
>> to throw an NoSuchIcebergTableException if its a non-Iceberg table, which I
>> think is actually what this pr does?
>>
>> Thanks
>> Szehon
>>
>> On Fri, Nov 22, 2024 at 1:08 AM Gabor Kaszab
>> <gaborkas...@cloudera.com.invalid> wrote:
>>
>>> Hey,
>>>
>>> I think what Kevin says makes sense. However, it would then confuse the
>>> opposite use case of this function. Let's assume that we change the
>>> implementation of tableExists() to not load the table internally:
>>>
>>> if (tableExists(table_name)) {
>>>     table = loadTable(table_name);
>>> }
>>>
>>> Here, you find that the table exists but when you try to load it it
>>> fails because it's not an Iceberg table. I don't think that any of these 2
>>> are intuitive. I think the question here is how much an API of the Iceberg
>>> table format should know about the existence of tables in other formats.
>>>
>>> If `tableExists` is meant to check for conflicting entries in the HMS
>>>
>>> Another interpretation of calling Catalog.tableExists() on an Iceberg
>>> API is instead "is there such an Iceberg table". TBH, not sure if any of
>>> the 2 approaches are better than the other, I just wanted to show that
>>> there is another side of the coin :)
>>>
>>> Regards,
>>> Gabor
>>>
>>> On Fri, Nov 22, 2024 at 3:13 AM Kevin Liu <kevin.jq....@gmail.com>
>>> wrote:
>>>
>>>> Hi Steve,
>>>>
>>>> This makes sense to me. The semantics of `tableExists` focus on whether
>>>> a table's name exists in the catalog. For the Hive catalog, checking the
>>>> HMS entry should be sufficient.
>>>>
>>>> I do have a question about usage, though. Typically, I would use `
>>>> tableExists` like this:
>>>>
>>>> ```
>>>> if (!tableExists(table_name)) {
>>>>     table = createTable(table_name);
>>>> }
>>>> ```
>>>> What happens when a Hive table with the same name already exists in the
>>>> catalog? In the current implementation, `tableExists` would return `false`
>>>> because `HiveOperationsBase.validateTableIsIceberg` throws a
>>>> `NoSuchTableException`.
>>>> This would cause the code above to attempt to create the table, only to
>>>> fail since the name already exists in the HMS.
>>>> If `tableExists` is meant to check for conflicting entries in the HMS,
>>>> perhaps it should return true even when a Hive table with the same name
>>>> exists.
>>>>
>>>> I’d love to hear your thoughts on this.
>>>>
>>>> Best,
>>>> Kevin Liu
>>>>
>>>> On Thu, Nov 21, 2024 at 5:22 PM Szehon Ho <szehon.apa...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> It's a good performance find and improvement.   Left some comment on
>>>>> the PR.
>>>>>
>>>>> IMO, the behavior actually more matches the API javadoc ("Check
>>>>> whether table exists"), not whether it is corrupted or not, so I'm
>>>>> supportive of it.
>>>>>
>>>>> Thanks
>>>>> Szehon
>>>>>
>>>>> On Thu, Nov 21, 2024 at 10:57 AM Steve Zhang
>>>>> <hongyue_zh...@apple.com.invalid> wrote:
>>>>>
>>>>>> Hi Iceberger,
>>>>>>
>>>>>>   I have a proposal to simplify the tableExists API in the Hive
>>>>>> catalog, which involves a behavior change, and I’d like to hear your
>>>>>> thoughts.
>>>>>>
>>>>>>   Currently, in our catalog interface[1], the tableExists method is
>>>>>> implemented as a default API by invoking the loadTable method. It
>>>>>> returns true if the table can be loaded without exceptions. This
>>>>>> behavior implies two checks:
>>>>>>
>>>>>>    1. The table entry exists in the catalog.
>>>>>>    2. The latest metadata.json for the table is not corrupted.
>>>>>>
>>>>>>   The behavior change I’m proposing focuses only on the first
>>>>>> condition—checking if the table entry exists in the catalog. This 
>>>>>> separates
>>>>>> the concerns of table existence and table health (e.g., metadata not
>>>>>> corrupted). Such a change could improve the performance of existence
>>>>>> checks, especially for RESTcatalog where table existence is abstracted as
>>>>>> an HTTP HEAD request [2].
>>>>>>
>>>>>> I also reviewed the current usage of the tableExists API in the
>>>>>> Iceberg codebase to ensure that this optimization would not have any
>>>>>> negative impact.
>>>>>>
>>>>>> I’d love to hear everyone’s feedback on this! If there’s consensus, I
>>>>>> can follow up with a similar optimization for the viewExists method
>>>>>> in the Hive catalog.
>>>>>>
>>>>>> [1]: https://github.com/apache/iceberg/pull/11597
>>>>>>
>>>>>> [2]:
>>>>>> https://github.com/apache/iceberg/blob/3badfe0c1fcf0c0adfc7aa4a10f0b50365c48cf9/open-api/rest-catalog-open-api.yaml#L1129-L1133
>>>>>>
>>>>>>
>>>>>>
>>>>>> Best regards,
>>>>>> Steve Zhang
>>>>>>
>>>>>>
>>>>>>
>>>>>>

Reply via email to