Should add, my personal preference is probably not to change the existing
behavior for this part (false, if exists a Hive table with same name) at
the moment, just adding another possibility for consideration.

Thanks
Szehon

On Fri, Nov 22, 2024 at 2:00 AM Szehon Ho <szehon.apa...@gmail.com> wrote:

> Thanks Kevin and Gabor, this is an interesting discussion.  I guess a
> third option instead of returning true/false in this case, is to change it
> to throw an NoSuchIcebergTableException if its a non-Iceberg table, which I
> think is actually what this pr does?
>
> Thanks
> Szehon
>
> On Fri, Nov 22, 2024 at 1:08 AM Gabor Kaszab
> <gaborkas...@cloudera.com.invalid> wrote:
>
>> Hey,
>>
>> I think what Kevin says makes sense. However, it would then confuse the
>> opposite use case of this function. Let's assume that we change the
>> implementation of tableExists() to not load the table internally:
>>
>> if (tableExists(table_name)) {
>>     table = loadTable(table_name);
>> }
>>
>> Here, you find that the table exists but when you try to load it it fails
>> because it's not an Iceberg table. I don't think that any of these 2 are
>> intuitive. I think the question here is how much an API of the Iceberg
>> table format should know about the existence of tables in other formats.
>>
>> If `tableExists` is meant to check for conflicting entries in the HMS
>>
>> Another interpretation of calling Catalog.tableExists() on an Iceberg API
>> is instead "is there such an Iceberg table". TBH, not sure if any of the 2
>> approaches are better than the other, I just wanted to show that there is
>> another side of the coin :)
>>
>> Regards,
>> Gabor
>>
>> On Fri, Nov 22, 2024 at 3:13 AM Kevin Liu <kevin.jq....@gmail.com> wrote:
>>
>>> Hi Steve,
>>>
>>> This makes sense to me. The semantics of `tableExists` focus on whether
>>> a table's name exists in the catalog. For the Hive catalog, checking the
>>> HMS entry should be sufficient.
>>>
>>> I do have a question about usage, though. Typically, I would use `
>>> tableExists` like this:
>>>
>>> ```
>>> if (!tableExists(table_name)) {
>>>     table = createTable(table_name);
>>> }
>>> ```
>>> What happens when a Hive table with the same name already exists in the
>>> catalog? In the current implementation, `tableExists` would return `false`
>>> because `HiveOperationsBase.validateTableIsIceberg` throws a
>>> `NoSuchTableException`.
>>> This would cause the code above to attempt to create the table, only to
>>> fail since the name already exists in the HMS.
>>> If `tableExists` is meant to check for conflicting entries in the HMS,
>>> perhaps it should return true even when a Hive table with the same name
>>> exists.
>>>
>>> I’d love to hear your thoughts on this.
>>>
>>> Best,
>>> Kevin Liu
>>>
>>> On Thu, Nov 21, 2024 at 5:22 PM Szehon Ho <szehon.apa...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> It's a good performance find and improvement.   Left some comment on
>>>> the PR.
>>>>
>>>> IMO, the behavior actually more matches the API javadoc ("Check whether
>>>> table exists"), not whether it is corrupted or not, so I'm supportive of 
>>>> it.
>>>>
>>>> Thanks
>>>> Szehon
>>>>
>>>> On Thu, Nov 21, 2024 at 10:57 AM Steve Zhang
>>>> <hongyue_zh...@apple.com.invalid> wrote:
>>>>
>>>>> Hi Iceberger,
>>>>>
>>>>>   I have a proposal to simplify the tableExists API in the Hive
>>>>> catalog, which involves a behavior change, and I’d like to hear your
>>>>> thoughts.
>>>>>
>>>>>   Currently, in our catalog interface[1], the tableExists method is
>>>>> implemented as a default API by invoking the loadTable method. It
>>>>> returns true if the table can be loaded without exceptions. This
>>>>> behavior implies two checks:
>>>>>
>>>>>    1. The table entry exists in the catalog.
>>>>>    2. The latest metadata.json for the table is not corrupted.
>>>>>
>>>>>   The behavior change I’m proposing focuses only on the first
>>>>> condition—checking if the table entry exists in the catalog. This 
>>>>> separates
>>>>> the concerns of table existence and table health (e.g., metadata not
>>>>> corrupted). Such a change could improve the performance of existence
>>>>> checks, especially for RESTcatalog where table existence is abstracted as
>>>>> an HTTP HEAD request [2].
>>>>>
>>>>> I also reviewed the current usage of the tableExists API in the
>>>>> Iceberg codebase to ensure that this optimization would not have any
>>>>> negative impact.
>>>>>
>>>>> I’d love to hear everyone’s feedback on this! If there’s consensus, I
>>>>> can follow up with a similar optimization for the viewExists method
>>>>> in the Hive catalog.
>>>>>
>>>>> [1]: https://github.com/apache/iceberg/pull/11597
>>>>>
>>>>> [2]:
>>>>> https://github.com/apache/iceberg/blob/3badfe0c1fcf0c0adfc7aa4a10f0b50365c48cf9/open-api/rest-catalog-open-api.yaml#L1129-L1133
>>>>>
>>>>>
>>>>>
>>>>> Best regards,
>>>>> Steve Zhang
>>>>>
>>>>>
>>>>>
>>>>>

Reply via email to