Thanks Kevin and Gabor, this is an interesting discussion.  I guess a third
option instead of returning true/false in this case, is to change it to
throw an NoSuchIcebergTableException if its a non-Iceberg table, which I
think is actually what this pr does?

Thanks
Szehon

On Fri, Nov 22, 2024 at 1:08 AM Gabor Kaszab
<gaborkas...@cloudera.com.invalid> wrote:

> Hey,
>
> I think what Kevin says makes sense. However, it would then confuse the
> opposite use case of this function. Let's assume that we change the
> implementation of tableExists() to not load the table internally:
>
> if (tableExists(table_name)) {
>     table = loadTable(table_name);
> }
>
> Here, you find that the table exists but when you try to load it it fails
> because it's not an Iceberg table. I don't think that any of these 2 are
> intuitive. I think the question here is how much an API of the Iceberg
> table format should know about the existence of tables in other formats.
>
> If `tableExists` is meant to check for conflicting entries in the HMS
>
> Another interpretation of calling Catalog.tableExists() on an Iceberg API
> is instead "is there such an Iceberg table". TBH, not sure if any of the 2
> approaches are better than the other, I just wanted to show that there is
> another side of the coin :)
>
> Regards,
> Gabor
>
> On Fri, Nov 22, 2024 at 3:13 AM Kevin Liu <kevin.jq....@gmail.com> wrote:
>
>> Hi Steve,
>>
>> This makes sense to me. The semantics of `tableExists` focus on whether a
>> table's name exists in the catalog. For the Hive catalog, checking the HMS
>> entry should be sufficient.
>>
>> I do have a question about usage, though. Typically, I would use `
>> tableExists` like this:
>>
>> ```
>> if (!tableExists(table_name)) {
>>     table = createTable(table_name);
>> }
>> ```
>> What happens when a Hive table with the same name already exists in the
>> catalog? In the current implementation, `tableExists` would return `false`
>> because `HiveOperationsBase.validateTableIsIceberg` throws a
>> `NoSuchTableException`.
>> This would cause the code above to attempt to create the table, only to
>> fail since the name already exists in the HMS.
>> If `tableExists` is meant to check for conflicting entries in the HMS,
>> perhaps it should return true even when a Hive table with the same name
>> exists.
>>
>> I’d love to hear your thoughts on this.
>>
>> Best,
>> Kevin Liu
>>
>> On Thu, Nov 21, 2024 at 5:22 PM Szehon Ho <szehon.apa...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> It's a good performance find and improvement.   Left some comment on the
>>> PR.
>>>
>>> IMO, the behavior actually more matches the API javadoc ("Check whether
>>> table exists"), not whether it is corrupted or not, so I'm supportive of it.
>>>
>>> Thanks
>>> Szehon
>>>
>>> On Thu, Nov 21, 2024 at 10:57 AM Steve Zhang
>>> <hongyue_zh...@apple.com.invalid> wrote:
>>>
>>>> Hi Iceberger,
>>>>
>>>>   I have a proposal to simplify the tableExists API in the Hive
>>>> catalog, which involves a behavior change, and I’d like to hear your
>>>> thoughts.
>>>>
>>>>   Currently, in our catalog interface[1], the tableExists method is
>>>> implemented as a default API by invoking the loadTable method. It
>>>> returns true if the table can be loaded without exceptions. This
>>>> behavior implies two checks:
>>>>
>>>>    1. The table entry exists in the catalog.
>>>>    2. The latest metadata.json for the table is not corrupted.
>>>>
>>>>   The behavior change I’m proposing focuses only on the first
>>>> condition—checking if the table entry exists in the catalog. This separates
>>>> the concerns of table existence and table health (e.g., metadata not
>>>> corrupted). Such a change could improve the performance of existence
>>>> checks, especially for RESTcatalog where table existence is abstracted as
>>>> an HTTP HEAD request [2].
>>>>
>>>> I also reviewed the current usage of the tableExists API in the
>>>> Iceberg codebase to ensure that this optimization would not have any
>>>> negative impact.
>>>>
>>>> I’d love to hear everyone’s feedback on this! If there’s consensus, I
>>>> can follow up with a similar optimization for the viewExists method in
>>>> the Hive catalog.
>>>>
>>>> [1]: https://github.com/apache/iceberg/pull/11597
>>>>
>>>> [2]:
>>>> https://github.com/apache/iceberg/blob/3badfe0c1fcf0c0adfc7aa4a10f0b50365c48cf9/open-api/rest-catalog-open-api.yaml#L1129-L1133
>>>>
>>>>
>>>>
>>>> Best regards,
>>>> Steve Zhang
>>>>
>>>>
>>>>
>>>>

Reply via email to