Hi Steve,

This makes sense to me. The semantics of `tableExists` focus on whether a
table's name exists in the catalog. For the Hive catalog, checking the HMS
entry should be sufficient.

I do have a question about usage, though. Typically, I would use `
tableExists` like this:

```
if (!tableExists(table_name)) {
    table = createTable(table_name);
}
```
What happens when a Hive table with the same name already exists in the
catalog? In the current implementation, `tableExists` would return `false`
because `HiveOperationsBase.validateTableIsIceberg` throws a
`NoSuchTableException`.
This would cause the code above to attempt to create the table, only to
fail since the name already exists in the HMS.
If `tableExists` is meant to check for conflicting entries in the HMS,
perhaps it should return true even when a Hive table with the same name
exists.

I’d love to hear your thoughts on this.

Best,
Kevin Liu

On Thu, Nov 21, 2024 at 5:22 PM Szehon Ho <szehon.apa...@gmail.com> wrote:

> Hi,
>
> It's a good performance find and improvement.   Left some comment on the
> PR.
>
> IMO, the behavior actually more matches the API javadoc ("Check whether
> table exists"), not whether it is corrupted or not, so I'm supportive of it.
>
> Thanks
> Szehon
>
> On Thu, Nov 21, 2024 at 10:57 AM Steve Zhang
> <hongyue_zh...@apple.com.invalid> wrote:
>
>> Hi Iceberger,
>>
>>   I have a proposal to simplify the tableExists API in the Hive catalog,
>> which involves a behavior change, and I’d like to hear your thoughts.
>>
>>   Currently, in our catalog interface[1], the tableExists method is
>> implemented as a default API by invoking the loadTable method. It
>> returns true if the table can be loaded without exceptions. This
>> behavior implies two checks:
>>
>>    1. The table entry exists in the catalog.
>>    2. The latest metadata.json for the table is not corrupted.
>>
>>   The behavior change I’m proposing focuses only on the first
>> condition—checking if the table entry exists in the catalog. This separates
>> the concerns of table existence and table health (e.g., metadata not
>> corrupted). Such a change could improve the performance of existence
>> checks, especially for RESTcatalog where table existence is abstracted as
>> an HTTP HEAD request [2].
>>
>> I also reviewed the current usage of the tableExists API in the Iceberg
>> codebase to ensure that this optimization would not have any negative
>> impact.
>>
>> I’d love to hear everyone’s feedback on this! If there’s consensus, I can
>> follow up with a similar optimization for the viewExists method in the
>> Hive catalog.
>>
>> [1]: https://github.com/apache/iceberg/pull/11597
>>
>> [2]:
>> https://github.com/apache/iceberg/blob/3badfe0c1fcf0c0adfc7aa4a10f0b50365c48cf9/open-api/rest-catalog-open-api.yaml#L1129-L1133
>>
>>
>>
>> Best regards,
>> Steve Zhang
>>
>>
>>
>>

Reply via email to