Hey,

I think what Kevin says makes sense. However, it would then confuse the
opposite use case of this function. Let's assume that we change the
implementation of tableExists() to not load the table internally:

if (tableExists(table_name)) {
    table = loadTable(table_name);
}

Here, you find that the table exists but when you try to load it it fails
because it's not an Iceberg table. I don't think that any of these 2 are
intuitive. I think the question here is how much an API of the Iceberg
table format should know about the existence of tables in other formats.

If `tableExists` is meant to check for conflicting entries in the HMS

Another interpretation of calling Catalog.tableExists() on an Iceberg API
is instead "is there such an Iceberg table". TBH, not sure if any of the 2
approaches are better than the other, I just wanted to show that there is
another side of the coin :)

Regards,
Gabor

On Fri, Nov 22, 2024 at 3:13 AM Kevin Liu <kevin.jq....@gmail.com> wrote:

> Hi Steve,
>
> This makes sense to me. The semantics of `tableExists` focus on whether a
> table's name exists in the catalog. For the Hive catalog, checking the HMS
> entry should be sufficient.
>
> I do have a question about usage, though. Typically, I would use `
> tableExists` like this:
>
> ```
> if (!tableExists(table_name)) {
>     table = createTable(table_name);
> }
> ```
> What happens when a Hive table with the same name already exists in the
> catalog? In the current implementation, `tableExists` would return `false`
> because `HiveOperationsBase.validateTableIsIceberg` throws a
> `NoSuchTableException`.
> This would cause the code above to attempt to create the table, only to
> fail since the name already exists in the HMS.
> If `tableExists` is meant to check for conflicting entries in the HMS,
> perhaps it should return true even when a Hive table with the same name
> exists.
>
> I’d love to hear your thoughts on this.
>
> Best,
> Kevin Liu
>
> On Thu, Nov 21, 2024 at 5:22 PM Szehon Ho <szehon.apa...@gmail.com> wrote:
>
>> Hi,
>>
>> It's a good performance find and improvement.   Left some comment on the
>> PR.
>>
>> IMO, the behavior actually more matches the API javadoc ("Check whether
>> table exists"), not whether it is corrupted or not, so I'm supportive of it.
>>
>> Thanks
>> Szehon
>>
>> On Thu, Nov 21, 2024 at 10:57 AM Steve Zhang
>> <hongyue_zh...@apple.com.invalid> wrote:
>>
>>> Hi Iceberger,
>>>
>>>   I have a proposal to simplify the tableExists API in the Hive
>>> catalog, which involves a behavior change, and I’d like to hear your
>>> thoughts.
>>>
>>>   Currently, in our catalog interface[1], the tableExists method is
>>> implemented as a default API by invoking the loadTable method. It
>>> returns true if the table can be loaded without exceptions. This
>>> behavior implies two checks:
>>>
>>>    1. The table entry exists in the catalog.
>>>    2. The latest metadata.json for the table is not corrupted.
>>>
>>>   The behavior change I’m proposing focuses only on the first
>>> condition—checking if the table entry exists in the catalog. This separates
>>> the concerns of table existence and table health (e.g., metadata not
>>> corrupted). Such a change could improve the performance of existence
>>> checks, especially for RESTcatalog where table existence is abstracted as
>>> an HTTP HEAD request [2].
>>>
>>> I also reviewed the current usage of the tableExists API in the Iceberg
>>> codebase to ensure that this optimization would not have any negative
>>> impact.
>>>
>>> I’d love to hear everyone’s feedback on this! If there’s consensus, I
>>> can follow up with a similar optimization for the viewExists method in
>>> the Hive catalog.
>>>
>>> [1]: https://github.com/apache/iceberg/pull/11597
>>>
>>> [2]:
>>> https://github.com/apache/iceberg/blob/3badfe0c1fcf0c0adfc7aa4a10f0b50365c48cf9/open-api/rest-catalog-open-api.yaml#L1129-L1133
>>>
>>>
>>>
>>> Best regards,
>>> Steve Zhang
>>>
>>>
>>>
>>>

Reply via email to