I think we have an agreement, not to change the behavior wrt existing
non-Iceberg tables, and throw an exception.

Are we also in agreement with the original proposal to return true when the
table exists but the metadata is somehow corrupted? Note: this is the
proposed change of behavior why the thread was originally started.

On Tue, Nov 26, 2024, 21:30 rdb...@gmail.com <rdb...@gmail.com> wrote:

> I'd argue against changing this. The current behavior's intent is not to
> check whether the metadata is valid, it is to detect whether the table is
> an Iceberg table. It ignores non-Iceberg tables. Changing that behavior
> would be surprising, especially if we started throwing exceptions.
>
> On Fri, Nov 22, 2024 at 2:01 PM Kevin Liu <kevinjq...@apache.org> wrote:
>
>> > Should add, my personal preference is probably not to change the
>> existing behavior for this part
>>
>> +1. I realized that this is not a new behavior. The `loadTable`
>> implementation has this problem too.
>> It would be good to have a test case specifically for this edge case and
>> maybe call this out in the documentation.
>>
>> Thanks,
>> Kevin Liu
>>
>> On Fri, Nov 22, 2024 at 11:57 AM Szehon Ho <szehon.apa...@gmail.com>
>> wrote:
>>
>>> Should add, my personal preference is probably not to change the
>>> existing behavior for this part (false, if exists a Hive table with same
>>> name) at the moment, just adding another possibility for consideration.
>>>
>>> Thanks
>>> Szehon
>>>
>>> On Fri, Nov 22, 2024 at 2:00 AM Szehon Ho <szehon.apa...@gmail.com>
>>> wrote:
>>>
>>>> Thanks Kevin and Gabor, this is an interesting discussion.  I guess a
>>>> third option instead of returning true/false in this case, is to change it
>>>> to throw an NoSuchIcebergTableException if its a non-Iceberg table, which I
>>>> think is actually what this pr does?
>>>>
>>>> Thanks
>>>> Szehon
>>>>
>>>> On Fri, Nov 22, 2024 at 1:08 AM Gabor Kaszab
>>>> <gaborkas...@cloudera.com.invalid> wrote:
>>>>
>>>>> Hey,
>>>>>
>>>>> I think what Kevin says makes sense. However, it would then confuse
>>>>> the opposite use case of this function. Let's assume that we change the
>>>>> implementation of tableExists() to not load the table internally:
>>>>>
>>>>> if (tableExists(table_name)) {
>>>>>     table = loadTable(table_name);
>>>>> }
>>>>>
>>>>> Here, you find that the table exists but when you try to load it it
>>>>> fails because it's not an Iceberg table. I don't think that any of these 2
>>>>> are intuitive. I think the question here is how much an API of the Iceberg
>>>>> table format should know about the existence of tables in other formats.
>>>>>
>>>>> If `tableExists` is meant to check for conflicting entries in the HMS
>>>>>
>>>>> Another interpretation of calling Catalog.tableExists() on an Iceberg
>>>>> API is instead "is there such an Iceberg table". TBH, not sure if any of
>>>>> the 2 approaches are better than the other, I just wanted to show that
>>>>> there is another side of the coin :)
>>>>>
>>>>> Regards,
>>>>> Gabor
>>>>>
>>>>> On Fri, Nov 22, 2024 at 3:13 AM Kevin Liu <kevin.jq....@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Steve,
>>>>>>
>>>>>> This makes sense to me. The semantics of `tableExists` focus on
>>>>>> whether a table's name exists in the catalog. For the Hive catalog,
>>>>>> checking the HMS entry should be sufficient.
>>>>>>
>>>>>> I do have a question about usage, though. Typically, I would use `
>>>>>> tableExists` like this:
>>>>>>
>>>>>> ```
>>>>>> if (!tableExists(table_name)) {
>>>>>>     table = createTable(table_name);
>>>>>> }
>>>>>> ```
>>>>>> What happens when a Hive table with the same name already exists in
>>>>>> the catalog? In the current implementation, `tableExists` would return
>>>>>> `false` because `HiveOperationsBase.validateTableIsIceberg` throws a
>>>>>> `NoSuchTableException`.
>>>>>> This would cause the code above to attempt to create the table, only
>>>>>> to fail since the name already exists in the HMS.
>>>>>> If `tableExists` is meant to check for conflicting entries in the
>>>>>> HMS, perhaps it should return true even when a Hive table with the same
>>>>>> name exists.
>>>>>>
>>>>>> I’d love to hear your thoughts on this.
>>>>>>
>>>>>> Best,
>>>>>> Kevin Liu
>>>>>>
>>>>>> On Thu, Nov 21, 2024 at 5:22 PM Szehon Ho <szehon.apa...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> It's a good performance find and improvement.   Left some comment on
>>>>>>> the PR.
>>>>>>>
>>>>>>> IMO, the behavior actually more matches the API javadoc ("Check
>>>>>>> whether table exists"), not whether it is corrupted or not, so I'm
>>>>>>> supportive of it.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Szehon
>>>>>>>
>>>>>>> On Thu, Nov 21, 2024 at 10:57 AM Steve Zhang
>>>>>>> <hongyue_zh...@apple.com.invalid> wrote:
>>>>>>>
>>>>>>>> Hi Iceberger,
>>>>>>>>
>>>>>>>>   I have a proposal to simplify the tableExists API in the Hive
>>>>>>>> catalog, which involves a behavior change, and I’d like to hear your
>>>>>>>> thoughts.
>>>>>>>>
>>>>>>>>   Currently, in our catalog interface[1], the tableExists method
>>>>>>>> is implemented as a default API by invoking the loadTable method.
>>>>>>>> It returns true if the table can be loaded without exceptions.
>>>>>>>> This behavior implies two checks:
>>>>>>>>
>>>>>>>>    1. The table entry exists in the catalog.
>>>>>>>>    2. The latest metadata.json for the table is not corrupted.
>>>>>>>>
>>>>>>>>   The behavior change I’m proposing focuses only on the first
>>>>>>>> condition—checking if the table entry exists in the catalog. This 
>>>>>>>> separates
>>>>>>>> the concerns of table existence and table health (e.g., metadata not
>>>>>>>> corrupted). Such a change could improve the performance of existence
>>>>>>>> checks, especially for RESTcatalog where table existence is abstracted 
>>>>>>>> as
>>>>>>>> an HTTP HEAD request [2].
>>>>>>>>
>>>>>>>> I also reviewed the current usage of the tableExists API in the
>>>>>>>> Iceberg codebase to ensure that this optimization would not have any
>>>>>>>> negative impact.
>>>>>>>>
>>>>>>>> I’d love to hear everyone’s feedback on this! If there’s consensus,
>>>>>>>> I can follow up with a similar optimization for the viewExists method
>>>>>>>> in the Hive catalog.
>>>>>>>>
>>>>>>>> [1]: https://github.com/apache/iceberg/pull/11597
>>>>>>>>
>>>>>>>> [2]:
>>>>>>>> https://github.com/apache/iceberg/blob/3badfe0c1fcf0c0adfc7aa4a10f0b50365c48cf9/open-api/rest-catalog-open-api.yaml#L1129-L1133
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Steve Zhang
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>

Reply via email to