Hm I think the thread got a bit sidetracked by the other question.

The initial proposal by Steve is a performance improvement for
HiveCatalog's tableExists().  Currently it loads both Hive and Iceberg
table metadata, and if successful returns true.  The proposal is to load
from Hive only, and return true if Hive metadata identifies that an Iceberg
table exists with this name.

Checking corruption of Iceberg's table metadata.json is a side-effect of
the current behavior, but would not anymore with the proposed change.
That's the question of the original thread, and so far there's agreement
that it is not necessarily part of this scope of HiveCatalog's
tableExists().

At least this is my understanding.
Thanks,
Szehon

On Wed, Nov 27, 2024 at 10:56 AM rdb...@gmail.com <rdb...@gmail.com> wrote:

> What kind of corruption are you referring to? I would expect corruption to
> result in an exception when loading the table, but that the table should
> still exist. The problem is likely that we determine if a table exists by
> attempting to load it. We could fix that by not attempting to load the
> table. I think that's a reasonable solution.
>
> On Wed, Nov 27, 2024 at 12:45 AM Manu Zhang <owenzhang1...@gmail.com>
> wrote:
>
>> The current behavior's intent is not to check whether the metadata is
>>> valid, it is to detect whether the table is an Iceberg table.
>>
>>
>> Is there a way to detect this from HiveCatalog without loading the table?
>>
>>
>> On Wed, Nov 27, 2024 at 2:01 PM Péter Váry <peter.vary.apa...@gmail.com>
>> wrote:
>>
>>> I think we have an agreement, not to change the behavior wrt existing
>>> non-Iceberg tables, and throw an exception.
>>>
>>> Are we also in agreement with the original proposal to return true when
>>> the table exists but the metadata is somehow corrupted? Note: this is the
>>> proposed change of behavior why the thread was originally started.
>>>
>>> On Tue, Nov 26, 2024, 21:30 rdb...@gmail.com <rdb...@gmail.com> wrote:
>>>
>>>> I'd argue against changing this. The current behavior's intent is not
>>>> to check whether the metadata is valid, it is to detect whether the table
>>>> is an Iceberg table. It ignores non-Iceberg tables. Changing that behavior
>>>> would be surprising, especially if we started throwing exceptions.
>>>>
>>>> On Fri, Nov 22, 2024 at 2:01 PM Kevin Liu <kevinjq...@apache.org>
>>>> wrote:
>>>>
>>>>> > Should add, my personal preference is probably not to change the
>>>>> existing behavior for this part
>>>>>
>>>>> +1. I realized that this is not a new behavior. The `loadTable`
>>>>> implementation has this problem too.
>>>>> It would be good to have a test case specifically for this edge case
>>>>> and maybe call this out in the documentation.
>>>>>
>>>>> Thanks,
>>>>> Kevin Liu
>>>>>
>>>>> On Fri, Nov 22, 2024 at 11:57 AM Szehon Ho <szehon.apa...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Should add, my personal preference is probably not to change the
>>>>>> existing behavior for this part (false, if exists a Hive table with same
>>>>>> name) at the moment, just adding another possibility for consideration.
>>>>>>
>>>>>> Thanks
>>>>>> Szehon
>>>>>>
>>>>>> On Fri, Nov 22, 2024 at 2:00 AM Szehon Ho <szehon.apa...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks Kevin and Gabor, this is an interesting discussion.  I guess
>>>>>>> a third option instead of returning true/false in this case, is to 
>>>>>>> change
>>>>>>> it to throw an NoSuchIcebergTableException if its a non-Iceberg table,
>>>>>>> which I think is actually what this pr does?
>>>>>>>
>>>>>>> Thanks
>>>>>>> Szehon
>>>>>>>
>>>>>>> On Fri, Nov 22, 2024 at 1:08 AM Gabor Kaszab
>>>>>>> <gaborkas...@cloudera.com.invalid> wrote:
>>>>>>>
>>>>>>>> Hey,
>>>>>>>>
>>>>>>>> I think what Kevin says makes sense. However, it would then confuse
>>>>>>>> the opposite use case of this function. Let's assume that we change the
>>>>>>>> implementation of tableExists() to not load the table internally:
>>>>>>>>
>>>>>>>> if (tableExists(table_name)) {
>>>>>>>>     table = loadTable(table_name);
>>>>>>>> }
>>>>>>>>
>>>>>>>> Here, you find that the table exists but when you try to load it it
>>>>>>>> fails because it's not an Iceberg table. I don't think that any of 
>>>>>>>> these 2
>>>>>>>> are intuitive. I think the question here is how much an API of the 
>>>>>>>> Iceberg
>>>>>>>> table format should know about the existence of tables in other 
>>>>>>>> formats.
>>>>>>>>
>>>>>>>> If `tableExists` is meant to check for conflicting entries in the
>>>>>>>>> HMS
>>>>>>>>
>>>>>>>> Another interpretation of calling Catalog.tableExists() on an
>>>>>>>> Iceberg API is instead "is there such an Iceberg table". TBH, not sure 
>>>>>>>> if
>>>>>>>> any of the 2 approaches are better than the other, I just wanted to 
>>>>>>>> show
>>>>>>>> that there is another side of the coin :)
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Gabor
>>>>>>>>
>>>>>>>> On Fri, Nov 22, 2024 at 3:13 AM Kevin Liu <kevin.jq....@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Steve,
>>>>>>>>>
>>>>>>>>> This makes sense to me. The semantics of `tableExists` focus on
>>>>>>>>> whether a table's name exists in the catalog. For the Hive catalog,
>>>>>>>>> checking the HMS entry should be sufficient.
>>>>>>>>>
>>>>>>>>> I do have a question about usage, though. Typically, I would use `
>>>>>>>>> tableExists` like this:
>>>>>>>>>
>>>>>>>>> ```
>>>>>>>>> if (!tableExists(table_name)) {
>>>>>>>>>     table = createTable(table_name);
>>>>>>>>> }
>>>>>>>>> ```
>>>>>>>>> What happens when a Hive table with the same name already exists
>>>>>>>>> in the catalog? In the current implementation, `tableExists` would 
>>>>>>>>> return
>>>>>>>>> `false` because `HiveOperationsBase.validateTableIsIceberg` throws a
>>>>>>>>> `NoSuchTableException`.
>>>>>>>>> This would cause the code above to attempt to create the table,
>>>>>>>>> only to fail since the name already exists in the HMS.
>>>>>>>>> If `tableExists` is meant to check for conflicting entries in the
>>>>>>>>> HMS, perhaps it should return true even when a Hive table with the 
>>>>>>>>> same
>>>>>>>>> name exists.
>>>>>>>>>
>>>>>>>>> I’d love to hear your thoughts on this.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Kevin Liu
>>>>>>>>>
>>>>>>>>> On Thu, Nov 21, 2024 at 5:22 PM Szehon Ho <szehon.apa...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> It's a good performance find and improvement.   Left some comment
>>>>>>>>>> on the PR.
>>>>>>>>>>
>>>>>>>>>> IMO, the behavior actually more matches the API javadoc ("Check
>>>>>>>>>> whether table exists"), not whether it is corrupted or not, so I'm
>>>>>>>>>> supportive of it.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Szehon
>>>>>>>>>>
>>>>>>>>>> On Thu, Nov 21, 2024 at 10:57 AM Steve Zhang
>>>>>>>>>> <hongyue_zh...@apple.com.invalid> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Iceberger,
>>>>>>>>>>>
>>>>>>>>>>>   I have a proposal to simplify the tableExists API in the Hive
>>>>>>>>>>> catalog, which involves a behavior change, and I’d like to hear your
>>>>>>>>>>> thoughts.
>>>>>>>>>>>
>>>>>>>>>>>   Currently, in our catalog interface[1], the tableExists method
>>>>>>>>>>> is implemented as a default API by invoking the loadTable method.
>>>>>>>>>>> It returns true if the table can be loaded without exceptions.
>>>>>>>>>>> This behavior implies two checks:
>>>>>>>>>>>
>>>>>>>>>>>    1. The table entry exists in the catalog.
>>>>>>>>>>>    2. The latest metadata.json for the table is not corrupted.
>>>>>>>>>>>
>>>>>>>>>>>   The behavior change I’m proposing focuses only on the first
>>>>>>>>>>> condition—checking if the table entry exists in the catalog. This 
>>>>>>>>>>> separates
>>>>>>>>>>> the concerns of table existence and table health (e.g., metadata not
>>>>>>>>>>> corrupted). Such a change could improve the performance of existence
>>>>>>>>>>> checks, especially for RESTcatalog where table existence is 
>>>>>>>>>>> abstracted as
>>>>>>>>>>> an HTTP HEAD request [2].
>>>>>>>>>>>
>>>>>>>>>>> I also reviewed the current usage of the tableExists API in the
>>>>>>>>>>> Iceberg codebase to ensure that this optimization would not have any
>>>>>>>>>>> negative impact.
>>>>>>>>>>>
>>>>>>>>>>> I’d love to hear everyone’s feedback on this! If there’s
>>>>>>>>>>> consensus, I can follow up with a similar optimization for the
>>>>>>>>>>> viewExists method in the Hive catalog.
>>>>>>>>>>>
>>>>>>>>>>> [1]: https://github.com/apache/iceberg/pull/11597
>>>>>>>>>>>
>>>>>>>>>>> [2]:
>>>>>>>>>>> https://github.com/apache/iceberg/blob/3badfe0c1fcf0c0adfc7aa4a10f0b50365c48cf9/open-api/rest-catalog-open-api.yaml#L1129-L1133
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Best regards,
>>>>>>>>>>> Steve Zhang
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>

Reply via email to