>
> The current behavior's intent is not to check whether the metadata is
> valid, it is to detect whether the table is an Iceberg table.


Is there a way to detect this from HiveCatalog without loading the table?


On Wed, Nov 27, 2024 at 2:01 PM Péter Váry <peter.vary.apa...@gmail.com>
wrote:

> I think we have an agreement, not to change the behavior wrt existing
> non-Iceberg tables, and throw an exception.
>
> Are we also in agreement with the original proposal to return true when
> the table exists but the metadata is somehow corrupted? Note: this is the
> proposed change of behavior why the thread was originally started.
>
> On Tue, Nov 26, 2024, 21:30 rdb...@gmail.com <rdb...@gmail.com> wrote:
>
>> I'd argue against changing this. The current behavior's intent is not to
>> check whether the metadata is valid, it is to detect whether the table is
>> an Iceberg table. It ignores non-Iceberg tables. Changing that behavior
>> would be surprising, especially if we started throwing exceptions.
>>
>> On Fri, Nov 22, 2024 at 2:01 PM Kevin Liu <kevinjq...@apache.org> wrote:
>>
>>> > Should add, my personal preference is probably not to change the
>>> existing behavior for this part
>>>
>>> +1. I realized that this is not a new behavior. The `loadTable`
>>> implementation has this problem too.
>>> It would be good to have a test case specifically for this edge case and
>>> maybe call this out in the documentation.
>>>
>>> Thanks,
>>> Kevin Liu
>>>
>>> On Fri, Nov 22, 2024 at 11:57 AM Szehon Ho <szehon.apa...@gmail.com>
>>> wrote:
>>>
>>>> Should add, my personal preference is probably not to change the
>>>> existing behavior for this part (false, if exists a Hive table with same
>>>> name) at the moment, just adding another possibility for consideration.
>>>>
>>>> Thanks
>>>> Szehon
>>>>
>>>> On Fri, Nov 22, 2024 at 2:00 AM Szehon Ho <szehon.apa...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks Kevin and Gabor, this is an interesting discussion.  I guess a
>>>>> third option instead of returning true/false in this case, is to change it
>>>>> to throw an NoSuchIcebergTableException if its a non-Iceberg table, which 
>>>>> I
>>>>> think is actually what this pr does?
>>>>>
>>>>> Thanks
>>>>> Szehon
>>>>>
>>>>> On Fri, Nov 22, 2024 at 1:08 AM Gabor Kaszab
>>>>> <gaborkas...@cloudera.com.invalid> wrote:
>>>>>
>>>>>> Hey,
>>>>>>
>>>>>> I think what Kevin says makes sense. However, it would then confuse
>>>>>> the opposite use case of this function. Let's assume that we change the
>>>>>> implementation of tableExists() to not load the table internally:
>>>>>>
>>>>>> if (tableExists(table_name)) {
>>>>>>     table = loadTable(table_name);
>>>>>> }
>>>>>>
>>>>>> Here, you find that the table exists but when you try to load it it
>>>>>> fails because it's not an Iceberg table. I don't think that any of these 
>>>>>> 2
>>>>>> are intuitive. I think the question here is how much an API of the 
>>>>>> Iceberg
>>>>>> table format should know about the existence of tables in other formats.
>>>>>>
>>>>>> If `tableExists` is meant to check for conflicting entries in the HMS
>>>>>>
>>>>>> Another interpretation of calling Catalog.tableExists() on an Iceberg
>>>>>> API is instead "is there such an Iceberg table". TBH, not sure if any of
>>>>>> the 2 approaches are better than the other, I just wanted to show that
>>>>>> there is another side of the coin :)
>>>>>>
>>>>>> Regards,
>>>>>> Gabor
>>>>>>
>>>>>> On Fri, Nov 22, 2024 at 3:13 AM Kevin Liu <kevin.jq....@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Steve,
>>>>>>>
>>>>>>> This makes sense to me. The semantics of `tableExists` focus on
>>>>>>> whether a table's name exists in the catalog. For the Hive catalog,
>>>>>>> checking the HMS entry should be sufficient.
>>>>>>>
>>>>>>> I do have a question about usage, though. Typically, I would use `
>>>>>>> tableExists` like this:
>>>>>>>
>>>>>>> ```
>>>>>>> if (!tableExists(table_name)) {
>>>>>>>     table = createTable(table_name);
>>>>>>> }
>>>>>>> ```
>>>>>>> What happens when a Hive table with the same name already exists in
>>>>>>> the catalog? In the current implementation, `tableExists` would return
>>>>>>> `false` because `HiveOperationsBase.validateTableIsIceberg` throws a
>>>>>>> `NoSuchTableException`.
>>>>>>> This would cause the code above to attempt to create the table, only
>>>>>>> to fail since the name already exists in the HMS.
>>>>>>> If `tableExists` is meant to check for conflicting entries in the
>>>>>>> HMS, perhaps it should return true even when a Hive table with the same
>>>>>>> name exists.
>>>>>>>
>>>>>>> I’d love to hear your thoughts on this.
>>>>>>>
>>>>>>> Best,
>>>>>>> Kevin Liu
>>>>>>>
>>>>>>> On Thu, Nov 21, 2024 at 5:22 PM Szehon Ho <szehon.apa...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> It's a good performance find and improvement.   Left some comment
>>>>>>>> on the PR.
>>>>>>>>
>>>>>>>> IMO, the behavior actually more matches the API javadoc ("Check
>>>>>>>> whether table exists"), not whether it is corrupted or not, so I'm
>>>>>>>> supportive of it.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Szehon
>>>>>>>>
>>>>>>>> On Thu, Nov 21, 2024 at 10:57 AM Steve Zhang
>>>>>>>> <hongyue_zh...@apple.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> Hi Iceberger,
>>>>>>>>>
>>>>>>>>>   I have a proposal to simplify the tableExists API in the Hive
>>>>>>>>> catalog, which involves a behavior change, and I’d like to hear your
>>>>>>>>> thoughts.
>>>>>>>>>
>>>>>>>>>   Currently, in our catalog interface[1], the tableExists method
>>>>>>>>> is implemented as a default API by invoking the loadTable method.
>>>>>>>>> It returns true if the table can be loaded without exceptions.
>>>>>>>>> This behavior implies two checks:
>>>>>>>>>
>>>>>>>>>    1. The table entry exists in the catalog.
>>>>>>>>>    2. The latest metadata.json for the table is not corrupted.
>>>>>>>>>
>>>>>>>>>   The behavior change I’m proposing focuses only on the first
>>>>>>>>> condition—checking if the table entry exists in the catalog. This 
>>>>>>>>> separates
>>>>>>>>> the concerns of table existence and table health (e.g., metadata not
>>>>>>>>> corrupted). Such a change could improve the performance of existence
>>>>>>>>> checks, especially for RESTcatalog where table existence is 
>>>>>>>>> abstracted as
>>>>>>>>> an HTTP HEAD request [2].
>>>>>>>>>
>>>>>>>>> I also reviewed the current usage of the tableExists API in the
>>>>>>>>> Iceberg codebase to ensure that this optimization would not have any
>>>>>>>>> negative impact.
>>>>>>>>>
>>>>>>>>> I’d love to hear everyone’s feedback on this! If there’s
>>>>>>>>> consensus, I can follow up with a similar optimization for the
>>>>>>>>> viewExists method in the Hive catalog.
>>>>>>>>>
>>>>>>>>> [1]: https://github.com/apache/iceberg/pull/11597
>>>>>>>>>
>>>>>>>>> [2]:
>>>>>>>>> https://github.com/apache/iceberg/blob/3badfe0c1fcf0c0adfc7aa4a10f0b50365c48cf9/open-api/rest-catalog-open-api.yaml#L1129-L1133
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> Steve Zhang
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>

Reply via email to