+1 from my side too.

I wanted to make sure that the community is aware of this change which will
bring behavioral difference compared to other catalogs. This is why I have
asked Steve to start this thread.


On Thu, Nov 28, 2024, 02:10 Szehon Ho <szehon.apa...@gmail.com> wrote:

> Yea, I think that part is definitely kept.
>
> Thanks
> Szehon
>
> On Wed, Nov 27, 2024 at 12:02 PM rdb...@gmail.com <rdb...@gmail.com>
> wrote:
>
>> I'd support changing the behavior if we still have a way to match the
>> intent, which is to return true if the table exists in Hive and is an
>> Iceberg table.
>>
>> On Wed, Nov 27, 2024 at 11:26 AM Szehon Ho <szehon.apa...@gmail.com>
>> wrote:
>>
>>> Hm I think the thread got a bit sidetracked by the other question.
>>>
>>> The initial proposal by Steve is a performance improvement for
>>> HiveCatalog's tableExists().  Currently it loads both Hive and Iceberg
>>> table metadata, and if successful returns true.  The proposal is to load
>>> from Hive only, and return true if Hive metadata identifies that an Iceberg
>>> table exists with this name.
>>>
>>> Checking corruption of Iceberg's table metadata.json is a side-effect of
>>> the current behavior, but would not anymore with the proposed change.
>>> That's the question of the original thread, and so far there's agreement
>>> that it is not necessarily part of this scope of HiveCatalog's
>>> tableExists().
>>>
>>> At least this is my understanding.
>>> Thanks,
>>> Szehon
>>>
>>> On Wed, Nov 27, 2024 at 10:56 AM rdb...@gmail.com <rdb...@gmail.com>
>>> wrote:
>>>
>>>> What kind of corruption are you referring to? I would expect corruption
>>>> to result in an exception when loading the table, but that the table should
>>>> still exist. The problem is likely that we determine if a table exists by
>>>> attempting to load it. We could fix that by not attempting to load the
>>>> table. I think that's a reasonable solution.
>>>>
>>>> On Wed, Nov 27, 2024 at 12:45 AM Manu Zhang <owenzhang1...@gmail.com>
>>>> wrote:
>>>>
>>>>> The current behavior's intent is not to check whether the metadata is
>>>>>> valid, it is to detect whether the table is an Iceberg table.
>>>>>
>>>>>
>>>>> Is there a way to detect this from HiveCatalog without loading the
>>>>> table?
>>>>>
>>>>>
>>>>> On Wed, Nov 27, 2024 at 2:01 PM Péter Váry <
>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>
>>>>>> I think we have an agreement, not to change the behavior wrt existing
>>>>>> non-Iceberg tables, and throw an exception.
>>>>>>
>>>>>> Are we also in agreement with the original proposal to return true
>>>>>> when the table exists but the metadata is somehow corrupted? Note: this 
>>>>>> is
>>>>>> the proposed change of behavior why the thread was originally started.
>>>>>>
>>>>>> On Tue, Nov 26, 2024, 21:30 rdb...@gmail.com <rdb...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I'd argue against changing this. The current behavior's intent is
>>>>>>> not to check whether the metadata is valid, it is to detect whether the
>>>>>>> table is an Iceberg table. It ignores non-Iceberg tables. Changing that
>>>>>>> behavior would be surprising, especially if we started throwing 
>>>>>>> exceptions.
>>>>>>>
>>>>>>> On Fri, Nov 22, 2024 at 2:01 PM Kevin Liu <kevinjq...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> > Should add, my personal preference is probably not to change the
>>>>>>>> existing behavior for this part
>>>>>>>>
>>>>>>>> +1. I realized that this is not a new behavior. The `loadTable`
>>>>>>>> implementation has this problem too.
>>>>>>>> It would be good to have a test case specifically for this edge
>>>>>>>> case and maybe call this out in the documentation.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Kevin Liu
>>>>>>>>
>>>>>>>> On Fri, Nov 22, 2024 at 11:57 AM Szehon Ho <szehon.apa...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Should add, my personal preference is probably not to change the
>>>>>>>>> existing behavior for this part (false, if exists a Hive table with 
>>>>>>>>> same
>>>>>>>>> name) at the moment, just adding another possibility for 
>>>>>>>>> consideration.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Szehon
>>>>>>>>>
>>>>>>>>> On Fri, Nov 22, 2024 at 2:00 AM Szehon Ho <szehon.apa...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Kevin and Gabor, this is an interesting discussion.  I
>>>>>>>>>> guess a third option instead of returning true/false in this case, 
>>>>>>>>>> is to
>>>>>>>>>> change it to throw an NoSuchIcebergTableException if its a 
>>>>>>>>>> non-Iceberg
>>>>>>>>>> table, which I think is actually what this pr does?
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Szehon
>>>>>>>>>>
>>>>>>>>>> On Fri, Nov 22, 2024 at 1:08 AM Gabor Kaszab
>>>>>>>>>> <gaborkas...@cloudera.com.invalid> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey,
>>>>>>>>>>>
>>>>>>>>>>> I think what Kevin says makes sense. However, it would then
>>>>>>>>>>> confuse the opposite use case of this function. Let's assume that 
>>>>>>>>>>> we change
>>>>>>>>>>> the implementation of tableExists() to not load the table 
>>>>>>>>>>> internally:
>>>>>>>>>>>
>>>>>>>>>>> if (tableExists(table_name)) {
>>>>>>>>>>>     table = loadTable(table_name);
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> Here, you find that the table exists but when you try to load it
>>>>>>>>>>> it fails because it's not an Iceberg table. I don't think that any 
>>>>>>>>>>> of these
>>>>>>>>>>> 2 are intuitive. I think the question here is how much an API of the
>>>>>>>>>>> Iceberg table format should know about the existence of tables in 
>>>>>>>>>>> other
>>>>>>>>>>> formats.
>>>>>>>>>>>
>>>>>>>>>>> If `tableExists` is meant to check for conflicting entries in
>>>>>>>>>>>> the HMS
>>>>>>>>>>>
>>>>>>>>>>> Another interpretation of calling Catalog.tableExists() on an
>>>>>>>>>>> Iceberg API is instead "is there such an Iceberg table". TBH, not 
>>>>>>>>>>> sure if
>>>>>>>>>>> any of the 2 approaches are better than the other, I just wanted to 
>>>>>>>>>>> show
>>>>>>>>>>> that there is another side of the coin :)
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Gabor
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Nov 22, 2024 at 3:13 AM Kevin Liu <
>>>>>>>>>>> kevin.jq....@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Steve,
>>>>>>>>>>>>
>>>>>>>>>>>> This makes sense to me. The semantics of `tableExists` focus on
>>>>>>>>>>>> whether a table's name exists in the catalog. For the Hive catalog,
>>>>>>>>>>>> checking the HMS entry should be sufficient.
>>>>>>>>>>>>
>>>>>>>>>>>> I do have a question about usage, though. Typically, I would
>>>>>>>>>>>> use `tableExists` like this:
>>>>>>>>>>>>
>>>>>>>>>>>> ```
>>>>>>>>>>>> if (!tableExists(table_name)) {
>>>>>>>>>>>>     table = createTable(table_name);
>>>>>>>>>>>> }
>>>>>>>>>>>> ```
>>>>>>>>>>>> What happens when a Hive table with the same name already
>>>>>>>>>>>> exists in the catalog? In the current implementation, 
>>>>>>>>>>>> `tableExists` would
>>>>>>>>>>>> return `false` because `HiveOperationsBase.validateTableIsIceberg` 
>>>>>>>>>>>> throws a
>>>>>>>>>>>> `NoSuchTableException`.
>>>>>>>>>>>> This would cause the code above to attempt to create the table,
>>>>>>>>>>>> only to fail since the name already exists in the HMS.
>>>>>>>>>>>> If `tableExists` is meant to check for conflicting entries in
>>>>>>>>>>>> the HMS, perhaps it should return true even when a Hive table with 
>>>>>>>>>>>> the same
>>>>>>>>>>>> name exists.
>>>>>>>>>>>>
>>>>>>>>>>>> I’d love to hear your thoughts on this.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Kevin Liu
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Nov 21, 2024 at 5:22 PM Szehon Ho <
>>>>>>>>>>>> szehon.apa...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> It's a good performance find and improvement.   Left some
>>>>>>>>>>>>> comment on the PR.
>>>>>>>>>>>>>
>>>>>>>>>>>>> IMO, the behavior actually more matches the API javadoc
>>>>>>>>>>>>> ("Check whether table exists"), not whether it is corrupted or 
>>>>>>>>>>>>> not, so I'm
>>>>>>>>>>>>> supportive of it.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> Szehon
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Nov 21, 2024 at 10:57 AM Steve Zhang
>>>>>>>>>>>>> <hongyue_zh...@apple.com.invalid> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Iceberger,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   I have a proposal to simplify the tableExists API in the
>>>>>>>>>>>>>> Hive catalog, which involves a behavior change, and I’d like to 
>>>>>>>>>>>>>> hear your
>>>>>>>>>>>>>> thoughts.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   Currently, in our catalog interface[1], the tableExists method
>>>>>>>>>>>>>> is implemented as a default API by invoking the loadTable method.
>>>>>>>>>>>>>> It returns true if the table can be loaded without
>>>>>>>>>>>>>> exceptions. This behavior implies two checks:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    1. The table entry exists in the catalog.
>>>>>>>>>>>>>>    2. The latest metadata.json for the table is not
>>>>>>>>>>>>>>    corrupted.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   The behavior change I’m proposing focuses only on the first
>>>>>>>>>>>>>> condition—checking if the table entry exists in the catalog. 
>>>>>>>>>>>>>> This separates
>>>>>>>>>>>>>> the concerns of table existence and table health (e.g., metadata 
>>>>>>>>>>>>>> not
>>>>>>>>>>>>>> corrupted). Such a change could improve the performance of 
>>>>>>>>>>>>>> existence
>>>>>>>>>>>>>> checks, especially for RESTcatalog where table existence is 
>>>>>>>>>>>>>> abstracted as
>>>>>>>>>>>>>> an HTTP HEAD request [2].
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I also reviewed the current usage of the tableExists API in
>>>>>>>>>>>>>> the Iceberg codebase to ensure that this optimization would not 
>>>>>>>>>>>>>> have any
>>>>>>>>>>>>>> negative impact.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I’d love to hear everyone’s feedback on this! If there’s
>>>>>>>>>>>>>> consensus, I can follow up with a similar optimization for the
>>>>>>>>>>>>>> viewExists method in the Hive catalog.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1]: https://github.com/apache/iceberg/pull/11597
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [2]:
>>>>>>>>>>>>>> https://github.com/apache/iceberg/blob/3badfe0c1fcf0c0adfc7aa4a10f0b50365c48cf9/open-api/rest-catalog-open-api.yaml#L1129-L1133
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>> Steve Zhang
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>

Reply via email to