Re: [DISCUSS] PyIceberg: Remove optional support for instance-level identifier in Catalog and Table APIs

Fokko Driesprong Thu, 08 Aug 2024 04:15:23 -0700

Hey Sung,

Thanks for raising this. This was also for a very long time on my list, but
I was reluctant to do this because of the incompatible change as you
already mentioned, however, I think it is good to remove this rather sooner
than later. I just went over the PR
<https://github.com/apache/iceberg-python/pull/994>, and it looks good.


For the use case that Kevin describes, I think it makes sense to implement
that on the SQL parsing side. There is little value in being able to
prepend the catalog name.

Kind regards,
Fokko

Op di 6 aug 2024 om 21:45 schreef Sung Yun <sungwy...@gmail.com>:

> Hi Kevin,
>
> > The only potential issue I can see is if, for some reason, we would
> want to refer to multiple catalogs in the same statement
>
> Im in agreement with you. I can see how this could be useful for a session
> level expression parser. But for now, we are supporting identifier and
> row_filter parsing at the catalog and then at the table level APIs
> respectively. I think it would be easier to remove them from these levels
> and then add support for parsing them at the session level if we ever
> decide to introduce that feature in the future.
>
> > Perhaps we can start with a WARNING log whenever a catalog name is used
> while allowing table identifiers both with and without the catalog name.
> Then, in a future release, we can completely remove the catalog name.
>
> Glad we are on the same page here as well. I’ve already taken a step at
> producing a deprecation warning in 0.8.0 release on this PR:
> https://github.com/apache/iceberg-python/pull/994
>
> Sung
>
> On Tue, Aug 6, 2024 at 3:37 PM Kevin Liu <kevin.jq....@gmail.com> wrote:
>
>>
>> Thanks, Sung for the great writeup and explaining the context.
>>
>>
>> I am +1 to remove the catalog name as part of the table identifier. In
>> PyIceberg, each instance of a catalog has an associated name. So any
>> functions of the catalog instance must be related to that catalog name. In
>> the case where a single REST endpoint can represent many different catalog
>> instances, we can use multiple instances of the catalog object. This is
>> similar to Trino/Spark’s `USE <catalog>` feature.
>>
>>
>> The only potential issue I can see is if, for some reason, we would want
>> to refer to multiple catalogs in the same statement, i.e. “SELECT * FROM
>> catalog1.foo.bar join catalog2.foo.bar USING col_a”. But I don’t see this
>> being used in any of the current functions.
>>
>>
>> Regarding the deprecation, I think it’ll be good to do it in multiple
>> steps. Perhaps we can start with a WARNING log whenever a catalog name is
>> used while allowing table identifiers both with and without the catalog
>> name. Then, in a future release, we can completely remove the catalog name.
>>
>>
>>
>> Thanks,
>>
>> Kevin Liu
>>
>> On Wed, Jul 31, 2024 at 2:56 PM Sung Yun <sungwy...@gmail.com> wrote:
>>
>>> Today in PyIceberg, we have support for identifier parsing in public
>>> APIs belonging to two different classes:
>>>
>>>
>>>    - Catalog class: load_table, purge_table, drop_table
>>>    - Table class: scan
>>>
>>>
>>> These APIs currently have optional support for the identifier that the
>>> instance itself belongs to.
>>>
>>> For example, the catalog class APIs support:
>>>
>>>
>>> *catalog = load_catalog(“animals”,
>>> **properties)catalog.load_table(“cats.whiskers”)*
>>>
>>> But it also supports:
>>>
>>> *catalog.load_table(“animals.cats.whiskers”)*
>>>
>>> Which is redundant, because the catalog.name is already “animals”.
>>>
>>> Similarly, row_filter in the Table scan API supports:
>>>
>>>
>>> *table =
>>> catalog.load_table(“cats.whiskers”)table.scan(row_filter=”n_legs == 4”)*
>>>
>>> But we also support
>>>
>>>
>>> *table.scan(row_filter=”whiskers.n_legs == 4”)*
>>> Which is also redundant, because the table name is already “whiskers”
>>> (or cats.whiskers)
>>>
>>> While it sounds like a good change, I’d still like to open this thread
>>> to discuss the possibility of removing this optional support for the
>>> instance-level identifier as it will result in a backward incompatible API
>>> behavior change.
>>>
>>> The benefits of this change are as follows:
>>>
>>>    - As observed above, specifying instance-level identifier in these
>>>    APIs is redundant
>>>    - This optional support adds a lot of complexity to the code base
>>>    and leads to issues like: #742
>>>    <https://github.com/apache/iceberg-python/issues/742> It would be
>>>    really great to clean this up before as we prepare for a 1.0.0 later this
>>>    year
>>>    - The optional support opens up the possibility of resulting in
>>>    correctness issues if there exists a name in the level below as the
>>>    instance-level identifier.
>>>       - For example, if in the above catalog, we have a table namespace
>>>       named “animals.lower” catalog.load(“animals.lower.cats”) can be 
>>> construed
>>>       as table name “cats” in the namespace “animals.lower” but it will be
>>>       interpreted as table name “cats” in the namespace “lower” which is
>>>       erroneous.
>>>       - We would see a similar issue with tables and field names as
>>>       well. Field name parsing is already complicated because we have to
>>>       represent nested fields as flat representations. So it would be great 
>>> to
>>>       remove one unnecessary level of complication here
>>>
>>>
>>> I'd love to hear from the community on their thoughts on this topic. If
>>> there are any folks in the community using the optional feature, it would
>>> be especially great to hear from you as well, on what this change will mean
>>> for your applications.
>>>
>>> Related PR: #963 <https://github.com/apache/iceberg-python/pull/963>
>>>
>>> Sung
>>>
>>

Re: [DISCUSS] PyIceberg: Remove optional support for instance-level identifier in Catalog and Table APIs

Reply via email to