Re: [DISCUSS] PyIceberg: Remove optional support for instance-level identifier in Catalog and Table APIs

Sung Yun Tue, 06 Aug 2024 12:45:39 -0700

Hi Kevin,

> The only potential issue I can see is if, for some reason, we would want
to refer to multiple catalogs in the same statement


Im in agreement with you. I can see how this could be useful for a session
level expression parser. But for now, we are supporting identifier and
row_filter parsing at the catalog and then at the table level APIs
respectively. I think it would be easier to remove them from these levels
and then add support for parsing them at the session level if we ever
decide to introduce that feature in the future.

> Perhaps we can start with a WARNING log whenever a catalog name is used
while allowing table identifiers both with and without the catalog name.
Then, in a future release, we can completely remove the catalog name.

Glad we are on the same page here as well. I’ve already taken a step at
producing a deprecation warning in 0.8.0 release on this PR:
https://github.com/apache/iceberg-python/pull/994

Sung

On Tue, Aug 6, 2024 at 3:37 PM Kevin Liu <kevin.jq....@gmail.com> wrote:

>
> Thanks, Sung for the great writeup and explaining the context.
>
>
> I am +1 to remove the catalog name as part of the table identifier. In
> PyIceberg, each instance of a catalog has an associated name. So any
> functions of the catalog instance must be related to that catalog name. In
> the case where a single REST endpoint can represent many different catalog
> instances, we can use multiple instances of the catalog object. This is
> similar to Trino/Spark’s `USE <catalog>` feature.
>
>
> The only potential issue I can see is if, for some reason, we would want
> to refer to multiple catalogs in the same statement, i.e. “SELECT * FROM
> catalog1.foo.bar join catalog2.foo.bar USING col_a”. But I don’t see this
> being used in any of the current functions.
>
>
> Regarding the deprecation, I think it’ll be good to do it in multiple
> steps. Perhaps we can start with a WARNING log whenever a catalog name is
> used while allowing table identifiers both with and without the catalog
> name. Then, in a future release, we can completely remove the catalog name.
>
>
>
> Thanks,
>
> Kevin Liu
>
> On Wed, Jul 31, 2024 at 2:56 PM Sung Yun <sungwy...@gmail.com> wrote:
>
>> Today in PyIceberg, we have support for identifier parsing in public APIs
>> belonging to two different classes:
>>
>>
>>    - Catalog class: load_table, purge_table, drop_table
>>    - Table class: scan
>>
>>
>> These APIs currently have optional support for the identifier that the
>> instance itself belongs to.
>>
>> For example, the catalog class APIs support:
>>
>>
>> *catalog = load_catalog(“animals”,
>> **properties)catalog.load_table(“cats.whiskers”)*
>>
>> But it also supports:
>>
>> *catalog.load_table(“animals.cats.whiskers”)*
>>
>> Which is redundant, because the catalog.name is already “animals”.
>>
>> Similarly, row_filter in the Table scan API supports:
>>
>>
>> *table = catalog.load_table(“cats.whiskers”)table.scan(row_filter=”n_legs
>> == 4”)*
>>
>> But we also support
>>
>>
>> *table.scan(row_filter=”whiskers.n_legs == 4”)*
>> Which is also redundant, because the table name is already “whiskers” (or
>> cats.whiskers)
>>
>> While it sounds like a good change, I’d still like to open this thread to
>> discuss the possibility of removing this optional support for the
>> instance-level identifier as it will result in a backward incompatible API
>> behavior change.
>>
>> The benefits of this change are as follows:
>>
>>    - As observed above, specifying instance-level identifier in these
>>    APIs is redundant
>>    - This optional support adds a lot of complexity to the code base and
>>    leads to issues like: #742
>>    <https://github.com/apache/iceberg-python/issues/742> It would be
>>    really great to clean this up before as we prepare for a 1.0.0 later this
>>    year
>>    - The optional support opens up the possibility of resulting in
>>    correctness issues if there exists a name in the level below as the
>>    instance-level identifier.
>>       - For example, if in the above catalog, we have a table namespace
>>       named “animals.lower” catalog.load(“animals.lower.cats”) can be 
>> construed
>>       as table name “cats” in the namespace “animals.lower” but it will be
>>       interpreted as table name “cats” in the namespace “lower” which is
>>       erroneous.
>>       - We would see a similar issue with tables and field names as
>>       well. Field name parsing is already complicated because we have to
>>       represent nested fields as flat representations. So it would be great 
>> to
>>       remove one unnecessary level of complication here
>>
>>
>> I'd love to hear from the community on their thoughts on this topic. If
>> there are any folks in the community using the optional feature, it would
>> be especially great to hear from you as well, on what this change will mean
>> for your applications.
>>
>> Related PR: #963 <https://github.com/apache/iceberg-python/pull/963>
>>
>> Sung
>>
>

Re: [DISCUSS] PyIceberg: Remove optional support for instance-level identifier in Catalog and Table APIs

Reply via email to