Hey Sung, Thanks for raising this. This was also for a very long time on my list, but I was reluctant to do this because of the incompatible change as you already mentioned, however, I think it is good to remove this rather sooner than later. I just went over the PR <https://github.com/apache/iceberg-python/pull/994>, and it looks good.
For the use case that Kevin describes, I think it makes sense to implement that on the SQL parsing side. There is little value in being able to prepend the catalog name. Kind regards, Fokko Op di 6 aug 2024 om 21:45 schreef Sung Yun <sungwy...@gmail.com>: > Hi Kevin, > > > The only potential issue I can see is if, for some reason, we would > want to refer to multiple catalogs in the same statement > > Im in agreement with you. I can see how this could be useful for a session > level expression parser. But for now, we are supporting identifier and > row_filter parsing at the catalog and then at the table level APIs > respectively. I think it would be easier to remove them from these levels > and then add support for parsing them at the session level if we ever > decide to introduce that feature in the future. > > > Perhaps we can start with a WARNING log whenever a catalog name is used > while allowing table identifiers both with and without the catalog name. > Then, in a future release, we can completely remove the catalog name. > > Glad we are on the same page here as well. I’ve already taken a step at > producing a deprecation warning in 0.8.0 release on this PR: > https://github.com/apache/iceberg-python/pull/994 > > Sung > > On Tue, Aug 6, 2024 at 3:37 PM Kevin Liu <kevin.jq....@gmail.com> wrote: > >> >> Thanks, Sung for the great writeup and explaining the context. >> >> >> I am +1 to remove the catalog name as part of the table identifier. In >> PyIceberg, each instance of a catalog has an associated name. So any >> functions of the catalog instance must be related to that catalog name. In >> the case where a single REST endpoint can represent many different catalog >> instances, we can use multiple instances of the catalog object. This is >> similar to Trino/Spark’s `USE <catalog>` feature. >> >> >> The only potential issue I can see is if, for some reason, we would want >> to refer to multiple catalogs in the same statement, i.e. “SELECT * FROM >> catalog1.foo.bar join catalog2.foo.bar USING col_a”. But I don’t see this >> being used in any of the current functions. >> >> >> Regarding the deprecation, I think it’ll be good to do it in multiple >> steps. Perhaps we can start with a WARNING log whenever a catalog name is >> used while allowing table identifiers both with and without the catalog >> name. Then, in a future release, we can completely remove the catalog name. >> >> >> >> Thanks, >> >> Kevin Liu >> >> On Wed, Jul 31, 2024 at 2:56 PM Sung Yun <sungwy...@gmail.com> wrote: >> >>> Today in PyIceberg, we have support for identifier parsing in public >>> APIs belonging to two different classes: >>> >>> >>> - Catalog class: load_table, purge_table, drop_table >>> - Table class: scan >>> >>> >>> These APIs currently have optional support for the identifier that the >>> instance itself belongs to. >>> >>> For example, the catalog class APIs support: >>> >>> >>> *catalog = load_catalog(“animals”, >>> **properties)catalog.load_table(“cats.whiskers”)* >>> >>> But it also supports: >>> >>> *catalog.load_table(“animals.cats.whiskers”)* >>> >>> Which is redundant, because the catalog.name is already “animals”. >>> >>> Similarly, row_filter in the Table scan API supports: >>> >>> >>> *table = >>> catalog.load_table(“cats.whiskers”)table.scan(row_filter=”n_legs == 4”)* >>> >>> But we also support >>> >>> >>> *table.scan(row_filter=”whiskers.n_legs == 4”)* >>> Which is also redundant, because the table name is already “whiskers” >>> (or cats.whiskers) >>> >>> While it sounds like a good change, I’d still like to open this thread >>> to discuss the possibility of removing this optional support for the >>> instance-level identifier as it will result in a backward incompatible API >>> behavior change. >>> >>> The benefits of this change are as follows: >>> >>> - As observed above, specifying instance-level identifier in these >>> APIs is redundant >>> - This optional support adds a lot of complexity to the code base >>> and leads to issues like: #742 >>> <https://github.com/apache/iceberg-python/issues/742> It would be >>> really great to clean this up before as we prepare for a 1.0.0 later this >>> year >>> - The optional support opens up the possibility of resulting in >>> correctness issues if there exists a name in the level below as the >>> instance-level identifier. >>> - For example, if in the above catalog, we have a table namespace >>> named “animals.lower” catalog.load(“animals.lower.cats”) can be >>> construed >>> as table name “cats” in the namespace “animals.lower” but it will be >>> interpreted as table name “cats” in the namespace “lower” which is >>> erroneous. >>> - We would see a similar issue with tables and field names as >>> well. Field name parsing is already complicated because we have to >>> represent nested fields as flat representations. So it would be great >>> to >>> remove one unnecessary level of complication here >>> >>> >>> I'd love to hear from the community on their thoughts on this topic. If >>> there are any folks in the community using the optional feature, it would >>> be especially great to hear from you as well, on what this change will mean >>> for your applications. >>> >>> Related PR: #963 <https://github.com/apache/iceberg-python/pull/963> >>> >>> Sung >>> >>