Hey Ryan,

What is the timeline for `ProcedureCatalog` to be moved into Spark, and will it be backported? I agree 100% that it's the 'correct' way to go long term, but currently Iceberg has a `static final Map` [1] of valid procedures and no way for users to customize it. I personally don't love a static map regardless of the question of registering custom procedures; it's too error-prone for maintainers and procedure developers (especially now that there are three copies of it in the codebase). Additionally, asking teams to extend SparkCatalog, with all the associated Spark config changes for end users, just to add custom procedures seems a bit heavy-handed compared to the relatively small change of adding a registration mechanism to the existing ProcedureCatalog. A registration mechanism would also unblock teams on Spark <= 3.2 (and on any future Spark versions released before ProcedureCatalog is upstreamed).
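To make that concrete, here is a rough sketch of the kind of registration mechanism I have in mind (illustrative only -- this registry class and its register/find methods don't exist in Iceberg today, and I'm assuming the existing SparkProcedures.ProcedureBuilder as the builder type):

    import java.util.Locale;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.Supplier;
    import org.apache.iceberg.spark.procedures.SparkProcedures.ProcedureBuilder;

    // Hypothetical replacement for the static final map in [1]: built-ins
    // register at class init, and jars shipping custom procedures call
    // register() before the catalog is first used.
    public class SparkProcedureRegistry {
      private static final Map<String, Supplier<ProcedureBuilder>> BUILDERS =
          new ConcurrentHashMap<>();

      public static void register(String name, Supplier<ProcedureBuilder> builder) {
        BUILDERS.put(name.toLowerCase(Locale.ROOT), builder);
      }

      public static Supplier<ProcedureBuilder> find(String name) {
        return BUILDERS.get(name.toLowerCase(Locale.ROOT));
      }
    }

loadProcedure would then consult the registry instead of the static map, and custom procedures would ride along without any SparkCatalog subclassing.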
There already appears to be a private fork of Iceberg using ServiceLoader [2] and a draft PR taking a similar approach [3]. I agree with your comments on [3], and I wonder whether there is a middle ground: a registration method in line with Spark, and/or a way for Iceberg catalogs to specify their own procedures (though I am not sure how to do that in a cross-engine way). My goal here is to avoid everyone who wants to add custom procedures having to maintain their own implementation of SparkCatalog (a sketch of that boilerplate is at the end of this message). What do you think?

Best,
Ryan

[1] https://github.com/apache/iceberg/blob/master/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/procedures/SparkProcedures.java#L29
[2] https://github.com/apache/iceberg/issues/3254#issuecomment-943845848
[3] https://github.com/apache/iceberg/pull/3367

On Thu, Nov 11, 2021 at 2:03 AM Ryan Blue <b...@tabular.io> wrote:

> I think that probably the best way to handle this use case is to have
> people implement the Iceberg `ProcedureCatalog` API. That's what we want
> to get upstream into Spark, and it's a reasonable (and small) addition to
> Spark.
>
> The problem with adding pluggable procedures to Iceberg is that it really
> works around the fact that Spark doesn't support plugging in procedures
> yet. This is specific to Spark, and we would have to keep it alive well
> past the point when `ProcedureCatalog` lands upstream. It doesn't seem
> worth the additional complexity in Iceberg when you can plug in through
> the API that is intended to become Spark's own plugin API, if that makes
> sense.
>
> Ryan
>
> On Wed, Nov 10, 2021 at 6:54 AM Ajantha Bhat <ajanthab...@gmail.com>
> wrote:
>
>> Hi Community!
>>
>> If Iceberg provided a way to plug in procedures, it would be really
>> helpful: users could plug in their own Spark actions to handle their
>> business logic around Iceberg tables. So, can we have a mechanism that
>> allows plugging in additional implementations of
>> org.apache.spark.sql.connector.iceberg.catalog.Procedure for all users
>> of SparkCatalog and SparkSessionCatalog by just dropping in an
>> additional jar?
>>
>> Without this feature, users can still add a custom procedure by
>> extending SparkCatalog and/or SparkSessionCatalog and overriding
>> loadProcedure, but that requires them to configure the subclasses of
>> Spark[Session]Catalog in their Spark configuration. That is a lot of
>> work and not a clean way to handle this.
>>
>> Another option is to add these custom procedures as UDFs, but UDFs are
>> meant to be column-related; they don't make sense for Spark actions.
>>
>> So, I want to know what most of you think about having pluggable
>> procedures in Iceberg. Does this feature solve your problems too?
>>
>> Thanks,
>> Ajantha
>
> --
> Ryan Blue
> Tabular
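P.S. For reference, the workaround described above looks roughly like this (a sketch only; MyProcedure stands in for a custom implementation of the Procedure interface):

    import org.apache.iceberg.spark.SparkCatalog;
    import org.apache.spark.sql.catalyst.analysis.NoSuchProcedureException;
    import org.apache.spark.sql.connector.catalog.Identifier;
    import org.apache.spark.sql.connector.iceberg.catalog.Procedure;

    // Subclassing SparkCatalog just to serve one extra procedure; every end
    // user must then point their Spark config at this class.
    public class CustomSparkCatalog extends SparkCatalog {
      @Override
      public Procedure loadProcedure(Identifier ident) throws NoSuchProcedureException {
        if ("my_procedure".equals(ident.name())) {
          return new MyProcedure(this);  // hypothetical custom Procedure
        }
        return super.loadProcedure(ident);  // built-in Iceberg procedures
      }
    }

plus, in every user's Spark configuration:

    spark.sql.catalog.my_catalog = com.example.CustomSparkCatalog

That's the per-team boilerplate a registration mechanism would remove.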