Hi Ryan Blue and Ryan Murray,
*Thanks for giving your inputs, but I think we still need to reach a conclusion on this.*

@Ryan Blue:

> You shouldn't need to extend Iceberg's SparkCatalog to plug in stored procedures. The Iceberg Spark extensions should support stored procedures exposed by any catalog plugin that implements `ProcedureCatalog` across the Spark versions where Iceberg has stored procedures.

*ProcedureCatalog::loadProcedure* uses *ProcedureBuilder*, which still needs a catalog. So, how do we achieve this without extending the catalog? Some examples of *an external project using ProcedureCatalog with just a dependency on the Iceberg jars* would help clear up my doubts.
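To make the question concrete, below is roughly the kind of standalone plugin I understand you to be suggesting. This is only a sketch of my understanding, not tested code: the class and procedure names (MyProcedureCatalog, my_procedure) are made up, and only the interfaces come from the Iceberg Spark runtime jar.

    package com.example;

    import org.apache.spark.sql.catalyst.InternalRow;
    import org.apache.spark.sql.connector.catalog.Identifier;
    import org.apache.spark.sql.connector.iceberg.catalog.Procedure;
    import org.apache.spark.sql.connector.iceberg.catalog.ProcedureCatalog;
    import org.apache.spark.sql.connector.iceberg.catalog.ProcedureParameter;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;
    import org.apache.spark.sql.util.CaseInsensitiveStringMap;

    // Hypothetical external plugin: implements ProcedureCatalog directly,
    // without extending Iceberg's SparkCatalog and without ProcedureBuilder.
    public class MyProcedureCatalog implements ProcedureCatalog {
      private String catalogName;

      @Override
      public void initialize(String name, CaseInsensitiveStringMap options) {
        this.catalogName = name;
      }

      @Override
      public String name() {
        return catalogName;
      }

      @Override
      public Procedure loadProcedure(Identifier ident) {
        if ("my_procedure".equals(ident.name())) {
          return new MyProcedure();
        }
        // a real implementation would throw NoSuchProcedureException here
        throw new IllegalArgumentException("Unknown procedure: " + ident);
      }

      // Hypothetical procedure: takes a table name, runs some custom Spark
      // action around the Iceberg table, and returns a single result column.
      static class MyProcedure implements Procedure {
        @Override
        public ProcedureParameter[] parameters() {
          return new ProcedureParameter[] {
              ProcedureParameter.required("table", DataTypes.StringType)
          };
        }

        @Override
        public StructType outputType() {
          return new StructType().add("result", DataTypes.StringType);
        }

        @Override
        public InternalRow[] call(InternalRow args) {
          // the custom business logic would go here
          return new InternalRow[0];
        }

        @Override
        public String description() {
          return "example custom procedure";
        }
      }
    }

If users then configure spark.sql.catalog.custom = com.example.MyProcedureCatalog along with the Iceberg SQL extensions, is CALL custom.my_procedure('db.tbl') expected to resolve through the extensions without any Iceberg catalog involved? That is the part I am unsure about.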
Thanks,
Ajantha

On Fri, Nov 12, 2021 at 4:50 PM Ryan Murray <rym...@gmail.com> wrote:

> Thanks Ryan for the response.
>
> Maybe I am misunderstanding here, apologies for that. However, I don't see the code where the Spark extensions can find other procedure catalogs without the user having to configure and reference another catalog.
>
> Thinking about it more, I think the goal of this discussion is to find a way for operators and vendors to expose a set of procedures to end users without resorting to special end-user config or forks. Currently there is no way to turn off, replace, or add to the set of procedures shipped with the Iceberg runtime jar. Some of the use cases I envision here: users with permission to append to a table who shouldn't be running maintenance procedures, or a custom compaction job rather than the one shipped in Iceberg. The only option as I see it is to add new ProcedureCatalogs and hope end users don't run the existing procedures shipped with the catalog they are already using to read/write data.
>
> Best,
> Ryan
>
> On Thu, Nov 11, 2021 at 9:10 PM Ryan Blue <b...@tabular.io> wrote:
>
>> I think there's a bit of a misunderstanding here. You shouldn't need to extend Iceberg's SparkCatalog to plug in stored procedures. The Iceberg Spark extensions should support stored procedures exposed by any catalog plugin that implements `ProcedureCatalog` across the Spark versions where Iceberg has stored procedures. Since the API should be nearly the same, it will be easy to update when Spark supports `ProcedureCatalog` directly.
>>
>> That translates to less code for Iceberg to manage and no long-term debt supporting procedures plugged in through Iceberg instead of through a Spark interface.
>>
>> On Thu, Nov 11, 2021 at 8:08 AM Ryan Murray <rym...@gmail.com> wrote:
>>
>>> Hey Ryan,
>>>
>>> What is the timeline for ProcedureCatalog to be moved into Spark, and will it be backported? I agree 100% that it's the 'correct' way to go long term, but currently Iceberg has a `static final Map`[1] of valid procedures and no way for users to customize it. I personally don't love a static map regardless of wanting to register custom procedures; it's too error-prone for maintainers and procedure developers (especially now that there are 3 versions of it in the codebase). Additionally, asking teams to extend SparkCatalog, with all the associated Spark config changes for end users, just to add custom procedures seems a bit heavy-handed compared to the relatively small change of adding a registration mechanism to the existing ProcedureCatalog. This also unblocks teams using <=Spark 3.2 (and whatever future Spark versions ship before the ProcedureCatalog is upstreamed).
>>>
>>> There already appears to be a private fork of Iceberg using ServiceLoader[2] and a draft PR using a similar approach[3]. I agree with your comments on [3], and I am wondering if there is a middle ground: a registration method in line with Spark, and/or a way for Iceberg catalogs to specify their procedures (though I am not sure how to do that in a cross-engine way). My goal here is to avoid having custom implementations of SparkCatalog for everyone who may be interested in adding their own procedures. What do you think?
>>>
>>> Best,
>>> Ryan
>>>
>>> [1] https://github.com/apache/iceberg/blob/master/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/procedures/SparkProcedures.java#L29
>>> [2] https://github.com/apache/iceberg/issues/3254#issuecomment-943845848
>>> [3] https://github.com/apache/iceberg/pull/3367
>>>
>>> On Thu, Nov 11, 2021 at 2:03 AM Ryan Blue <b...@tabular.io> wrote:
>>>
>>>> I think that probably the best way to handle this use case is to have people implement the Iceberg `ProcedureCatalog` API. That's what we want to get upstream into Spark, and it is a really reasonable (and small) addition to Spark.
>>>>
>>>> The problem with adding pluggable procedures to Iceberg is that it is really working around the fact that Spark doesn't support plugging in procedures yet. This is specific to Spark, and we would have to keep it alive well past when we get `ProcedureCatalog` upstream. It doesn't seem worth the additional complexity in Iceberg when you can plug in through the API intended to be Spark's own plugin API, if that makes sense.
>>>>
>>>> Ryan
>>>>
>>>> On Wed, Nov 10, 2021 at 6:54 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>>>>
>>>>> Hi Community!
>>>>>
>>>>> If Iceberg provided a capability to plug in procedures, it would be really helpful for users to plug in their own Spark actions to handle their business logic around Iceberg tables. So, can we have a mechanism that allows plugging in additional implementations of *org.apache.spark.sql.connector.iceberg.catalog.Procedure* for all users of SparkCatalog and SparkSessionCatalog by just dropping in an additional jar?
>>>>>
>>>>> Without this feature, users can still add custom procedures by extending *SparkCatalog* and/or *SparkSessionCatalog* and overriding *loadProcedure*, which requires users to configure the subclasses of Spark[Session]Catalog in their Spark configuration. That is a lot of work and not a clean way to handle this.
>>>>>
>>>>> Another option is to add these custom procedures as UDFs, but UDFs are meant to be column-related. It doesn't make sense to have a UDF for a Spark action.
>>>>>
>>>>> *So, I want to know what most of you think about having pluggable procedures in Iceberg. Does this feature solve your problems too?*
>>>>>
>>>>> Thanks,
>>>>> Ajantha
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>
>> --
>> Ryan Blue
>> Tabular
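P.S. For comparison, the extend-SparkCatalog workaround discussed in the quoted mails above would look roughly like this. Again only a sketch with made-up names, building on the MyProcedure class from my earlier sketch; I believe the super.loadProcedure fallback keeps the built-in procedures working, but I have not verified it.

    package com.example;

    import org.apache.spark.sql.catalyst.analysis.NoSuchProcedureException;
    import org.apache.spark.sql.connector.catalog.Identifier;
    import org.apache.spark.sql.connector.iceberg.catalog.Procedure;

    // Hypothetical subclass of Iceberg's SparkCatalog that adds one custom
    // procedure and falls back to the built-in ones for everything else.
    public class CustomSparkCatalog extends org.apache.iceberg.spark.SparkCatalog {
      @Override
      public Procedure loadProcedure(Identifier ident) throws NoSuchProcedureException {
        // Iceberg's built-in procedures live under the 'system' namespace,
        // so this sketch reuses it for the custom procedure as well
        if (isSystem(ident) && "my_procedure".equals(ident.name())) {
          return new MyProcedureCatalog.MyProcedure();
        }
        return super.loadProcedure(ident);
      }

      private static boolean isSystem(Identifier ident) {
        return ident.namespace().length == 1
            && "system".equalsIgnoreCase(ident.namespace()[0]);
      }
    }

End users then have to set spark.sql.catalog.my_catalog = com.example.CustomSparkCatalog instead of org.apache.iceberg.spark.SparkCatalog, which is exactly the per-user configuration burden this thread is trying to avoid.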