Hey Ryan,

Thanks for the suggestion. That makes a lot of sense. The immediate use
case for Nessie is to supply a Nessie-ified version of the expire
snapshots action which considers all branches. To be honest, this is
something that can likely be (partially) merged into Iceberg once the
branching/tagging work is further along, but I don't think it makes sense
to couple the procedure work too strongly to the branching proposal yet.
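To sketch the shape of that action (purely hypothetical; nothing like it
exists in Iceberg or Nessie today): loop over the branches, expire each
branch's old snapshots, and defer physical file deletion, since a file
dropped from one branch may still be reachable from another. This assumes
Nessie's "table@branch" identifier convention and the standard Iceberg
ExpireSnapshots action API; the class and method names are made up.

    import java.util.List;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.catalog.Catalog;
    import org.apache.iceberg.catalog.TableIdentifier;
    import org.apache.iceberg.spark.actions.SparkActions;

    // Hypothetical: expire old snapshots on every branch of a Nessie table.
    public class BranchAwareExpiry {
      public static void expireAllBranches(Catalog catalog, String db, String table,
                                           List<String> branches, long olderThanMillis) {
        for (String branch : branches) {
          // Load the table as it exists on this branch (Nessie "table@branch" syntax).
          Table onBranch = catalog.loadTable(TableIdentifier.of(db, table + "@" + branch));
          SparkActions.get()
              .expireSnapshots(onBranch)
              .expireOlderThan(olderThanMillis)
              // Defer deletes: only files unreachable from *every* branch are safe
              // to remove, and that cross-branch check is what the real action adds.
              .deleteWith(path -> { /* collect for a cross-branch reachability check */ })
              .execute();
        }
      }
    }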
I will amend my PR above to reflect this suggestion. I guess you are
thinking the SparkCatalog would ask the Iceberg catalog whether it
implements SupportsProcedures and, if it does, look up a procedure there
before going to the static collection?

Best,
Ryan

On Fri, Nov 19, 2021 at 11:46 AM Ryan Blue <b...@tabular.io> wrote:

> Thanks for the additional context. #1 makes more sense now. What are you
> thinking about exposing that isn't in the standard set of procedures? Are
> there procedures that you'd want for an Iceberg table when using Nessie
> as the catalog? Would it work to add a SupportsProcedures interface on
> the Iceberg catalog so you can plug in through there? That sounds like
> the cleanest way to do what you're talking about, at least to me.
>
> For modifying the existing procedures, we could look at plugging in
> through the Iceberg catalog rather than directly in the Spark catalog as
> well.
>
> On Fri, Nov 19, 2021 at 12:13 AM Ryan Murray <rym...@gmail.com> wrote:
>
>> Hey Ryan,
>>
>> Thanks for the follow-up. As I see it, the use cases are as follows:
>>
>> 1) Add more Iceberg-specific procedures to the existing catalog. This
>> is for when we are actually operating on Iceberg tables (not "to expose
>> custom stored procedures for non-Iceberg systems").
>> 2) Modify existing OSS Iceberg procedures, e.g. custom versions of
>> compaction, or a migrate-table procedure that also rewrites the data.
>> 3) Disable existing OSS Iceberg procedures, e.g. regular users
>> shouldn't be running the snapshot action, or ensuring that regular
>> users with write permissions can't also execute potentially destructive
>> actions (e.g. expire).
>>
>> For use cases two and three I see (today) no option other than to
>> extend org.apache.iceberg.spark.SparkCatalog, and for the first use
>> case we can either extend SparkCatalog or implement a whole new
>> ProcedureCatalog as you suggested above. I think both options are
>> rather counter-intuitive for users, as they have to register extra or
>> different catalogs, and just following our (Iceberg's) online
>> documentation isn't going to work for them. Additionally, following the
>> online documentation still leads them to potentially executing the
>> wrong procedures or destructive procedures.
>>
>> I have raised [1] to discuss concrete ways to address this.
>>
>> Best,
>> Ryan
>>
>> [1] https://github.com/apache/iceberg/pull/3579
>>
>> On Thu, Nov 18, 2021 at 10:51 PM Ryan Blue <b...@tabular.io> wrote:
>>
>>> I don't see the code where the spark extensions can find other
>>> procedure catalogs w/o the user having to configure and reference
>>> another catalog.
>>>
>>> Yes, that's right. If other systems want to add stored procedures,
>>> then they would need to add a catalog. Is there a strong use case
>>> around adding more procedures to Iceberg's catalogs themselves? Maybe
>>> I just don't understand your use case?
>>>
>>> I don't see the problem with registering a catalog to expose custom
>>> stored procedures for non-Iceberg systems. But let's start talking
>>> about a specific case that you have in mind.
>>>
>>> Maybe some examples of using ProcedureCatalog by an external project
>>> by just having a dependency on the Iceberg jars:
>>>
>>> 1. Implement ProcedureCatalog
>>> 2. Configure a Spark catalog that uses your ProcedureCatalog
>>> implementation (spark.sql.catalog.pcat = com.company.ProcedureCatalog)
>>> 3. Call procedures using that catalog: CALL pcat.some_function()
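To make those steps concrete, a minimal sketch of steps 1 and 2, against
the ProcedureCatalog and Procedure interfaces that ship with Iceberg's
Spark extensions in org.apache.spark.sql.connector.iceberg.catalog. The
procedure itself is a do-nothing placeholder, and the exact signatures
may differ between Iceberg versions, so treat this as approximate:

    import org.apache.spark.sql.catalyst.InternalRow;
    import org.apache.spark.sql.catalyst.analysis.NoSuchProcedureException;
    import org.apache.spark.sql.connector.catalog.Identifier;
    import org.apache.spark.sql.connector.iceberg.catalog.Procedure;
    import org.apache.spark.sql.connector.iceberg.catalog.ProcedureCatalog;
    import org.apache.spark.sql.connector.iceberg.catalog.ProcedureParameter;
    import org.apache.spark.sql.types.StructType;
    import org.apache.spark.sql.util.CaseInsensitiveStringMap;

    public class CompanyProcedureCatalog implements ProcedureCatalog {
      private String catalogName;

      @Override
      public void initialize(String name, CaseInsensitiveStringMap options) {
        this.catalogName = name;  // Spark passes the configured catalog name
      }

      @Override
      public String name() {
        return catalogName;
      }

      @Override
      public Procedure loadProcedure(Identifier ident) throws NoSuchProcedureException {
        if ("some_function".equals(ident.name())) {
          return new Procedure() {
            @Override
            public ProcedureParameter[] parameters() {
              return new ProcedureParameter[0];  // takes no arguments
            }

            @Override
            public StructType outputType() {
              return new StructType();  // returns no rows
            }

            @Override
            public InternalRow[] call(InternalRow args) {
              // custom business logic runs here
              return new InternalRow[0];
            }
          };
        }
        throw new NoSuchProcedureException(ident);
      }
    }

With spark.sql.catalog.pcat = com.company.CompanyProcedureCatalog set
(step 2), step 3's CALL pcat.some_function() resolves through the
loadProcedure method above.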
>>>
>>> On Fri, Nov 12, 2021 at 3:20 AM Ryan Murray <rym...@gmail.com> wrote:
>>>
>>>> Thanks, Ryan, for the response.
>>>>
>>>> Maybe I am misunderstanding here; apologies for that. However, I
>>>> don't see the code where the Spark extensions can find other
>>>> procedure catalogs without the user having to configure and reference
>>>> another catalog.
>>>>
>>>> Thinking about it more, I think the goal of this discussion is to
>>>> find a way for operators and vendors to expose a set of procedures to
>>>> end users without resorting to special end-user config or forks.
>>>> Currently there is no way to turn off, replace, or add to the set of
>>>> procedures shipped with the Iceberg runtime jar. Some of the use
>>>> cases I envision here are: users with permission to append to a table
>>>> who shouldn't be running maintenance procedures, or a custom
>>>> compaction job rather than the one shipped in Iceberg. The only
>>>> option as I see it is to add new ProcedureCatalogs and hope end users
>>>> don't run the existing procedures that are shipped with the catalog
>>>> they are already using to read/write data.
>>>>
>>>> Best,
>>>> Ryan
>>>>
>>>> On Thu, Nov 11, 2021 at 9:10 PM Ryan Blue <b...@tabular.io> wrote:
>>>>
>>>>> I think there's a bit of a misunderstanding here. You shouldn't need
>>>>> to extend Iceberg's SparkCatalog to plug in stored procedures. The
>>>>> Iceberg Spark extensions should support stored procedures exposed by
>>>>> any catalog plugin that implements `ProcedureCatalog` across the
>>>>> Spark versions where Iceberg has stored procedures. Since the API
>>>>> should be nearly the same, it will be easy to update when Spark
>>>>> supports `ProcedureCatalog` directly.
>>>>>
>>>>> That translates to less code for Iceberg to manage and no long-term
>>>>> debt supporting procedures plugged in through Iceberg instead of
>>>>> through a Spark interface.
>>>>>
>>>>> On Thu, Nov 11, 2021 at 8:08 AM Ryan Murray <rym...@gmail.com> wrote:
>>>>>
>>>>>> Hey Ryan,
>>>>>>
>>>>>> What is the timeline for ProcedureCatalog to be moved into Spark,
>>>>>> and will it be backported? I agree 100% that it's the "correct" way
>>>>>> to go long term, but currently Iceberg has a `static final Map` [1]
>>>>>> of valid procedures and no way for users to customize it. I
>>>>>> personally don't love a static map regardless of wanting to
>>>>>> register custom procedures; it's too error-prone for maintainers
>>>>>> and procedure developers (especially now that there are 3 versions
>>>>>> of it in the codebase). Additionally, asking teams to extend
>>>>>> SparkCatalog, with all the associated Spark config changes for end
>>>>>> users, just to add custom procedures seems a bit heavy-handed
>>>>>> compared to the relatively small change of adding a registration
>>>>>> mechanism to the existing ProcedureCatalog. This also unblocks
>>>>>> teams using Spark <= 3.2 (and whatever future Spark versions ship
>>>>>> before ProcedureCatalog is upstreamed).
>>>>>>
>>>>>> There already appears to be a private fork of Iceberg using
>>>>>> ServiceLoader [2] and a draft PR using something similar [3]. I
>>>>>> agree with your comments on [3], and I am wondering if there is a
>>>>>> middle ground with a registration method in line with Spark, and/or
>>>>>> a way for Iceberg catalogs to specify their procedures (though I am
>>>>>> not sure how to do so in a cross-engine way). My goal here is to
>>>>>> avoid having custom implementations of SparkCatalog for everyone
>>>>>> who may be interested in adding their own procedures. What do you
>>>>>> think?
>>>>>>
>>>>>> Best,
>>>>>> Ryan
>>>>>>
>>>>>> [1] https://github.com/apache/iceberg/blob/master/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/procedures/SparkProcedures.java#L29
>>>>>> [2] https://github.com/apache/iceberg/issues/3254#issuecomment-943845848
>>>>>> [3] https://github.com/apache/iceberg/pull/3367
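The ServiceLoader approach mentioned just above (see [2] and [3]) could
look roughly like the following. The ProcedureProvider interface and the
registry are hypothetical; Iceberg's SparkProcedures is a plain static
map today (see [1]), and the real proposals differ in their details:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.ServiceLoader;
    import java.util.function.Supplier;
    import org.apache.spark.sql.connector.iceberg.catalog.Procedure;

    // Implementors ship a jar containing an implementation of this interface
    // plus a META-INF/services/com.example.ProcedureProvider file naming it.
    public interface ProcedureProvider {
      Map<String, Supplier<Procedure>> procedures();
    }

    final class ProcedureRegistry {
      // The built-in procedure map would be merged with whatever providers
      // the ServiceLoader discovers on the classpath.
      static Map<String, Supplier<Procedure>> discover() {
        Map<String, Supplier<Procedure>> registry = new HashMap<>();
        for (ProcedureProvider provider : ServiceLoader.load(ProcedureProvider.class)) {
          registry.putAll(provider.procedures());
        }
        return registry;
      }
    }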
>>>>>>
>>>>>> On Thu, Nov 11, 2021 at 2:03 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>
>>>>>>> I think that probably the best way to handle this use case is to
>>>>>>> have people implement the Iceberg `ProcedureCatalog` API. That's
>>>>>>> what we want to get upstream into Spark, and it is a really
>>>>>>> reasonable (and small) addition to Spark.
>>>>>>>
>>>>>>> The problem with adding pluggable procedures to Iceberg is that it
>>>>>>> is really working around the fact that Spark doesn't support
>>>>>>> plugging in procedures yet. This is specific to Spark, and we
>>>>>>> would have to keep it alive well past when we get
>>>>>>> `ProcedureCatalog` upstream. It doesn't seem worth the additional
>>>>>>> complexity in Iceberg when you can plug in through the API
>>>>>>> intended to be Spark's own plugin API, if that makes sense.
>>>>>>>
>>>>>>> Ryan
>>>>>>>
>>>>>>> On Wed, Nov 10, 2021 at 6:54 AM Ajantha Bhat
>>>>>>> <ajanthab...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Community!
>>>>>>>>
>>>>>>>> If Iceberg provided a capability to plug in procedures, it would
>>>>>>>> be really helpful for users to plug in their own Spark actions to
>>>>>>>> handle their business logic around Iceberg tables. So, can we
>>>>>>>> have a mechanism that allows plugging in additional
>>>>>>>> implementations of
>>>>>>>> org.apache.spark.sql.connector.iceberg.catalog.Procedure for all
>>>>>>>> users of SparkCatalog and SparkSessionCatalog by just dropping in
>>>>>>>> an additional jar?
>>>>>>>>
>>>>>>>> Without this feature, users can still add custom procedures by
>>>>>>>> extending SparkCatalog and/or SparkSessionCatalog and overriding
>>>>>>>> loadProcedure, which requires users to configure the subclasses
>>>>>>>> of Spark[Session]Catalog in their Spark configuration. That is a
>>>>>>>> lot of work, and it is not a clean way to handle this.
>>>>>>>>
>>>>>>>> Another option is to add these custom procedures as UDFs, but
>>>>>>>> UDFs are meant to be column-related. It doesn't make sense to use
>>>>>>>> a UDF for Spark actions.
>>>>>>>>
>>>>>>>> So, I want to know what most of you think about having pluggable
>>>>>>>> procedures in Iceberg. Does this feature solve your problems too?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Ajantha
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Tabular
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>
> --
> Ryan Blue
> Tabular
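Finally, the workaround Ajantha describes (extend SparkCatalog and
override loadProcedure) looks roughly like this. "custom_compact" and
CustomSparkCatalog are made-up names, the procedure body is a
placeholder, and the override signature should be checked against your
Iceberg version:

    import org.apache.iceberg.spark.SparkCatalog;
    import org.apache.spark.sql.catalyst.InternalRow;
    import org.apache.spark.sql.catalyst.analysis.NoSuchProcedureException;
    import org.apache.spark.sql.connector.catalog.Identifier;
    import org.apache.spark.sql.connector.iceberg.catalog.Procedure;
    import org.apache.spark.sql.connector.iceberg.catalog.ProcedureParameter;
    import org.apache.spark.sql.types.StructType;

    public class CustomSparkCatalog extends SparkCatalog {
      @Override
      public Procedure loadProcedure(Identifier ident) throws NoSuchProcedureException {
        if ("custom_compact".equals(ident.name())) {
          return new Procedure() {
            @Override
            public ProcedureParameter[] parameters() {
              return new ProcedureParameter[0];
            }

            @Override
            public StructType outputType() {
              return new StructType();
            }

            @Override
            public InternalRow[] call(InternalRow args) {
              // a custom compaction implementation would run here
              return new InternalRow[0];
            }
          };
        }
        return super.loadProcedure(ident);  // fall back to the built-in procedures
      }
    }

Users then still have to configure the subclass, e.g.
spark.sql.catalog.my_catalog = com.company.CustomSparkCatalog, and ship
the jar, which is exactly the per-user configuration burden this thread
is trying to avoid.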