Yeah, that sounds about right. I'm not sure how we would want to do it exactly, but I think the catalog would be able to override the procedures it wants to.
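Roughly what I'm picturing, purely as a sketch to make the idea concrete (the interface name, the method signature, and where it would live are all placeholders, not a worked-out API):

    import org.apache.spark.sql.connector.iceberg.catalog.Procedure;

    // Sketch only: an opt-in mix-in that an Iceberg catalog could implement to supply
    // its own procedures. Name and signature are placeholders for discussion.
    public interface SupportsProcedures {
      // Return the procedure to use for this name, or null if the catalog doesn't
      // provide one, in which case the Spark catalog would fall back to the
      // built-in static set in SparkProcedures.
      Procedure loadProcedure(String name);
    }

Then SparkCatalog's loadProcedure would check whether its underlying Iceberg catalog implements this and consult it before the static collection, which I think matches what you described. How a catalog would disable a built-in procedure (rather than add or replace one) still needs to be worked out.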
On Fri, Nov 26, 2021 at 9:44 AM Ryan Murray <rym...@gmail.com> wrote:
> Hey Ryan,
>
> Thanks for the suggestion. That makes a lot of sense. The immediate use case for Nessie is to supply a Nessie-ified version of the expire snapshots action that considers all branches. To be honest, this is something that can likely be (partially) merged into Iceberg once the branching/tagging work is further along, but I don't think it makes sense to couple the procedure work too strongly to the branching proposal yet.
>
> I will amend my PR above to reflect this suggestion. I guess you are thinking the SparkCatalog would ask the icebergCatalog whether it implements SupportsProcedures and, if it does, look up a procedure there before going to the static collection?
>
> Best,
> Ryan
>
> On Fri, Nov 19, 2021 at 11:46 AM Ryan Blue <b...@tabular.io> wrote:
>
>> Thanks for the additional context. #1 makes more sense now. What are you thinking about exposing that isn't in the standard set of procedures? Are there procedures that you'd want for an Iceberg table when using Nessie as the catalog? Would it work to add a SupportsProcedures interface on the Iceberg catalog so you can plug in through there? That sounds like the cleanest way to do what you're talking about, at least to me.
>>
>> For modifying the existing procedures, we could look at plugging in through the Iceberg catalog rather than directly in the Spark catalog as well.
>>
>> On Fri, Nov 19, 2021 at 12:13 AM Ryan Murray <rym...@gmail.com> wrote:
>>
>>> Hey Ryan,
>>>
>>> Thanks for the follow-up. As I see it, the use cases are as follows:
>>>
>>> 1) Add more Iceberg-specific procedures to the existing catalog. This is when we are actually operating on Iceberg tables (not "to expose custom stored procedures for non-Iceberg systems").
>>> 2) Modify existing OSS Iceberg procedures, e.g. custom versions of compaction or a migrate-table procedure that also rewrites the data.
>>> 3) Disable existing OSS Iceberg procedures, e.g. regular users shouldn't be running the snapshot action, or regular users with write permissions shouldn't be able to execute potentially destructive actions (e.g. expire).
>>>
>>> For use cases two and three I see (today) no option other than to extend org.apache.iceberg.spark.SparkCatalog, and for the first use case we can either extend SparkCatalog or implement a whole new ProcedureCatalog as you suggested above. I think both options are rather counter-intuitive for users, as they have to register extra or different catalogs, and just following our (Iceberg's) online documentation isn't going to work for them. Additionally, following the online documentation still leads them to potentially executing the wrong procedures or destructive procedures.
>>>
>>> I have raised [1] to discuss concrete ways to address this.
>>>
>>> Best,
>>> Ryan
>>>
>>> [1] https://github.com/apache/iceberg/pull/3579
>>>
>>> On Thu, Nov 18, 2021 at 10:51 PM Ryan Blue <b...@tabular.io> wrote:
>>>
>>>> I don’t see the code where the spark extensions can find other procedure catalogs w/o the user having to configure and reference another catalog.
>>>>
>>>> Yes, that’s right. If other systems want to add stored procedures, then they would need to add a catalog. Is there a strong use case around adding more procedures to Iceberg’s catalogs themselves? Maybe I just don’t understand your use case?
>>>>
>>>> I don’t see the problem with registering a catalog to expose custom stored procedures for non-Iceberg systems. But let’s start talking about a specific case that you have in mind.
>>>>
>>>> Maybe some examples of using ProcedureCatalog by an external project by just having a dependency on the Iceberg jars:
>>>>
>>>> 1. Implement ProcedureCatalog
>>>> 2. Configure a Spark catalog that uses your ProcedureCatalog implementation (spark.sql.catalog.pcat = com.company.ProcedureCatalog)
>>>> 3. Call procedures using that catalog: CALL pcat.some_function()
>>>>
>>>> On Fri, Nov 12, 2021 at 3:20 AM Ryan Murray <rym...@gmail.com> wrote:
>>>>
>>>>> Thanks Ryan for the response.
>>>>>
>>>>> Maybe I am misunderstanding here, apologies for that. However, I don't see the code where the spark extensions can find other procedure catalogs w/o the user having to configure and reference another catalog.
>>>>>
>>>>> Thinking about it more, I think the goal of this discussion is to find a way for operators and vendors to expose a set of procedures to end users w/o resorting to special end-user config or forks. Currently there is no way to turn off, replace, or add to the set of procedures shipped with the Iceberg runtime jar. Some of the use cases I envision here are: users with permission to append to a table who shouldn't be running maintenance procedures, or a custom compaction job rather than the one shipped in Iceberg. The only option as I see it is to add new ProcedureCatalogs and hope end users don't run the existing procedures that are shipped with the catalog they are already using to read/write data.
>>>>>
>>>>> Best,
>>>>> Ryan
>>>>>
>>>>> On Thu, Nov 11, 2021 at 9:10 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>
>>>>>> I think there's a bit of a misunderstanding here. You shouldn't need to extend Iceberg's SparkCatalog to plug in stored procedures. The Iceberg Spark extensions should support stored procedures exposed by any catalog plugin that implements `ProcedureCatalog` across the Spark versions where Iceberg has stored procedures. Since the API should be nearly the same, it will be easy to update when Spark supports `ProcedureCatalog` directly.
>>>>>>
>>>>>> That translates to less code for Iceberg to manage and no long-term debt supporting procedures plugged in through Iceberg instead of through a Spark interface.
>>>>>>
>>>>>> On Thu, Nov 11, 2021 at 8:08 AM Ryan Murray <rym...@gmail.com> wrote:
>>>>>>
>>>>>>> Hey Ryan,
>>>>>>>
>>>>>>> What is the timeline for ProcedureCatalog to be moved into Spark, and will it be backported? I agree 100% that it's the 'correct' way to go long term, but currently Iceberg has a `static final Map`[1] of valid procedures and no way for users to customize that. I personally don't love a static map regardless of wanting to register custom procedures; it's too error prone for maintainers and procedure developers (especially now that there are 3 versions of it in the codebase). Additionally, asking teams to extend SparkCatalog, with all the associated Spark config changes for end users, just to add custom procedures seems a bit heavy-handed compared to the relatively small change of adding a registration mechanism to the existing ProcedureCatalog.
>>>>>>> This also unblocks teams using Spark <= 3.2 (and whatever future Spark versions ship before the ProcedureCatalog is upstreamed).
>>>>>>>
>>>>>>> There already appears to be a private fork of Iceberg using ServiceLoader[2] and a draft PR using a similar approach[3]. I agree with your comments on [3], and I am wondering if there is a middle ground with a registration method in line with Spark and/or a way for Iceberg catalogs to specify their procedures (though I am not sure how to do so in a cross-engine way). My goal here is to avoid having custom implementations of SparkCatalog for everyone who may be interested in adding their own procedures. What do you think?
>>>>>>>
>>>>>>> Best,
>>>>>>> Ryan
>>>>>>>
>>>>>>> [1] https://github.com/apache/iceberg/blob/master/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/procedures/SparkProcedures.java#L29
>>>>>>> [2] https://github.com/apache/iceberg/issues/3254#issuecomment-943845848
>>>>>>> [3] https://github.com/apache/iceberg/pull/3367
>>>>>>>
>>>>>>> On Thu, Nov 11, 2021 at 2:03 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>
>>>>>>>> I think that probably the best way to handle this use case is to have people implement the Iceberg `ProcedureCatalog` API. That's what we want to get upstream into Spark and is a really reasonable (and small) addition to Spark.
>>>>>>>>
>>>>>>>> The problem with adding pluggable procedures to Iceberg is that it is really working around the fact that Spark doesn't support plugging in procedures yet. This is specific to Spark and we would have to keep it alive well past when we get `ProcedureCatalog` upstream. It doesn't seem worth the additional complexity in Iceberg, when you can plug in through the API intended to be Spark's own plugin API, if that makes sense.
>>>>>>>>
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>> On Wed, Nov 10, 2021 at 6:54 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Community!
>>>>>>>>>
>>>>>>>>> If Iceberg provides the capability to plug in procedures, it will be really helpful for users to plug in their own Spark actions to handle their business logic around Iceberg tables. So, can we have a mechanism that allows plugging in additional implementations of org.apache.spark.sql.connector.iceberg.catalog.Procedure for all users of SparkCatalog and SparkSessionCatalog by just dropping in an additional jar?
>>>>>>>>>
>>>>>>>>> Without this feature, users can still add their custom procedures by extending SparkCatalog and/or SparkSessionCatalog and overriding loadProcedure, but that requires users to configure the subclasses of Spark[Session]Catalog in their Spark configuration. That is a lot of work and not a clean way to handle this.
>>>>>>>>>
>>>>>>>>> Another option is to add these custom procedures as UDFs, but UDFs are meant to be column-related; it doesn't make sense to use a UDF for Spark actions.
>>>>>>>>>
>>>>>>>>> So, I want to know what most of you think about having pluggable procedures in Iceberg?
>>>>>>>>> Does this feature solve your problems too?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Ajantha
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Tabular
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>
>>
>> --
>> Ryan Blue
>> Tabular
>

--
Ryan Blue
Tabular