I've commented on the PR with a possible way forward. It's a bit awkward because of the reliance on Spark imports; I would prefer to return names or Class objects rather than the actual `Procedure` from the Iceberg Catalog.
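
Roughly, I'm picturing something like this (a sketch only; the interface
name and method shape are hypothetical, nothing like this exists in
Iceberg yet):

    // Hypothetical Iceberg-side interface: returning implementation class
    // names (rather than Spark's Procedure type) keeps iceberg-core free of
    // Spark imports; the Spark integration reflectively loads each class
    // and verifies that it implements the Procedure interface.
    public interface SupportsProcedures {
      // procedure name -> fully qualified implementation class name
      java.util.Map<String, String> procedureImplementations();
    }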

Let me know what you think!

Best,
Ryan

On Mon, Nov 29, 2021 at 2:04 PM Ryan Blue <b...@tabular.io> wrote:

> Yeah, that sounds about right. I'm not sure how we would want to do it
> exactly, but I think the catalog would be able to override the procedures
> it wants to.
>
> On Fri, Nov 26, 2021 at 9:44 AM Ryan Murray <rym...@gmail.com> wrote:
>
>> Hey Ryan,
>>
>> Thanks for the suggestion. That makes a lot of sense. The immediate use
>> case for Nessie is to supply a Nessie-ified version of the
>> expire-snapshots action that considers all branches. To be honest, this
>> is something that can likely be (partially) merged into Iceberg once the
>> branching/tagging work is further along, but I don't think it makes sense
>> to couple the procedure work too strongly to the branching proposal yet.
>>
>> I will amend my PR above to reflect this suggestion. I guess you are
>> thinking the SparkCatalog would ask the Iceberg catalog whether it
>> implements SupportsProcedures and, if it does, look up a procedure there
>> before going to the static collection?
>>
>> Best,
>> Ryan
>>
>> On Fri, Nov 19, 2021 at 11:46 AM Ryan Blue <b...@tabular.io> wrote:
>>
>>> Thanks for the additional context. #1 makes more sense now. What are
>>> you thinking about exposing that isn't in the standard set of
>>> procedures? Are there procedures that you'd want for an Iceberg table
>>> when using Nessie as the catalog? Would it work to add a
>>> SupportsProcedures interface on the Iceberg catalog so you can plug in
>>> through there? That sounds like the cleanest way to do what you're
>>> talking about, at least to me.
>>>
>>> For modifying the existing procedures, we could look at plugging in
>>> through the Iceberg catalog rather than directly in the Spark catalog
>>> as well.
>>>
>>> On Fri, Nov 19, 2021 at 12:13 AM Ryan Murray <rym...@gmail.com> wrote:
>>>
>>>> Hey Ryan,
>>>>
>>>> Thanks for the follow-up. As I see it, the use cases are as follows:
>>>>
>>>> 1) Add more Iceberg-specific procedures to the existing catalog. This
>>>> is for when we are actually operating on Iceberg tables (not "to
>>>> expose custom stored procedures for non-Iceberg systems").
>>>> 2) Modify existing OSS Iceberg procedures, e.g. custom versions of
>>>> compaction, or a migrate-table procedure that also rewrites the data.
>>>> 3) Disable existing OSS Iceberg procedures, e.g. because regular users
>>>> shouldn't be running the snapshot action, or to ensure that regular
>>>> users with write permissions can't also execute potentially
>>>> destructive actions (e.g. expire).
>>>>
>>>> For use cases two and three I see (today) no option other than
>>>> extending org.apache.iceberg.spark.SparkCatalog, and for the first use
>>>> case we can either extend SparkCatalog or implement a whole new
>>>> ProcedureCatalog as you suggested above. I think both options are
>>>> rather counter-intuitive for users, as they have to register extra or
>>>> different catalogs, and just following our (Iceberg's) online
>>>> documentation isn't going to work for them. Worse, following the
>>>> online documentation could still lead them to execute the wrong
>>>> procedures, or destructive ones.
>>>>
>>>> I have raised [1] to discuss concrete ways to address this.
>>>>
>>>> Best,
>>>> Ryan
>>>>
>>>> [1] https://github.com/apache/iceberg/pull/3579
>>>>
>>>> On Thu, Nov 18, 2021 at 10:51 PM Ryan Blue <b...@tabular.io> wrote:
>>>>
>>>>> I don't see the code where the spark extensions can find other
>>>>> procedure catalogs w/o the user having to configure and reference
>>>>> another catalog.
>>>>>
>>>>> Yes, that's right. If other systems want to add stored procedures,
>>>>> then they would need to add a catalog. Is there a strong use case
>>>>> around adding more procedures to Iceberg's catalogs themselves? Maybe
>>>>> I just don't understand your use case?
>>>>>
>>>>> I don't see the problem with registering a catalog to expose custom
>>>>> stored procedures for non-Iceberg systems. But let's start talking
>>>>> about a specific case that you have in mind.
>>>>>
>>>>> Maybe some examples of using ProcedureCatalog by an external project
>>>>> by just having a dependency on the Iceberg jars:
>>>>>
>>>>> 1. Implement ProcedureCatalog
>>>>> 2. Configure a Spark catalog that uses your ProcedureCatalog
>>>>>    implementation (spark.sql.catalog.pcat = com.company.ProcedureCatalog)
>>>>> 3. Call procedures using that catalog: CALL pcat.some_function()
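>>>>>
>>>>> For step 1, a minimal sketch (CompanyProcedureCatalog and
>>>>> SomeFunctionProcedure are made-up names; the imports are the
>>>>> interfaces shipped in the Iceberg Spark module):
>>>>>
>>>>>     import org.apache.spark.sql.catalyst.analysis.NoSuchProcedureException;
>>>>>     import org.apache.spark.sql.connector.catalog.Identifier;
>>>>>     import org.apache.spark.sql.connector.iceberg.catalog.Procedure;
>>>>>     import org.apache.spark.sql.connector.iceberg.catalog.ProcedureCatalog;
>>>>>     import org.apache.spark.sql.util.CaseInsensitiveStringMap;
>>>>>
>>>>>     public class CompanyProcedureCatalog implements ProcedureCatalog {
>>>>>       private String name;
>>>>>
>>>>>       @Override
>>>>>       public void initialize(String name, CaseInsensitiveStringMap options) {
>>>>>         this.name = name;  // ProcedureCatalog extends CatalogPlugin
>>>>>       }
>>>>>
>>>>>       @Override
>>>>>       public String name() {
>>>>>         return name;
>>>>>       }
>>>>>
>>>>>       @Override
>>>>>       public Procedure loadProcedure(Identifier ident) throws NoSuchProcedureException {
>>>>>         // look up the requested procedure by name
>>>>>         if ("some_function".equals(ident.name())) {
>>>>>           return new SomeFunctionProcedure();  // hypothetical Procedure impl
>>>>>         }
>>>>>         throw new NoSuchProcedureException(ident);
>>>>>       }
>>>>>     }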
>>>>>
>>>>> On Fri, Nov 12, 2021 at 3:20 AM Ryan Murray <rym...@gmail.com> wrote:
>>>>>
>>>>>> Thanks, Ryan, for the response.
>>>>>>
>>>>>> Maybe I am misunderstanding here, and apologies if so. However, I
>>>>>> don't see the code where the Spark extensions can find other
>>>>>> procedure catalogs without the user having to configure and
>>>>>> reference another catalog.
>>>>>>
>>>>>> Thinking about it more, I think the goal of this discussion is to
>>>>>> find a way for operators and vendors to expose a set of procedures
>>>>>> to end users without resorting to special end-user config or forks.
>>>>>> Currently there is no way to turn off, replace, or add to the set of
>>>>>> procedures shipped with the Iceberg runtime jar. Some of the use
>>>>>> cases I envision here: users with permission to append to a table
>>>>>> who shouldn't be running maintenance procedures, or a custom
>>>>>> compaction job rather than the one shipped in Iceberg. The only
>>>>>> option as I see it is to add new ProcedureCatalogs and hope end
>>>>>> users don't run the existing procedures that ship with the catalog
>>>>>> they are already using to read/write data.
>>>>>>
>>>>>> Best,
>>>>>> Ryan
>>>>>>
>>>>>> On Thu, Nov 11, 2021 at 9:10 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>
>>>>>>> I think there's a bit of a misunderstanding here. You shouldn't
>>>>>>> need to extend Iceberg's SparkCatalog to plug in stored procedures.
>>>>>>> The Iceberg Spark extensions should support stored procedures
>>>>>>> exposed by any catalog plugin that implements `ProcedureCatalog`
>>>>>>> across the Spark versions where Iceberg has stored procedures.
>>>>>>> Since the API should be nearly the same, it will be easy to update
>>>>>>> when Spark supports `ProcedureCatalog` directly.
>>>>>>>
>>>>>>> That translates to less code for Iceberg to manage and no long-term
>>>>>>> debt supporting procedures plugged in through Iceberg instead of
>>>>>>> through a Spark interface.
>>>>>>>
>>>>>>> On Thu, Nov 11, 2021 at 8:08 AM Ryan Murray <rym...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hey Ryan,
>>>>>>>>
>>>>>>>> What is the timeline for ProcedureCatalog to be moved into Spark,
>>>>>>>> and will it be backported? I agree 100% that it's the "correct"
>>>>>>>> way to go long term, but currently Iceberg has a `static final
>>>>>>>> Map`[1] of valid procedures and no way for users to customize it.
>>>>>>>> I personally don't love a static map regardless of wanting to
>>>>>>>> register custom procedures; it's too error-prone for maintainers
>>>>>>>> and procedure developers (especially now that there are three
>>>>>>>> versions of it in the codebase). Additionally, asking teams to
>>>>>>>> extend SparkCatalog, with all the associated Spark config changes
>>>>>>>> for end users, just to add custom procedures seems a bit
>>>>>>>> heavy-handed compared to the relatively small change of adding a
>>>>>>>> registration mechanism to the existing ProcedureCatalog. This also
>>>>>>>> unblocks teams using <=Spark 3.2 (and whatever future Spark
>>>>>>>> versions ship before the ProcedureCatalog is upstreamed).
>>>>>>>>
>>>>>>>> There already appears to be a private fork of Iceberg using
>>>>>>>> ServiceLoader[2] and a draft PR doing something similar[3]. I
>>>>>>>> agree with your comments on [3], and I am wondering if there is a
>>>>>>>> middle ground: a registration method in line with Spark, and/or a
>>>>>>>> way for Iceberg catalogs to specify their procedures (though I am
>>>>>>>> not sure how to do that in a cross-engine way). My goal here is to
>>>>>>>> avoid custom implementations of SparkCatalog for everyone who may
>>>>>>>> be interested in adding their own procedures. What do you think?
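>>>>>>>>
>>>>>>>> To make the registration idea concrete, a rough sketch (this
>>>>>>>> SparkProcedureProvider SPI is made up, not an existing Iceberg or
>>>>>>>> Spark API):
>>>>>>>>
>>>>>>>>     // Provider jars would register implementations in
>>>>>>>>     // META-INF/services/com.example.SparkProcedureProvider
>>>>>>>>     public interface SparkProcedureProvider {
>>>>>>>>       String name();          // procedure name exposed to CALL statements
>>>>>>>>       Procedure procedure();  // the Procedure implementation to add
>>>>>>>>     }
>>>>>>>>
>>>>>>>>     // Discovery side, e.g. merged with the built-in static map at
>>>>>>>>     // class-load time (uses java.util.ServiceLoader):
>>>>>>>>     static Map<String, Procedure> discoverProcedures() {
>>>>>>>>       Map<String, Procedure> extra = new HashMap<>();
>>>>>>>>       for (SparkProcedureProvider p : ServiceLoader.load(SparkProcedureProvider.class)) {
>>>>>>>>         extra.put(p.name(), p.procedure());
>>>>>>>>       }
>>>>>>>>       return extra;
>>>>>>>>     }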
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>> [1] https://github.com/apache/iceberg/blob/master/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/procedures/SparkProcedures.java#L29
>>>>>>>> [2] https://github.com/apache/iceberg/issues/3254#issuecomment-943845848
>>>>>>>> [3] https://github.com/apache/iceberg/pull/3367
>>>>>>>>
>>>>>>>> On Thu, Nov 11, 2021 at 2:03 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>>
>>>>>>>>> I think that probably the best way to handle this use case is to
>>>>>>>>> have people implement the Iceberg `ProcedureCatalog` API. That's
>>>>>>>>> what we want to get upstream into Spark, and it is a really
>>>>>>>>> reasonable (and small) addition to Spark.
>>>>>>>>>
>>>>>>>>> The problem with adding pluggable procedures to Iceberg is that
>>>>>>>>> it is really working around the fact that Spark doesn't support
>>>>>>>>> plugging in procedures yet. This is specific to Spark, and we
>>>>>>>>> would have to keep it alive well past when we get
>>>>>>>>> `ProcedureCatalog` upstream. It doesn't seem worth the additional
>>>>>>>>> complexity in Iceberg when you can plug in through the API
>>>>>>>>> intended to be Spark's own plugin API, if that makes sense.
>>>>>>>>>
>>>>>>>>> Ryan
>>>>>>>>>
>>>>>>>>> On Wed, Nov 10, 2021 at 6:54 AM Ajantha Bhat
>>>>>>>>> <ajanthab...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Community!
>>>>>>>>>>
>>>>>>>>>> If Iceberg provided a capability to plug in procedures, it would
>>>>>>>>>> be really helpful for users to plug in their own Spark actions
>>>>>>>>>> to handle their business logic around Iceberg tables. So, can we
>>>>>>>>>> have a mechanism that allows plugging in additional
>>>>>>>>>> implementations of
>>>>>>>>>> org.apache.spark.sql.connector.iceberg.catalog.Procedure for all
>>>>>>>>>> users of SparkCatalog and SparkSessionCatalog by just dropping
>>>>>>>>>> in an additional jar?
>>>>>>>>>>
>>>>>>>>>> Without this feature, users can still add their custom
>>>>>>>>>> procedures by extending SparkCatalog and/or SparkSessionCatalog
>>>>>>>>>> and overriding loadProcedure, which requires users to configure
>>>>>>>>>> the subclasses of Spark[Session]Catalog in their Spark
>>>>>>>>>> configuration. That is a lot of work, and it is not a clean way
>>>>>>>>>> to handle this.
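>>>>>>>>>>
>>>>>>>>>> For example, a rough sketch of that workaround
>>>>>>>>>> (CustomSparkCatalog and CustomCompactionProcedure are made-up
>>>>>>>>>> names):
>>>>>>>>>>
>>>>>>>>>>     import org.apache.spark.sql.catalyst.analysis.NoSuchProcedureException;
>>>>>>>>>>     import org.apache.spark.sql.connector.catalog.Identifier;
>>>>>>>>>>     import org.apache.spark.sql.connector.iceberg.catalog.Procedure;
>>>>>>>>>>
>>>>>>>>>>     public class CustomSparkCatalog extends org.apache.iceberg.spark.SparkCatalog {
>>>>>>>>>>       @Override
>>>>>>>>>>       public Procedure loadProcedure(Identifier ident) throws NoSuchProcedureException {
>>>>>>>>>>         if ("custom_compaction".equals(ident.name())) {
>>>>>>>>>>           return new CustomCompactionProcedure();  // hypothetical Procedure impl
>>>>>>>>>>         }
>>>>>>>>>>         return super.loadProcedure(ident);  // fall back to built-in procedures
>>>>>>>>>>       }
>>>>>>>>>>     }
>>>>>>>>>>
>>>>>>>>>>     // and in the Spark config:
>>>>>>>>>>     // spark.sql.catalog.my_catalog = com.example.CustomSparkCatalog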
>>>>>>>>>>
>>>>>>>>>> Another option is to add these custom procedures as UDFs, but
>>>>>>>>>> UDFs are meant to be column-related. It doesn't make sense to
>>>>>>>>>> use a UDF for a Spark action.
>>>>>>>>>>
>>>>>>>>>> So, I want to know what most of you think about having pluggable
>>>>>>>>>> procedures in Iceberg. Does this feature solve your problems
>>>>>>>>>> too?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Ajantha
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Tabular
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Tabular
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
>
> --
> Ryan Blue
> Tabular
>