Yeah, that sounds about right. I'm not sure how we would want to do it exactly, but I think the catalog would be able to override the procedures it wants to.
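Roughly what I'm picturing, purely as a sketch to make the idea concrete (the interface name, the method signature, and where it would live are all placeholders, not a worked-out API):

    import org.apache.spark.sql.connector.iceberg.catalog.Procedure;

    // Sketch only: an opt-in mix-in that an Iceberg catalog could implement to supply
    // its own procedures. Name and signature are placeholders for discussion.
    public interface SupportsProcedures {
      // Return the procedure to use for this name, or null if the catalog doesn't
      // provide one, in which case the Spark catalog would fall back to the
      // built-in static set in SparkProcedures.
      Procedure loadProcedure(String name);
    }

Then SparkCatalog's loadProcedure would check whether its underlying Iceberg catalog implements this and consult it before the static collection, which I think matches what you described. How a catalog would disable a built-in procedure (rather than add or replace one) still needs to be worked out.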
On Fri, Nov 26, 2021 at 9:44 AM Ryan Murray <rym...@gmail.com> wrote:
> Hey Ryan,
>
> Thanks for the suggestion. That makes a lot of sense. The immediate use case for Nessie is to supply a Nessie-ified version of the expire snapshots action that considers all branches. To be honest, this is something that can likely be (partially) merged into Iceberg once the branching/tagging work is further along, but I don't think it makes sense to couple the procedure work too strongly to the branching proposal yet.
>
> I will amend my PR above to reflect this suggestion. I guess you are thinking the SparkCatalog would ask the icebergCatalog whether it implements SupportsProcedures and, if it does, look up a procedure there before going to the static collection?
>
> Best,
> Ryan
>
> On Fri, Nov 19, 2021 at 11:46 AM Ryan Blue <b...@tabular.io> wrote:
>
>> Thanks for the additional context. #1 makes more sense now. What are you thinking about exposing that isn't in the standard set of procedures? Are there procedures that you'd want for an Iceberg table when using Nessie as the catalog? Would it work to add a SupportsProcedures interface on the Iceberg catalog so you can plug in through there? That sounds like the cleanest way to do what you're talking about, at least to me.
>>
>> For modifying the existing procedures, we could look at plugging in through the Iceberg catalog rather than directly in the Spark catalog as well.
>>
>> On Fri, Nov 19, 2021 at 12:13 AM Ryan Murray <rym...@gmail.com> wrote:
>>
>>> Hey Ryan,
>>>
>>> Thanks for the follow-up. As I see it, the use cases are as follows:
>>>
>>> 1) Add more Iceberg-specific procedures to the existing catalog. This is when we are actually operating on Iceberg tables (not "to expose custom stored procedures for non-Iceberg systems").
>>> 2) Modify existing OSS Iceberg procedures, e.g. custom versions of compaction or a migrate-table procedure that also rewrites the data.
>>> 3) Disable existing OSS Iceberg procedures, e.g. regular users shouldn't be running the snapshot action, or regular users with write permissions shouldn't be able to execute potentially destructive actions (e.g. expire).
>>>
>>> For use cases two and three I see (today) no option other than to extend org.apache.iceberg.spark.SparkCatalog, and for the first use case we can either extend SparkCatalog or implement a whole new ProcedureCatalog as you suggested above. I think both options are rather counter-intuitive for users, as they have to register extra or different catalogs, and just following our (Iceberg's) online documentation isn't going to work for them. Additionally, following the online documentation still leads them to potentially executing the wrong procedures or destructive procedures.
>>>
>>> I have raised [1] to discuss concrete ways to address this.
>>>
>>> Best,
>>> Ryan
>>>
>>> [1] https://github.com/apache/iceberg/pull/3579
>>>
>>> On Thu, Nov 18, 2021 at 10:51 PM Ryan Blue <b...@tabular.io> wrote:
>>>
>>>> I don’t see the code where the spark extensions can find other procedure catalogs w/o the user having to configure and reference another catalog.
>>>>
>>>> Yes, that’s right. If other systems want to add stored procedures, then they would need to add a catalog. Is there a strong use case around adding more procedures to Iceberg’s catalogs themselves? Maybe I just don’t understand your use case?
>>>>
>>>> I don’t see the problem with registering a catalog to expose custom stored procedures for non-Iceberg systems. But let’s start talking about a specific case that you have in mind.
>>>>
>>>> Maybe some examples of using ProcedureCatalog by an external project by just having a dependency on the Iceberg jars:
>>>>
>>>> 1. Implement ProcedureCatalog
>>>> 2. Configure a Spark catalog that uses your ProcedureCatalog implementation (spark.sql.catalog.pcat = com.company.ProcedureCatalog)
>>>> 3. Call procedures using that catalog: CALL pcat.some_function()
>>>>
>>>> On Fri, Nov 12, 2021 at 3:20 AM Ryan Murray <rym...@gmail.com> wrote:
>>>>
>>>>> Thanks Ryan for the response.
>>>>>
>>>>> Maybe I am misunderstanding here, apologies for that. However, I don't see the code where the spark extensions can find other procedure catalogs w/o the user having to configure and reference another catalog.
>>>>>
>>>>> Thinking about it more, I think the goal of this discussion is to find a way for operators and vendors to expose a set of procedures to end users w/o resorting to special end-user config or forks. Currently there is no way to turn off, replace, or add to the set of procedures shipped with the Iceberg runtime jar. Some of the use cases I envision here are: users with permission to append to a table who shouldn't be running maintenance procedures, or a custom compaction job rather than the one shipped in Iceberg. The only option as I see it is to add new ProcedureCatalogs and hope end users don't run the existing procedures that are shipped with the catalog they are already using to read/write data.
>>>>>
>>>>> Best,
>>>>> Ryan
>>>>>
>>>>> On Thu, Nov 11, 2021 at 9:10 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>
>>>>>> I think there's a bit of a misunderstanding here. You shouldn't need to extend Iceberg's SparkCatalog to plug in stored procedures. The Iceberg Spark extensions should support stored procedures exposed by any catalog plugin that implements `ProcedureCatalog` across the Spark versions where Iceberg has stored procedures. Since the API should be nearly the same, it will be easy to update when Spark supports `ProcedureCatalog` directly.
>>>>>>
>>>>>> That translates to less code for Iceberg to manage and no long-term debt supporting procedures plugged in through Iceberg instead of through a Spark interface.
>>>>>>
>>>>>> On Thu, Nov 11, 2021 at 8:08 AM Ryan Murray <rym...@gmail.com> wrote:
>>>>>>
>>>>>>> Hey Ryan,
>>>>>>>
>>>>>>> What is the timeline for ProcedureCatalog to be moved into Spark, and will it be backported? I agree 100% that it's the 'correct' way to go long term, but currently Iceberg has a `static final Map`[1] of valid procedures and no way for users to customize that. I personally don't love a static map regardless of wanting to register custom procedures; it's too error prone for maintainers and procedure developers (especially now that there are 3 versions of it in the codebase). Additionally, asking teams to extend SparkCatalog, with all the associated Spark config changes for end users, just to add custom procedures seems a bit heavy-handed compared to the relatively small change of adding a registration mechanism to the existing ProcedureCatalog.
>>>>>>> This also unblocks teams using Spark <= 3.2 (and whatever future Spark versions ship before the ProcedureCatalog is upstreamed).
>>>>>>>
>>>>>>> There already appears to be a private fork of Iceberg using ServiceLoader[2] and a draft PR using a similar approach[3]. I agree with your comments on [3], and I am wondering if there is a middle ground with a registration method in line with Spark and/or a way for Iceberg catalogs to specify their procedures (though I am not sure how to do so in a cross-engine way). My goal here is to avoid having custom implementations of SparkCatalog for everyone who may be interested in adding their own procedures. What do you think?
>>>>>>>
>>>>>>> Best,
>>>>>>> Ryan
>>>>>>>
>>>>>>> [1] https://github.com/apache/iceberg/blob/master/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/procedures/SparkProcedures.java#L29
>>>>>>> [2] https://github.com/apache/iceberg/issues/3254#issuecomment-943845848
>>>>>>> [3] https://github.com/apache/iceberg/pull/3367
>>>>>>>
>>>>>>> On Thu, Nov 11, 2021 at 2:03 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>
>>>>>>>> I think that probably the best way to handle this use case is to have people implement the Iceberg `ProcedureCatalog` API. That's what we want to get upstream into Spark and is a really reasonable (and small) addition to Spark.
>>>>>>>>
>>>>>>>> The problem with adding pluggable procedures to Iceberg is that it is really working around the fact that Spark doesn't support plugging in procedures yet. This is specific to Spark and we would have to keep it alive well past when we get `ProcedureCatalog` upstream. It doesn't seem worth the additional complexity in Iceberg, when you can plug in through the API intended to be Spark's own plugin API, if that makes sense.
>>>>>>>>
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>> On Wed, Nov 10, 2021 at 6:54 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Community!
>>>>>>>>>
>>>>>>>>> If Iceberg provides the capability to plug in procedures, it will be really helpful for users to plug in their own Spark actions to handle their business logic around Iceberg tables. So, can we have a mechanism that allows plugging in additional implementations of org.apache.spark.sql.connector.iceberg.catalog.Procedure for all users of SparkCatalog and SparkSessionCatalog by just dropping in an additional jar?
>>>>>>>>>
>>>>>>>>> Without this feature, users can still add their custom procedures by extending SparkCatalog and/or SparkSessionCatalog and overriding loadProcedure, but that requires users to configure the subclasses of Spark[Session]Catalog in their Spark configuration. That is a lot of work and not a clean way to handle this.
>>>>>>>>>
>>>>>>>>> Another option is to add these custom procedures as UDFs, but UDFs are meant to be column-related; it doesn't make sense to use a UDF for Spark actions.
>>>>>>>>>
>>>>>>>>> So, I want to know what most of you think about having pluggable procedures in Iceberg?
>>>>>>>>> Does this feature solve your problems too?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Ajantha
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Tabular
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>
>>
>> --
>> Ryan Blue
>> Tabular
>

--
Ryan Blue
Tabular