I've commented on the PR with a possible way forward. It's a bit awkward because of the reliance on Spark imports; I would prefer to return names or Class objects rather than the actual `Procedure` from the Iceberg Catalog.
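
Roughly, I'm picturing something like this (a sketch only; the interface
name and method shape are hypothetical, nothing like this exists in
Iceberg yet):

    // Hypothetical Iceberg-side interface: returning implementation class
    // names (rather than Spark's Procedure type) keeps iceberg-core free of
    // Spark imports; the Spark integration reflectively loads each class
    // and verifies that it implements the Procedure interface.
    public interface SupportsProcedures {
      // procedure name -> fully qualified implementation class name
      java.util.Map<String, String> procedureImplementations();
    }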

Let me know what you think!

Best,
Ryan

On Mon, Nov 29, 2021 at 2:04 PM Ryan Blue <b...@tabular.io> wrote:

> Yeah, that sounds about right. I'm not sure how we would want to do it
> exactly, but I think the catalog would be able to override the procedures
> it wants to.
>
> On Fri, Nov 26, 2021 at 9:44 AM Ryan Murray <rym...@gmail.com> wrote:
>
>> Hey Ryan,
>>
>> Thanks for the suggestion. That makes a lot of sense. The immediate use
>> case for Nessie is to supply a Nessie-ified version of the
>> expire-snapshots action that considers all branches. To be honest, this
>> is something that can likely be (partially) merged into Iceberg once the
>> branching/tagging work is further along, but I don't think it makes sense
>> to couple the procedure work too strongly to the branching proposal yet.
>>
>> I will amend my PR above to reflect this suggestion. I guess you are
>> thinking the SparkCatalog would ask the Iceberg catalog whether it
>> implements SupportsProcedures and, if it does, look up a procedure there
>> before going to the static collection?
>>
>> Best,
>> Ryan
>>
>> On Fri, Nov 19, 2021 at 11:46 AM Ryan Blue <b...@tabular.io> wrote:
>>
>>> Thanks for the additional context. #1 makes more sense now. What are
>>> you thinking about exposing that isn't in the standard set of
>>> procedures? Are there procedures that you'd want for an Iceberg table
>>> when using Nessie as the catalog? Would it work to add a
>>> SupportsProcedures interface on the Iceberg catalog so you can plug in
>>> through there? That sounds like the cleanest way to do what you're
>>> talking about, at least to me.
>>>
>>> For modifying the existing procedures, we could look at plugging in
>>> through the Iceberg catalog rather than directly in the Spark catalog
>>> as well.
>>>
>>> On Fri, Nov 19, 2021 at 12:13 AM Ryan Murray <rym...@gmail.com> wrote:
>>>
>>>> Hey Ryan,
>>>>
>>>> Thanks for the follow-up. As I see it, the use cases are as follows:
>>>>
>>>> 1) Add more Iceberg-specific procedures to the existing catalog. This
>>>> is for when we are actually operating on Iceberg tables (not "to
>>>> expose custom stored procedures for non-Iceberg systems").
>>>> 2) Modify existing OSS Iceberg procedures, e.g. custom versions of
>>>> compaction, or a migrate-table procedure that also rewrites the data.
>>>> 3) Disable existing OSS Iceberg procedures, e.g. because regular users
>>>> shouldn't be running the snapshot action, or to ensure that regular
>>>> users with write permissions can't also execute potentially
>>>> destructive actions (e.g. expire).
>>>>
>>>> For use cases two and three I see (today) no option other than
>>>> extending org.apache.iceberg.spark.SparkCatalog, and for the first use
>>>> case we can either extend SparkCatalog or implement a whole new
>>>> ProcedureCatalog as you suggested above. I think both options are
>>>> rather counter-intuitive for users, as they have to register extra or
>>>> different catalogs, and just following our (Iceberg's) online
>>>> documentation isn't going to work for them. Worse, following the
>>>> online documentation could still lead them to execute the wrong
>>>> procedures, or destructive ones.
>>>>
>>>> I have raised [1] to discuss concrete ways to address this.
>>>>
>>>> Best,
>>>> Ryan
>>>>
>>>> [1] https://github.com/apache/iceberg/pull/3579
>>>>
>>>> On Thu, Nov 18, 2021 at 10:51 PM Ryan Blue <b...@tabular.io> wrote:
>>>>
>>>>> I don't see the code where the spark extensions can find other
>>>>> procedure catalogs w/o the user having to configure and reference
>>>>> another catalog.
>>>>>
>>>>> Yes, that's right. If other systems want to add stored procedures,
>>>>> then they would need to add a catalog. Is there a strong use case
>>>>> around adding more procedures to Iceberg's catalogs themselves? Maybe
>>>>> I just don't understand your use case?
>>>>>
>>>>> I don't see the problem with registering a catalog to expose custom
>>>>> stored procedures for non-Iceberg systems. But let's start talking
>>>>> about a specific case that you have in mind.
>>>>>
>>>>> Maybe some examples of using ProcedureCatalog by an external project
>>>>> by just having a dependency on the Iceberg jars:
>>>>>
>>>>> 1. Implement ProcedureCatalog
>>>>> 2. Configure a Spark catalog that uses your ProcedureCatalog
>>>>>    implementation (spark.sql.catalog.pcat = com.company.ProcedureCatalog)
>>>>> 3. Call procedures using that catalog: CALL pcat.some_function()
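>>>>>
>>>>> For step 1, a minimal sketch (CompanyProcedureCatalog and
>>>>> SomeFunctionProcedure are made-up names; the imports are the
>>>>> interfaces shipped in the Iceberg Spark module):
>>>>>
>>>>>     import org.apache.spark.sql.catalyst.analysis.NoSuchProcedureException;
>>>>>     import org.apache.spark.sql.connector.catalog.Identifier;
>>>>>     import org.apache.spark.sql.connector.iceberg.catalog.Procedure;
>>>>>     import org.apache.spark.sql.connector.iceberg.catalog.ProcedureCatalog;
>>>>>     import org.apache.spark.sql.util.CaseInsensitiveStringMap;
>>>>>
>>>>>     public class CompanyProcedureCatalog implements ProcedureCatalog {
>>>>>       private String name;
>>>>>
>>>>>       @Override
>>>>>       public void initialize(String name, CaseInsensitiveStringMap options) {
>>>>>         this.name = name;  // ProcedureCatalog extends CatalogPlugin
>>>>>       }
>>>>>
>>>>>       @Override
>>>>>       public String name() {
>>>>>         return name;
>>>>>       }
>>>>>
>>>>>       @Override
>>>>>       public Procedure loadProcedure(Identifier ident) throws NoSuchProcedureException {
>>>>>         // look up the requested procedure by name
>>>>>         if ("some_function".equals(ident.name())) {
>>>>>           return new SomeFunctionProcedure();  // hypothetical Procedure impl
>>>>>         }
>>>>>         throw new NoSuchProcedureException(ident);
>>>>>       }
>>>>>     }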
>>>>>
>>>>> On Fri, Nov 12, 2021 at 3:20 AM Ryan Murray <rym...@gmail.com> wrote:
>>>>>
>>>>>> Thanks, Ryan, for the response.
>>>>>>
>>>>>> Maybe I am misunderstanding here, and apologies if so. However, I
>>>>>> don't see the code where the Spark extensions can find other
>>>>>> procedure catalogs without the user having to configure and
>>>>>> reference another catalog.
>>>>>>
>>>>>> Thinking about it more, I think the goal of this discussion is to
>>>>>> find a way for operators and vendors to expose a set of procedures
>>>>>> to end users without resorting to special end-user config or forks.
>>>>>> Currently there is no way to turn off, replace, or add to the set of
>>>>>> procedures shipped with the Iceberg runtime jar. Some of the use
>>>>>> cases I envision here: users with permission to append to a table
>>>>>> who shouldn't be running maintenance procedures, or a custom
>>>>>> compaction job rather than the one shipped in Iceberg. The only
>>>>>> option as I see it is to add new ProcedureCatalogs and hope end
>>>>>> users don't run the existing procedures that ship with the catalog
>>>>>> they are already using to read/write data.
>>>>>>
>>>>>> Best,
>>>>>> Ryan
>>>>>>
>>>>>> On Thu, Nov 11, 2021 at 9:10 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>
>>>>>>> I think there's a bit of a misunderstanding here. You shouldn't
>>>>>>> need to extend Iceberg's SparkCatalog to plug in stored procedures.
>>>>>>> The Iceberg Spark extensions should support stored procedures
>>>>>>> exposed by any catalog plugin that implements `ProcedureCatalog`
>>>>>>> across the Spark versions where Iceberg has stored procedures.
>>>>>>> Since the API should be nearly the same, it will be easy to update
>>>>>>> when Spark supports `ProcedureCatalog` directly.
>>>>>>>
>>>>>>> That translates to less code for Iceberg to manage and no long-term
>>>>>>> debt supporting procedures plugged in through Iceberg instead of
>>>>>>> through a Spark interface.
>>>>>>>
>>>>>>> On Thu, Nov 11, 2021 at 8:08 AM Ryan Murray <rym...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hey Ryan,
>>>>>>>>
>>>>>>>> What is the timeline for ProcedureCatalog to be moved into Spark,
>>>>>>>> and will it be backported? I agree 100% that it's the "correct"
>>>>>>>> way to go long term, but currently Iceberg has a `static final
>>>>>>>> Map`[1] of valid procedures and no way for users to customize it.
>>>>>>>> I personally don't love a static map regardless of wanting to
>>>>>>>> register custom procedures; it's too error-prone for maintainers
>>>>>>>> and procedure developers (especially now that there are three
>>>>>>>> versions of it in the codebase). Additionally, asking teams to
>>>>>>>> extend SparkCatalog, with all the associated Spark config changes
>>>>>>>> for end users, just to add custom procedures seems a bit
>>>>>>>> heavy-handed compared to the relatively small change of adding a
>>>>>>>> registration mechanism to the existing ProcedureCatalog. This also
>>>>>>>> unblocks teams using <=Spark 3.2 (and whatever future Spark
>>>>>>>> versions ship before the ProcedureCatalog is upstreamed).
>>>>>>>>
>>>>>>>> There already appears to be a private fork of Iceberg using
>>>>>>>> ServiceLoader[2] and a draft PR doing something similar[3]. I
>>>>>>>> agree with your comments on [3], and I am wondering if there is a
>>>>>>>> middle ground: a registration method in line with Spark, and/or a
>>>>>>>> way for Iceberg catalogs to specify their procedures (though I am
>>>>>>>> not sure how to do that in a cross-engine way). My goal here is to
>>>>>>>> avoid custom implementations of SparkCatalog for everyone who may
>>>>>>>> be interested in adding their own procedures. What do you think?
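>>>>>>>>
>>>>>>>> To make the registration idea concrete, a rough sketch (this
>>>>>>>> SparkProcedureProvider SPI is made up, not an existing Iceberg or
>>>>>>>> Spark API):
>>>>>>>>
>>>>>>>>     // Provider jars would register implementations in
>>>>>>>>     // META-INF/services/com.example.SparkProcedureProvider
>>>>>>>>     public interface SparkProcedureProvider {
>>>>>>>>       String name();          // procedure name exposed to CALL statements
>>>>>>>>       Procedure procedure();  // the Procedure implementation to add
>>>>>>>>     }
>>>>>>>>
>>>>>>>>     // Discovery side, e.g. merged with the built-in static map at
>>>>>>>>     // class-load time (uses java.util.ServiceLoader):
>>>>>>>>     static Map<String, Procedure> discoverProcedures() {
>>>>>>>>       Map<String, Procedure> extra = new HashMap<>();
>>>>>>>>       for (SparkProcedureProvider p : ServiceLoader.load(SparkProcedureProvider.class)) {
>>>>>>>>         extra.put(p.name(), p.procedure());
>>>>>>>>       }
>>>>>>>>       return extra;
>>>>>>>>     }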
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>> [1] https://github.com/apache/iceberg/blob/master/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/procedures/SparkProcedures.java#L29
>>>>>>>> [2] https://github.com/apache/iceberg/issues/3254#issuecomment-943845848
>>>>>>>> [3] https://github.com/apache/iceberg/pull/3367
>>>>>>>>
>>>>>>>> On Thu, Nov 11, 2021 at 2:03 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>>
>>>>>>>>> I think that probably the best way to handle this use case is to
>>>>>>>>> have people implement the Iceberg `ProcedureCatalog` API. That's
>>>>>>>>> what we want to get upstream into Spark, and it is a really
>>>>>>>>> reasonable (and small) addition to Spark.
>>>>>>>>>
>>>>>>>>> The problem with adding pluggable procedures to Iceberg is that
>>>>>>>>> it is really working around the fact that Spark doesn't support
>>>>>>>>> plugging in procedures yet. This is specific to Spark, and we
>>>>>>>>> would have to keep it alive well past when we get
>>>>>>>>> `ProcedureCatalog` upstream. It doesn't seem worth the additional
>>>>>>>>> complexity in Iceberg when you can plug in through the API
>>>>>>>>> intended to be Spark's own plugin API, if that makes sense.
>>>>>>>>>
>>>>>>>>> Ryan
>>>>>>>>>
>>>>>>>>> On Wed, Nov 10, 2021 at 6:54 AM Ajantha Bhat
>>>>>>>>> <ajanthab...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Community!
>>>>>>>>>>
>>>>>>>>>> If Iceberg provided a capability to plug in procedures, it would
>>>>>>>>>> be really helpful for users to plug in their own Spark actions
>>>>>>>>>> to handle their business logic around Iceberg tables. So, can we
>>>>>>>>>> have a mechanism that allows plugging in additional
>>>>>>>>>> implementations of
>>>>>>>>>> org.apache.spark.sql.connector.iceberg.catalog.Procedure for all
>>>>>>>>>> users of SparkCatalog and SparkSessionCatalog by just dropping
>>>>>>>>>> in an additional jar?
>>>>>>>>>>
>>>>>>>>>> Without this feature, users can still add their custom
>>>>>>>>>> procedures by extending SparkCatalog and/or SparkSessionCatalog
>>>>>>>>>> and overriding loadProcedure, which requires users to configure
>>>>>>>>>> the subclasses of Spark[Session]Catalog in their Spark
>>>>>>>>>> configuration. That is a lot of work, and it is not a clean way
>>>>>>>>>> to handle this.
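>>>>>>>>>>
>>>>>>>>>> For example, a rough sketch of that workaround
>>>>>>>>>> (CustomSparkCatalog and CustomCompactionProcedure are made-up
>>>>>>>>>> names):
>>>>>>>>>>
>>>>>>>>>>     import org.apache.spark.sql.catalyst.analysis.NoSuchProcedureException;
>>>>>>>>>>     import org.apache.spark.sql.connector.catalog.Identifier;
>>>>>>>>>>     import org.apache.spark.sql.connector.iceberg.catalog.Procedure;
>>>>>>>>>>
>>>>>>>>>>     public class CustomSparkCatalog extends org.apache.iceberg.spark.SparkCatalog {
>>>>>>>>>>       @Override
>>>>>>>>>>       public Procedure loadProcedure(Identifier ident) throws NoSuchProcedureException {
>>>>>>>>>>         if ("custom_compaction".equals(ident.name())) {
>>>>>>>>>>           return new CustomCompactionProcedure();  // hypothetical Procedure impl
>>>>>>>>>>         }
>>>>>>>>>>         return super.loadProcedure(ident);  // fall back to built-in procedures
>>>>>>>>>>       }
>>>>>>>>>>     }
>>>>>>>>>>
>>>>>>>>>>     // and in the Spark config:
>>>>>>>>>>     // spark.sql.catalog.my_catalog = com.example.CustomSparkCatalog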
>>>>>>>>>>
>>>>>>>>>> Another option is to add these custom procedures as UDFs, but
>>>>>>>>>> UDFs are meant to be column-related. It doesn't make sense to
>>>>>>>>>> use a UDF for a Spark action.
>>>>>>>>>>
>>>>>>>>>> So, I want to know what most of you think about having pluggable
>>>>>>>>>> procedures in Iceberg. Does this feature solve your problems
>>>>>>>>>> too?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Ajantha
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Tabular
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Tabular
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
>
> --
> Ryan Blue
> Tabular
>