Hey Ryan,

Thanks for the suggestion. That makes a lot of sense. The immediate use
case for Nessie is to supply a Nessie-ified version of the expire-snapshots
action that considers all branches. To be honest, this could likely be
(partially) merged into Iceberg once the branching/tagging work is further
along, but I don't think it makes sense to couple the procedure work too
strongly to the branching proposal yet.

I will amend my PR above to reflect this suggestion. I guess you are
thinking the SparkCatalog would ask the Iceberg catalog whether it
implements SupportsProcedures and, if it does, look up a procedure there
before falling back to the static collection?
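
For concreteness, here is a rough sketch of the lookup I mean. Everything
here is an assumption rather than settled API: SupportsProcedures does not
exist yet, and the icebergCatalog field is a stand-in for however
SparkCatalog reaches its wrapped Iceberg catalog. The imports follow the
classes currently shipped in the Iceberg Spark module.

import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.spark.SparkCatalog;
import org.apache.spark.sql.catalyst.analysis.NoSuchProcedureException;
import org.apache.spark.sql.connector.catalog.Identifier;
import org.apache.spark.sql.connector.iceberg.catalog.Procedure;

// Hypothetical mixin that an Iceberg catalog (e.g. NessieCatalog) could
// implement to supply its own procedures. Name and shape are assumptions.
interface SupportsProcedures {
  // Return the procedure for this identifier, or null if the catalog
  // does not define one.
  Procedure loadProcedure(Identifier ident);
}

// The lookup order, written as a SparkCatalog subclass for illustration;
// in a real patch this logic would live in SparkCatalog.loadProcedure.
class DelegatingSparkCatalog extends SparkCatalog {
  // Stand-in for the wrapped Iceberg catalog; assume it is set during
  // initialize().
  private Catalog icebergCatalog;

  @Override
  public Procedure loadProcedure(Identifier ident)
      throws NoSuchProcedureException {
    if (icebergCatalog instanceof SupportsProcedures) {
      Procedure procedure =
          ((SupportsProcedures) icebergCatalog).loadProcedure(ident);
      if (procedure != null) {
        return procedure; // the catalog-supplied procedure wins
      }
    }
    return super.loadProcedure(ident); // fall back to the static collection
  }
}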

Best,
Ryan

On Fri, Nov 19, 2021 at 11:46 AM Ryan Blue <b...@tabular.io> wrote:

> Thanks for the additional context. #1 makes more sense now. What are you
> thinking about exposing that isn't in the standard set of procedures? Are
> there procedures that you'd want for an Iceberg table when using Nessie as
> the catalog? Would it work to add a SupportsProcedures interface on the
> Iceberg catalog so you can plug in through there? That sounds like the
> cleanest way to do what you're talking about, at least to me.
>
> For modifying the existing procedures, we could look at plugging in
> through the Iceberg catalog rather than directly in the Spark catalog as
> well.
>
> On Fri, Nov 19, 2021 at 12:13 AM Ryan Murray <rym...@gmail.com> wrote:
>
>> Hey Ryan,
>>
>> Thanks for the follow-up. As I see it, the use cases are as follows:
>>
>> 1) Add more Iceberg-specific procedures to the existing catalog. This is
>> when we are actually operating on Iceberg tables (not "to expose custom
>> stored procedures for non-Iceberg systems").
>> 2) Modify existing OSS Iceberg procedures, e.g. custom versions of
>> compaction or a migrate-table procedure that also rewrites the data.
>> 3) Disable existing OSS Iceberg procedures, e.g. regular users shouldn't
>> be running the snapshot action, and regular users with write permissions
>> shouldn't be able to execute potentially destructive actions (e.g. expire).
>>
>> For use cases two and three, I see (today) no option other than extending
>> org.apache.iceberg.spark.SparkCatalog; for the first use case we can
>> either extend SparkCatalog or implement a whole new ProcedureCatalog as
>> you suggested above. I think both options are rather counter-intuitive
>> for users, as they have to register extra or different catalogs, and
>> simply following our (Iceberg's) online documentation isn't going to work
>> for them. Additionally, following the online documentation could still
>> lead them to execute the wrong procedures or destructive ones.
>>
>> I have raised [1] to discuss concrete ways to address this.
>>
>> Best,
>> Ryan
>>
>> [1] https://github.com/apache/iceberg/pull/3579
>>
>> On Thu, Nov 18, 2021 at 10:51 PM Ryan Blue <b...@tabular.io> wrote:
>>
>>> I don’t see the code where the Spark extensions can find other procedure
>>> catalogs without the user having to configure and reference another
>>> catalog.
>>>
>>> Yes, that’s right. If other systems want to add stored procedures, then
>>> they would need to add a catalog. Is there a strong use case around adding
>>> more procedures to Iceberg’s catalogs themselves? Maybe I just don’t
>>> understand your use case?
>>>
>>> I don’t see the problem with registering a catalog to expose custom
>>> stored procedures for non-Iceberg systems. But let’s start talking about a
>>> specific case that you have in mind.
>>>
>>> Here are the steps for an external project to use ProcedureCatalog with
>>> just a dependency on the Iceberg jars:
>>>
>>>
>>>    1. Implement ProcedureCatalog
>>>    2. Configure a Spark catalog that uses your ProcedureCatalog
>>>    implementation (spark.sql.catalog.pcat = com.company.ProcedureCatalog)
>>>    3. Call procedures using that catalog: CALL pcat.some_function()
>>>
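>>> To make step 1 concrete, here is a minimal sketch of a standalone
>>> ProcedureCatalog. The method names follow the ProcedureCatalog and
>>> Procedure interfaces shipped in the Iceberg Spark module (double-check
>>> the package names against your Iceberg version); the procedure itself is
>>> a no-op placeholder.
>>>
>>> import org.apache.spark.sql.catalyst.InternalRow;
>>> import org.apache.spark.sql.catalyst.analysis.NoSuchProcedureException;
>>> import org.apache.spark.sql.connector.catalog.Identifier;
>>> import org.apache.spark.sql.connector.iceberg.catalog.Procedure;
>>> import org.apache.spark.sql.connector.iceberg.catalog.ProcedureCatalog;
>>> import org.apache.spark.sql.connector.iceberg.catalog.ProcedureParameter;
>>> import org.apache.spark.sql.types.StructType;
>>> import org.apache.spark.sql.util.CaseInsensitiveStringMap;
>>>
>>> public class CompanyProcedureCatalog implements ProcedureCatalog {
>>>   private String catalogName;
>>>
>>>   @Override
>>>   public void initialize(String name, CaseInsensitiveStringMap options) {
>>>     this.catalogName = name; // called by Spark when the catalog loads
>>>   }
>>>
>>>   @Override
>>>   public String name() {
>>>     return catalogName;
>>>   }
>>>
>>>   @Override
>>>   public Procedure loadProcedure(Identifier ident)
>>>       throws NoSuchProcedureException {
>>>     if ("some_function".equalsIgnoreCase(ident.name())) {
>>>       return new Procedure() {
>>>         @Override
>>>         public ProcedureParameter[] parameters() {
>>>           return new ProcedureParameter[0]; // no arguments
>>>         }
>>>
>>>         @Override
>>>         public StructType outputType() {
>>>           return new StructType(); // no result rows
>>>         }
>>>
>>>         @Override
>>>         public InternalRow[] call(InternalRow args) {
>>>           // custom business logic goes here
>>>           return new InternalRow[0];
>>>         }
>>>
>>>         @Override
>>>         public String description() {
>>>           return "example no-op procedure";
>>>         }
>>>       };
>>>     }
>>>     throw new NoSuchProcedureException(ident);
>>>   }
>>> }
>>>
>>> With spark.sql.catalog.pcat = com.company.CompanyProcedureCatalog set as
>>> in step 2, the CALL in step 3 should resolve to this procedure.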
>>>
>>> On Fri, Nov 12, 2021 at 3:20 AM Ryan Murray <rym...@gmail.com> wrote:
>>>
>>>> Thanks Ryan for the response.
>>>>
>>>> Maybe I am misunderstanding here; apologies for that. However, I don't
>>>> see the code where the Spark extensions can find other procedure catalogs
>>>> without the user having to configure and reference another catalog.
>>>>
>>>> Thinking about it more, I think the goal of this discussion is to find a
>>>> way for operators and vendors to expose a set of procedures to end users
>>>> without resorting to special end-user config or forks. Currently there
>>>> is no way to turn off, replace, or add to the set of procedures shipped
>>>> with the Iceberg runtime jar. Some of the use cases I envision here:
>>>> users with permission to append to a table who shouldn't be running
>>>> maintenance procedures, or a custom compaction job replacing the one
>>>> shipped in Iceberg. The only option, as I see it, is to add new
>>>> ProcedureCatalogs and hope end users don't run the existing procedures
>>>> shipped with the catalog they are already using to read/write data.
>>>>
>>>> Best,
>>>> Ryan
>>>>
>>>> On Thu, Nov 11, 2021 at 9:10 PM Ryan Blue <b...@tabular.io> wrote:
>>>>
>>>>> I think there's a bit of a misunderstanding here. You shouldn't need
>>>>> to extend Iceberg's SparkCatalog to plug in stored procedures. The Iceberg
>>>>> Spark extensions should support stored procedures exposed by any catalog
>>>>> plugin that implements `ProcedureCatalog` across the Spark versions where
>>>>> Iceberg has stored procedures. Since the API should be nearly the same, it
>>>>> will be easy to update when Spark supports `ProcedureCatalog` directly.
>>>>>
>>>>> That translates to less code for Iceberg to manage and no long-term
>>>>> debt supporting procedures plugged in through Iceberg instead of through a
>>>>> Spark interface.
>>>>>
>>>>> On Thu, Nov 11, 2021 at 8:08 AM Ryan Murray <rym...@gmail.com> wrote:
>>>>>
>>>>>> Hey Ryan,
>>>>>>
>>>>>> What is the timeline for ProcedureCatalog to be moved into Spark, and
>>>>>> will it be backported? I agree 100% that it's the 'correct' way to go
>>>>>> long term, but currently Iceberg has a `static final Map`[1] of valid
>>>>>> procedures and no way for users to customize it. I personally don't
>>>>>> love a static map regardless of wanting to register custom procedures;
>>>>>> it's too error-prone for maintainers and procedure developers
>>>>>> (especially now that there are 3 versions of it in the codebase).
>>>>>> Additionally, asking teams to extend SparkCatalog, with all the
>>>>>> associated Spark config changes for end users, just to add custom
>>>>>> procedures seems a bit heavy-handed compared to the relatively small
>>>>>> change of adding a registration mechanism to the existing
>>>>>> ProcedureCatalog. This would also unblock teams using Spark <= 3.2 (and
>>>>>> whatever future Spark versions ship before the ProcedureCatalog is
>>>>>> upstreamed).
>>>>>>
>>>>>> There already appears to be a private fork of Iceberg using
>>>>>> ServiceLoader[2] and a draft PR using a similar approach[3]. I agree
>>>>>> with your comments on [3], and I am wondering if there is a middle
>>>>>> ground: a registration method in line with Spark and/or a way for
>>>>>> Iceberg catalogs to specify their procedures (though I am not sure how
>>>>>> to do that in a cross-engine way). My goal here is to avoid custom
>>>>>> implementations of SparkCatalog for everyone who may be interested in
>>>>>> adding their own procedures. What do you think?
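>>>>>>
>>>>>> To make the registration idea concrete, here is a rough sketch of what
>>>>>> a ServiceLoader-based hook could look like. Every name below is
>>>>>> hypothetical, and Supplier<Procedure> stands in for whatever factory
>>>>>> type the static map holds today:
>>>>>>
>>>>>> import java.util.HashMap;
>>>>>> import java.util.Map;
>>>>>> import java.util.ServiceLoader;
>>>>>> import java.util.function.Supplier;
>>>>>> import org.apache.spark.sql.connector.iceberg.catalog.Procedure;
>>>>>>
>>>>>> // Hypothetical SPI: a vendor jar implements this and lists the
>>>>>> // implementation in META-INF/services/<fully.qualified.ProviderName>.
>>>>>> interface SparkProcedureProvider {
>>>>>>   // procedure name -> factory for that procedure
>>>>>>   Map<String, Supplier<Procedure>> procedures();
>>>>>> }
>>>>>>
>>>>>> class ProcedureRegistry {
>>>>>>   // How SparkProcedures could seed its map instead of hard-coding it.
>>>>>>   static Map<String, Supplier<Procedure>> discoverProcedures() {
>>>>>>     Map<String, Supplier<Procedure>> procedures = new HashMap<>();
>>>>>>     // (built-in procedures would be registered here first)
>>>>>>     for (SparkProcedureProvider provider :
>>>>>>         ServiceLoader.load(SparkProcedureProvider.class)) {
>>>>>>       // jars on the classpath can add procedures or replace built-ins
>>>>>>       procedures.putAll(provider.procedures());
>>>>>>     }
>>>>>>     return procedures;
>>>>>>   }
>>>>>> }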
>>>>>>
>>>>>> Best,
>>>>>> Ryan
>>>>>>
>>>>>> [1]
>>>>>> https://github.com/apache/iceberg/blob/master/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/procedures/SparkProcedures.java#L29
>>>>>> [2]
>>>>>> https://github.com/apache/iceberg/issues/3254#issuecomment-943845848
>>>>>> [3] https://github.com/apache/iceberg/pull/3367
>>>>>>
>>>>>> On Thu, Nov 11, 2021 at 2:03 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>
>>>>>>> I think the best way to handle this use case is probably to have
>>>>>>> people implement the Iceberg `ProcedureCatalog` API. That's what we
>>>>>>> want to get upstream into Spark, and it's a really reasonable (and
>>>>>>> small) addition to Spark.
>>>>>>>
>>>>>>> The problem with adding pluggable procedures to Iceberg is that it
>>>>>>> really works around the fact that Spark doesn't support plugging in
>>>>>>> procedures yet. This is specific to Spark, and we would have to keep
>>>>>>> it alive well past when we get `ProcedureCatalog` upstream. It doesn't
>>>>>>> seem worth the additional complexity in Iceberg when you can plug in
>>>>>>> through the API intended to be Spark's own plugin API, if that makes
>>>>>>> sense.
>>>>>>>
>>>>>>> Ryan
>>>>>>>
>>>>>>> On Wed, Nov 10, 2021 at 6:54 AM Ajantha Bhat <ajanthab...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Community!
>>>>>>>>
>>>>>>>> If Iceberg provided the capability to plug in procedures, it would
>>>>>>>> be really helpful for users to plug in their own Spark actions to
>>>>>>>> handle their business logic around Iceberg tables.
>>>>>>>> So, can we have a mechanism that allows plugging in additional
>>>>>>>> implementations of
>>>>>>>> *org.apache.spark.sql.connector.iceberg.catalog.Procedure* for all
>>>>>>>> users of SparkCatalog and SparkSessionCatalog by just dropping in an
>>>>>>>> additional jar?
>>>>>>>>
>>>>>>>> Without this feature, users can still add their custom procedures by
>>>>>>>> extending *SparkCatalog* and/or *SparkSessionCatalog* and overriding
>>>>>>>> *loadProcedure*, which requires users to configure the subclasses of
>>>>>>>> Spark[Session]Catalog in their Spark configuration. That is a lot of
>>>>>>>> work and not a clean way to handle this.
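>>>>>>>>
>>>>>>>> For reference, that workaround looks roughly like this. The class
>>>>>>>> CustomProcedure is hypothetical, a placeholder for any real
>>>>>>>> *Procedure* implementation:
>>>>>>>>
>>>>>>>> import org.apache.iceberg.spark.SparkCatalog;
>>>>>>>> import org.apache.spark.sql.catalyst.analysis.NoSuchProcedureException;
>>>>>>>> import org.apache.spark.sql.connector.catalog.Identifier;
>>>>>>>> import org.apache.spark.sql.connector.iceberg.catalog.Procedure;
>>>>>>>>
>>>>>>>> public class CustomSparkCatalog extends SparkCatalog {
>>>>>>>>   @Override
>>>>>>>>   public Procedure loadProcedure(Identifier ident)
>>>>>>>>       throws NoSuchProcedureException {
>>>>>>>>     if ("my_procedure".equalsIgnoreCase(ident.name())) {
>>>>>>>>       // CustomProcedure is hypothetical: any class implementing
>>>>>>>>       // the Procedure interface
>>>>>>>>       return new CustomProcedure();
>>>>>>>>     }
>>>>>>>>     return super.loadProcedure(ident); // keep built-in procedures
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>
>>>>>>>> Every user then has to point their Spark configuration at the
>>>>>>>> subclass (e.g. spark.sql.catalog.<name> set to a hypothetical
>>>>>>>> com.example.CustomSparkCatalog), which is exactly the burden
>>>>>>>> described above.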
>>>>>>>>
>>>>>>>> Another option is to add these custom procedures as UDFs, but UDFs
>>>>>>>> are meant to be column-related. It doesn't make sense to use a UDF
>>>>>>>> for Spark actions.
>>>>>>>>
>>>>>>>>
>>>>>>>> *So, what do most of you think about having pluggable procedures in
>>>>>>>> Iceberg? Would this feature solve your problems too?*
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Ajantha
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Tabular
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
>
> --
> Ryan Blue
> Tabular
>
