Thanks for the additional context. #1 makes more sense now. What are you
thinking about exposing that isn't in the standard set of procedures? Are
there procedures that you'd want for an Iceberg table when using Nessie as
the catalog? Would it work to add a SupportsProcedures interface on the
Iceberg catalog so you can plug in through there? That sounds like the
cleanest way to do what you're talking about, at least to me.

For modifying the existing procedures, we could look at plugging in through
the Iceberg catalog rather than directly in the Spark catalog as well.
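
To make that concrete, here's a rough sketch of the kind of interface I have
in mind (the name and shape are hypothetical, nothing like this exists in
Iceberg today; I'm reusing the Spark-side
org.apache.spark.sql.connector.iceberg.catalog.Procedure type just for
illustration):

    import org.apache.spark.sql.connector.iceberg.catalog.Procedure;

    // Hypothetical mix-in that an Iceberg catalog could implement to expose
    // additional procedures to the engine-side catalog.
    public interface SupportsProcedures {
      // Return the procedure registered under the given name, or null if unknown.
      Procedure loadProcedure(String name);
    }

SparkCatalog's loadProcedure could then check whether the wrapped Iceberg
catalog implements that interface and delegate to it, falling back to the
built-in procedure map otherwise.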

On Fri, Nov 19, 2021 at 12:13 AM Ryan Murray <rym...@gmail.com> wrote:

> Hey Ryan,
>
> Thanks for the follow up. As I see it the use cases are as follows:
>
> 1) Add more Iceberg-specific procedures to the existing catalog. This is
> for cases where we are actually operating on Iceberg tables (not "to
> expose custom stored procedures for non-Iceberg systems").
> 2) Modify existing OSS Iceberg procedures, e.g. a custom version of
> compaction, or a migrate-table procedure that also rewrites the data.
> 3) Disable existing OSS Iceberg procedures, e.g. regular users shouldn't
> be running the snapshot action, and regular users with write permissions
> shouldn't be able to execute potentially destructive actions (e.g. expire).
>
> For use cases two and three I see (today) no option other than extending
> org.apache.iceberg.spark.SparkCatalog; for the first use case we can
> either extend SparkCatalog or implement a whole new ProcedureCatalog as
> you suggested above. I think both options are rather counter-intuitive
> for users: they have to register extra or different catalogs, and simply
> following our (Iceberg's) online documentation isn't going to work for
> them. Additionally, following the online documentation could still lead
> them to executing the wrong, or destructive, procedures.
>
> I have raised [1] to discuss concrete ways to address this.
>
> Best,
> Ryan
>
> [1] https://github.com/apache/iceberg/pull/3579
>
> On Thu, Nov 18, 2021 at 10:51 PM Ryan Blue <b...@tabular.io> wrote:
>
>> I don’t see the code where the spark extensions can find other procedure
>> catalogs w/o the user having to configure and reference another catalog.
>>
>> Yes, that’s right. If other systems want to add stored procedures, then
>> they would need to add a catalog. Is there a strong use case around adding
>> more procedures to Iceberg’s catalogs themselves? Maybe I just don’t
>> understand your use case?
>>
>> I don’t see the problem with registering a catalog to expose custom
>> stored procedures for non-Iceberg systems. But let’s start talking about a
>> specific case that you have in mind.
>>
>> Maybe some examples of using ProcedureCatalog from an external project
>> that just has a dependency on the Iceberg jars
>>
>>
>>    1. Implement ProcedureCatalog
>>    2. Configure a Spark catalog that uses your ProcedureCatalog
>>    implementation (spark.sql.catalog.pcat = com.company.ProcedureCatalog)
>>    3. Call procedures using that catalog: CALL pcat.some_function()
>>
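>> To make the steps concrete, a minimal sketch of step 1 could look like the
>> following (CompanyProcedureCatalog, the some_function name, and
>> SomeFunctionProcedure are made up for illustration; double-check the exact
>> ProcedureCatalog and NoSuchProcedureException signatures against the
>> Iceberg version you build with):
>>
>>     import org.apache.spark.sql.catalyst.analysis.NoSuchProcedureException;
>>     import org.apache.spark.sql.connector.catalog.Identifier;
>>     import org.apache.spark.sql.connector.iceberg.catalog.Procedure;
>>     import org.apache.spark.sql.connector.iceberg.catalog.ProcedureCatalog;
>>     import org.apache.spark.sql.util.CaseInsensitiveStringMap;
>>
>>     public class CompanyProcedureCatalog implements ProcedureCatalog {
>>       private String catalogName;
>>
>>       @Override
>>       public void initialize(String name, CaseInsensitiveStringMap options) {
>>         this.catalogName = name;
>>       }
>>
>>       @Override
>>       public String name() {
>>         return catalogName;
>>       }
>>
>>       @Override
>>       public Procedure loadProcedure(Identifier ident) throws NoSuchProcedureException {
>>         if ("some_function".equals(ident.name())) {
>>           return new SomeFunctionProcedure();  // your Procedure implementation
>>         }
>>         throw new NoSuchProcedureException(ident);
>>       }
>>     }
>>
>> With spark.sql.catalog.pcat pointing at that class (step 2), CALL
>> pcat.some_function() (step 3) should be resolved by the Iceberg Spark
>> extensions.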
>>
>> On Fri, Nov 12, 2021 at 3:20 AM Ryan Murray <rym...@gmail.com> wrote:
>>
>>> Thanks Ryan for the response.
>>>
>>> Maybe I am misunderstanding here, apologies for that. However, I don't
>>> see the code where the spark extensions can find other procedure catalogs
>>> w/o the user having to configure and reference another catalog.
>>>
>>> Thinking about it more, I believe the goal of this discussion is to find
>>> a way for operators and vendors to expose a set of procedures to end
>>> users without resorting to special end-user config or forks. Currently
>>> there is no way to turn off, replace, or add to the set of procedures
>>> shipped with the Iceberg runtime jar. Some of the use cases I envision
>>> here: users who have permission to append to a table but shouldn't be
>>> running maintenance procedures, or a custom compaction job that replaces
>>> the one shipped in Iceberg. The only option, as I see it, is to add new
>>> ProcedureCatalogs and hope end users don't run the existing procedures
>>> that are shipped with the catalog they are already using to read/write
>>> data.
>>>
>>> Best,
>>> Ryan
>>>
>>> On Thu, Nov 11, 2021 at 9:10 PM Ryan Blue <b...@tabular.io> wrote:
>>>
>>>> I think there's a bit of a misunderstanding here. You shouldn't need to
>>>> extend Iceberg's SparkCatalog to plug in stored procedures. The Iceberg
>>>> Spark extensions should support stored procedures exposed by any catalog
>>>> plugin that implements `ProcedureCatalog` across the Spark versions where
>>>> Iceberg has stored procedures. Since the API should be nearly the same, it
>>>> will be easy to update when Spark supports `ProcedureCatalog` directly.
>>>>
>>>> That translates to less code for Iceberg to manage and no long-term
>>>> debt supporting procedures plugged in through Iceberg instead of through a
>>>> Spark interface.
>>>>
>>>> On Thu, Nov 11, 2021 at 8:08 AM Ryan Murray <rym...@gmail.com> wrote:
>>>>
>>>>> Hey Ryan,
>>>>>
>>>>> What is the timeline for ProcedureCatalog to be moved into Spark, and
>>>>> will it be backported? I agree 100% that it's the 'correct' way to go
>>>>> long term, but currently Iceberg has a `static final Map`[1] of valid
>>>>> procedures and no way for users to customize it. I personally don't love
>>>>> a static map even apart from registering custom procedures; it's too
>>>>> error prone for maintainers and procedure developers (especially now
>>>>> that there are 3 versions of it in the codebase). Additionally, asking
>>>>> teams to extend SparkCatalog, with all the associated Spark config
>>>>> changes for end users, just to add custom procedures seems a bit heavy
>>>>> handed compared to the relatively small change of adding a registration
>>>>> mechanism to the existing ProcedureCatalog. That would also unblock
>>>>> teams using Spark <= 3.2 (and whatever future Spark versions ship before
>>>>> the ProcedureCatalog is upstreamed).
>>>>>
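>>>>> To give a sense of the size of change I mean, a registration hook could
>>>>> be as small as something like this (ProcedureRegistry and its methods
>>>>> are purely hypothetical, not an existing Iceberg API; the idea is that
>>>>> the built-in lookup in SparkProcedures would consult it before its
>>>>> static map):
>>>>>
>>>>>     import java.util.Map;
>>>>>     import java.util.concurrent.ConcurrentHashMap;
>>>>>     import java.util.function.Supplier;
>>>>>     import org.apache.spark.sql.connector.iceberg.catalog.Procedure;
>>>>>
>>>>>     // Hypothetical registry that downstream projects could populate at
>>>>>     // startup (directly or via a ServiceLoader hook) to add or override
>>>>>     // procedures by name.
>>>>>     public class ProcedureRegistry {
>>>>>       private static final Map<String, Supplier<Procedure>> EXTRAS =
>>>>>           new ConcurrentHashMap<>();
>>>>>
>>>>>       public static void register(String name, Supplier<Procedure> supplier) {
>>>>>         EXTRAS.put(name, supplier);
>>>>>       }
>>>>>
>>>>>       // Returns the registered procedure, or null so callers can fall
>>>>>       // back to the built-in map.
>>>>>       public static Procedure load(String name) {
>>>>>         Supplier<Procedure> supplier = EXTRAS.get(name);
>>>>>         return supplier != null ? supplier.get() : null;
>>>>>       }
>>>>>     }
>>>>>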
>>>>> There already appears to be a private fork of Iceberg using
>>>>> ServiceLoader [2] and a draft PR taking a similar approach [3]. I agree
>>>>> with your comments on [3], and I am wondering if there is a middle
>>>>> ground: a registration method in line with Spark, and/or a way for
>>>>> Iceberg catalogs to specify their procedures (though I am not sure how
>>>>> to do that in a cross-engine way). My goal here is to avoid custom
>>>>> implementations of SparkCatalog for everyone who may be interested in
>>>>> adding their own procedures. What do you think?
>>>>>
>>>>> Best,
>>>>> Ryan
>>>>>
>>>>> [1]
>>>>> https://github.com/apache/iceberg/blob/master/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/procedures/SparkProcedures.java#L29
>>>>> [2]
>>>>> https://github.com/apache/iceberg/issues/3254#issuecomment-943845848
>>>>> [3] https://github.com/apache/iceberg/pull/3367
>>>>>
>>>>> On Thu, Nov 11, 2021 at 2:03 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>
>>>>>> I think the best way to handle this use case is probably to have people
>>>>>> implement the Iceberg `ProcedureCatalog` API. That's what we want to get
>>>>>> upstream into Spark, and it is a really reasonable (and small) addition
>>>>>> to Spark.
>>>>>>
>>>>>> The problem with adding pluggable procedures to Iceberg is that it is
>>>>>> really working around the fact that Spark doesn't support plugging in
>>>>>> procedures yet. This is specific to Spark, and we would have to keep it
>>>>>> alive well past when we get `ProcedureCatalog` upstream. It doesn't seem
>>>>>> worth the additional complexity in Iceberg when you can plug in through
>>>>>> what is intended to become Spark's own plugin API, if that makes sense.
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>> On Wed, Nov 10, 2021 at 6:54 AM Ajantha Bhat <ajanthab...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Community!
>>>>>>>
>>>>>>> If Iceberg provides a capability to plug in procedures, it will be
>>>>>>> really helpful for users to plug in their own Spark actions to handle
>>>>>>> their business logic around Iceberg tables.
>>>>>>> So, can we have a mechanism that allows plugging in additional
>>>>>>> implementations of
>>>>>>> org.apache.spark.sql.connector.iceberg.catalog.Procedure
>>>>>>> for all users of SparkCatalog and SparkSessionCatalog by just
>>>>>>> dropping an additional jar?
>>>>>>>
>>>>>>> Without this feature, users can still add their custom procedures by
>>>>>>> extending SparkCatalog and/or SparkSessionCatalog and overriding
>>>>>>> loadProcedure, which requires users to configure the subclasses of
>>>>>>> Spark[Session]Catalog in their Spark configuration. That is a lot of
>>>>>>> work and not a clean way to handle this.
>>>>>>>
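>>>>>>> For reference, that workaround looks roughly like this (the class,
>>>>>>> my_custom_action, and MyProcedure are made-up names; the loadProcedure
>>>>>>> signature is what the current Spark 3.x modules expose and may differ
>>>>>>> between Iceberg versions):
>>>>>>>
>>>>>>>     import org.apache.iceberg.spark.SparkCatalog;
>>>>>>>     import org.apache.spark.sql.catalyst.analysis.NoSuchProcedureException;
>>>>>>>     import org.apache.spark.sql.connector.catalog.Identifier;
>>>>>>>     import org.apache.spark.sql.connector.iceberg.catalog.Procedure;
>>>>>>>
>>>>>>>     public class CustomSparkCatalog extends SparkCatalog {
>>>>>>>       @Override
>>>>>>>       public Procedure loadProcedure(Identifier ident) throws NoSuchProcedureException {
>>>>>>>         if ("my_custom_action".equals(ident.name())) {
>>>>>>>           return new MyProcedure();  // made-up custom Procedure implementation
>>>>>>>         }
>>>>>>>         if ("expire_snapshots".equals(ident.name())) {
>>>>>>>           throw new NoSuchProcedureException(ident);  // example: hide a destructive built-in
>>>>>>>         }
>>>>>>>         return super.loadProcedure(ident);  // keep the rest of the shipped procedures
>>>>>>>       }
>>>>>>>     }
>>>>>>>
>>>>>>> and then every user has to set spark.sql.catalog.<name> to
>>>>>>> CustomSparkCatalog instead of SparkCatalog, which is exactly the extra
>>>>>>> configuration this proposal is trying to avoid.
>>>>>>>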
>>>>>>> Another option is to add these custom procedures as UDFs, but UDFs are
>>>>>>> meant to be column-related. It doesn't make sense to use UDFs for Spark
>>>>>>> actions.
>>>>>>>
>>>>>>>
>>>>>>> So, what do most of you think about having pluggable procedures in
>>>>>>> Iceberg? Would this feature solve your problems too?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ajantha
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>

-- 
Ryan Blue
Tabular
