Hi Ryan Blue and Ryan Murray,
*Thanks for giving your inputs, but I think we still need to reach a conclusion on this.*

@Ryan Blue:

> You shouldn't need to extend Iceberg's SparkCatalog to plug in stored procedures. The Iceberg Spark extensions should support stored procedures exposed by any catalog plugin that implements `ProcedureCatalog` across the Spark versions where Iceberg has stored procedures.

*ProcedureCatalog::loadProcedure* uses *ProcedureBuilder*, which still needs a catalog. So, how do we achieve this without extending the catalog? Some examples of *an external project using ProcedureCatalog with just a dependency on the Iceberg jars* would help clear up my doubts.
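To make the question concrete, below is roughly the kind of standalone plugin I understand you to be suggesting. This is only a sketch of my understanding, not tested code: the class and procedure names (MyProcedureCatalog, my_procedure) are made up, and only the interfaces come from the Iceberg Spark runtime jar.

    package com.example;

    import org.apache.spark.sql.catalyst.InternalRow;
    import org.apache.spark.sql.connector.catalog.Identifier;
    import org.apache.spark.sql.connector.iceberg.catalog.Procedure;
    import org.apache.spark.sql.connector.iceberg.catalog.ProcedureCatalog;
    import org.apache.spark.sql.connector.iceberg.catalog.ProcedureParameter;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;
    import org.apache.spark.sql.util.CaseInsensitiveStringMap;

    // Hypothetical external plugin: implements ProcedureCatalog directly,
    // without extending Iceberg's SparkCatalog and without ProcedureBuilder.
    public class MyProcedureCatalog implements ProcedureCatalog {
      private String catalogName;

      @Override
      public void initialize(String name, CaseInsensitiveStringMap options) {
        this.catalogName = name;
      }

      @Override
      public String name() {
        return catalogName;
      }

      @Override
      public Procedure loadProcedure(Identifier ident) {
        if ("my_procedure".equals(ident.name())) {
          return new MyProcedure();
        }
        // a real implementation would throw NoSuchProcedureException here
        throw new IllegalArgumentException("Unknown procedure: " + ident);
      }

      // Hypothetical procedure: takes a table name, runs some custom Spark
      // action around the Iceberg table, and returns a single result column.
      static class MyProcedure implements Procedure {
        @Override
        public ProcedureParameter[] parameters() {
          return new ProcedureParameter[] {
              ProcedureParameter.required("table", DataTypes.StringType)
          };
        }

        @Override
        public StructType outputType() {
          return new StructType().add("result", DataTypes.StringType);
        }

        @Override
        public InternalRow[] call(InternalRow args) {
          // the custom business logic would go here
          return new InternalRow[0];
        }

        @Override
        public String description() {
          return "example custom procedure";
        }
      }
    }

If users then configure spark.sql.catalog.custom = com.example.MyProcedureCatalog along with the Iceberg SQL extensions, is CALL custom.my_procedure('db.tbl') expected to resolve through the extensions without any Iceberg catalog involved? That is the part I am unsure about.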
Thanks,
Ajantha

On Fri, Nov 12, 2021 at 4:50 PM Ryan Murray <rym...@gmail.com> wrote:

> Thanks Ryan for the response.
>
> Maybe I am misunderstanding here, apologies for that. However, I don't see the code where the Spark extensions can find other procedure catalogs without the user having to configure and reference another catalog.
>
> Thinking about it more, I think the goal of this discussion is to find a way for operators and vendors to expose a set of procedures to end users without resorting to special end-user config or forks. Currently there is no way to turn off, replace, or add to the set of procedures shipped with the Iceberg runtime jar. Some of the use cases I envision here: users with permission to append to a table who shouldn't be running maintenance procedures, or a custom compaction job rather than the one shipped in Iceberg. The only option as I see it is to add new ProcedureCatalogs and hope end users don't run the existing procedures shipped with the catalog they are already using to read/write data.
>
> Best,
> Ryan
>
> On Thu, Nov 11, 2021 at 9:10 PM Ryan Blue <b...@tabular.io> wrote:
>
>> I think there's a bit of a misunderstanding here. You shouldn't need to extend Iceberg's SparkCatalog to plug in stored procedures. The Iceberg Spark extensions should support stored procedures exposed by any catalog plugin that implements `ProcedureCatalog` across the Spark versions where Iceberg has stored procedures. Since the API should be nearly the same, it will be easy to update when Spark supports `ProcedureCatalog` directly.
>>
>> That translates to less code for Iceberg to manage and no long-term debt supporting procedures plugged in through Iceberg instead of through a Spark interface.
>>
>> On Thu, Nov 11, 2021 at 8:08 AM Ryan Murray <rym...@gmail.com> wrote:
>>
>>> Hey Ryan,
>>>
>>> What is the timeline for ProcedureCatalog to be moved into Spark, and will it be backported? I agree 100% that it's the 'correct' way to go long term, but currently Iceberg has a `static final Map`[1] of valid procedures and no way for users to customize it. I personally don't love a static map regardless of wanting to register custom procedures; it's too error-prone for maintainers and procedure developers (especially now that there are 3 versions of it in the codebase). Additionally, asking teams to extend SparkCatalog, with all the associated Spark config changes for end users, just to add custom procedures seems a bit heavy-handed compared to the relatively small change of adding a registration mechanism to the existing ProcedureCatalog. This also unblocks teams using <=Spark 3.2 (and whatever future Spark versions ship before the ProcedureCatalog is upstreamed).
>>>
>>> There already appears to be a private fork of Iceberg using ServiceLoader[2] and a draft PR using a similar approach[3]. I agree with your comments on [3], and I am wondering if there is a middle ground: a registration method in line with Spark, and/or a way for Iceberg catalogs to specify their procedures (though I am not sure how to do that in a cross-engine way). My goal here is to avoid having custom implementations of SparkCatalog for everyone who may be interested in adding their own procedures. What do you think?
>>>
>>> Best,
>>> Ryan
>>>
>>> [1] https://github.com/apache/iceberg/blob/master/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/procedures/SparkProcedures.java#L29
>>> [2] https://github.com/apache/iceberg/issues/3254#issuecomment-943845848
>>> [3] https://github.com/apache/iceberg/pull/3367
>>>
>>> On Thu, Nov 11, 2021 at 2:03 AM Ryan Blue <b...@tabular.io> wrote:
>>>
>>>> I think that probably the best way to handle this use case is to have people implement the Iceberg `ProcedureCatalog` API. That's what we want to get upstream into Spark, and it is a really reasonable (and small) addition to Spark.
>>>>
>>>> The problem with adding pluggable procedures to Iceberg is that it is really working around the fact that Spark doesn't support plugging in procedures yet. This is specific to Spark, and we would have to keep it alive well past when we get `ProcedureCatalog` upstream. It doesn't seem worth the additional complexity in Iceberg when you can plug in through the API intended to be Spark's own plugin API, if that makes sense.
>>>>
>>>> Ryan
>>>>
>>>> On Wed, Nov 10, 2021 at 6:54 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>>>>
>>>>> Hi Community!
>>>>>
>>>>> If Iceberg provided a capability to plug in procedures, it would be really helpful for users to plug in their own Spark actions to handle their business logic around Iceberg tables. So, can we have a mechanism that allows plugging in additional implementations of *org.apache.spark.sql.connector.iceberg.catalog.Procedure* for all users of SparkCatalog and SparkSessionCatalog by just dropping in an additional jar?
>>>>>
>>>>> Without this feature, users can still add custom procedures by extending *SparkCatalog* and/or *SparkSessionCatalog* and overriding *loadProcedure*, which requires users to configure the subclasses of Spark[Session]Catalog in their Spark configuration. That is a lot of work and not a clean way to handle this.
>>>>>
>>>>> Another option is to add these custom procedures as UDFs, but UDFs are meant to be column-related. It doesn't make sense to have a UDF for a Spark action.
>>>>>
>>>>> *So, I want to know what most of you think about having pluggable procedures in Iceberg. Does this feature solve your problems too?*
>>>>>
>>>>> Thanks,
>>>>> Ajantha
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>
>> --
>> Ryan Blue
>> Tabular
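P.S. For comparison, the extend-SparkCatalog workaround discussed in the quoted mails above would look roughly like this. Again only a sketch with made-up names, building on the MyProcedure class from my earlier sketch; I believe the super.loadProcedure fallback keeps the built-in procedures working, but I have not verified it.

    package com.example;

    import org.apache.spark.sql.catalyst.analysis.NoSuchProcedureException;
    import org.apache.spark.sql.connector.catalog.Identifier;
    import org.apache.spark.sql.connector.iceberg.catalog.Procedure;

    // Hypothetical subclass of Iceberg's SparkCatalog that adds one custom
    // procedure and falls back to the built-in ones for everything else.
    public class CustomSparkCatalog extends org.apache.iceberg.spark.SparkCatalog {
      @Override
      public Procedure loadProcedure(Identifier ident) throws NoSuchProcedureException {
        // Iceberg's built-in procedures live under the 'system' namespace,
        // so this sketch reuses it for the custom procedure as well
        if (isSystem(ident) && "my_procedure".equals(ident.name())) {
          return new MyProcedureCatalog.MyProcedure();
        }
        return super.loadProcedure(ident);
      }

      private static boolean isSystem(Identifier ident) {
        return ident.namespace().length == 1
            && "system".equalsIgnoreCase(ident.namespace()[0]);
      }
    }

End users then have to set spark.sql.catalog.my_catalog = com.example.CustomSparkCatalog instead of org.apache.iceberg.spark.SparkCatalog, which is exactly the per-user configuration burden this thread is trying to avoid.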