I think there's a bit of a misunderstanding here. You shouldn't need to
extend Iceberg's SparkCatalog to plug in stored procedures. The Iceberg
Spark extensions should support stored procedures exposed by any catalog
plugin that implements `ProcedureCatalog` across the Spark versions where
Iceberg has stored procedures. Since the API should be nearly the same, it
will be easy to update when Spark supports `ProcedureCatalog` directly.

That translates to less code for Iceberg to manage and no long-term debt
supporting procedures plugged in through Iceberg instead of through a Spark
interface.
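
Here's a rough, untested sketch of what I mean: a standalone catalog plugin
that exposes a single no-op procedure through `ProcedureCatalog`. The class,
package, and procedure names are made up, and the extension package names are
the ones the Iceberg Spark 3.x modules currently use, so they may move
between versions.

package com.example;

import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.catalyst.analysis.NoSuchProcedureException;
import org.apache.spark.sql.connector.catalog.Identifier;
import org.apache.spark.sql.connector.iceberg.catalog.Procedure;
import org.apache.spark.sql.connector.iceberg.catalog.ProcedureCatalog;
import org.apache.spark.sql.connector.iceberg.catalog.ProcedureParameter;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.util.CaseInsensitiveStringMap;

// Illustrative only: a real plugin would usually also implement TableCatalog.
public class MyProcedureCatalog implements ProcedureCatalog {
  private String catalogName;

  @Override
  public void initialize(String name, CaseInsensitiveStringMap options) {
    this.catalogName = name;
  }

  @Override
  public String name() {
    return catalogName;
  }

  @Override
  public Procedure loadProcedure(Identifier ident) throws NoSuchProcedureException {
    if (ident.namespace().length == 1
        && "system".equalsIgnoreCase(ident.namespace()[0])
        && "noop".equalsIgnoreCase(ident.name())) {
      return new Procedure() {
        @Override
        public ProcedureParameter[] parameters() {
          return new ProcedureParameter[0];
        }

        @Override
        public StructType outputType() {
          return new StructType();
        }

        @Override
        public InternalRow[] call(InternalRow args) {
          // custom business logic would go here
          return new InternalRow[0];
        }
      };
    }

    throw new NoSuchProcedureException(ident);
  }
}

Register it like any other catalog, e.g.
spark.sql.catalog.my_catalog=com.example.MyProcedureCatalog, and with the
Iceberg SQL extensions enabled, `CALL my_catalog.system.noop()` should
resolve through that plugin rather than through Iceberg's SparkCatalog.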

On Thu, Nov 11, 2021 at 8:08 AM Ryan Murray <rym...@gmail.com> wrote:

> Hey Ryan,
>
> What is the timeline for ProcedureCatalog to be moved into Spark, and will
> it be backported? I agree 100% that it's the 'correct' way to go long term,
> but currently Iceberg has a `static final Map`[1] of valid procedures and
> no way for users to customize it. I personally don't love a static map
> regardless of wanting to register custom procedures; it's too error-prone
> for maintainers and procedure developers (especially now that there are 3
> versions of it in the codebase). Additionally, asking teams to extend
> SparkCatalog, with all the associated Spark config changes for end users,
> just to add custom procedures seems a bit heavy-handed compared to the
> relatively small change of adding a registration mechanism to the existing
> ProcedureCatalog (a rough sketch of what I mean is below). This also
> unblocks teams using Spark <= 3.2 (and whatever future Spark versions ship
> before the ProcedureCatalog is upstreamed).
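>
> To make the registration idea concrete, here is roughly the shape I have
> in mind. This is purely illustrative: the ProcedureProvider interface and
> the registry below do not exist in Iceberg today.
>
> import java.util.HashMap;
> import java.util.Locale;
> import java.util.Map;
> import java.util.ServiceLoader;
> import org.apache.spark.sql.connector.catalog.TableCatalog;
> import org.apache.spark.sql.connector.iceberg.catalog.Procedure;
>
> // Hypothetical SPI: a third-party jar implements this and lists the
> // implementation class in a META-INF/services file named after this
> // interface so that ServiceLoader can discover it.
> public interface ProcedureProvider {
>   String name();                          // procedure name, e.g. "my_proc"
>   Procedure build(TableCatalog catalog);  // build the procedure bound to a catalog
> }
>
> // Hypothetical registry that loadProcedure could consult in addition to
> // the existing static map of built-in procedures.
> class ProcedureRegistry {
>   static Map<String, ProcedureProvider> discover() {
>     Map<String, ProcedureProvider> byName = new HashMap<>();
>     for (ProcedureProvider provider : ServiceLoader.load(ProcedureProvider.class)) {
>       byName.put(provider.name().toLowerCase(Locale.ROOT), provider);
>     }
>     return byName;
>   }
> }
>
> Dropping a jar with such a provider on the classpath would then be enough
> to make the procedure visible, with no extra Spark configuration.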
>
> There already appears to be a private fork of Iceberg using
> ServiceLoader[2] and a draft PR taking a similar approach[3]. I agree with
> your comments on [3], and I am wondering if there is a middle ground: a
> registration method in line with Spark, and/or a way for Iceberg catalogs
> to specify their procedures (though I am not sure how to do the latter in a
> cross-engine way). My goal here is to avoid having custom implementations
> of SparkCatalog for everyone who may be interested in adding their own
> procedures. What do you think?
>
> Best,
> Ryan
>
> [1]
> https://github.com/apache/iceberg/blob/master/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/procedures/SparkProcedures.java#L29
> [2] https://github.com/apache/iceberg/issues/3254#issuecomment-943845848
> [3] https://github.com/apache/iceberg/pull/3367
>
> On Thu, Nov 11, 2021 at 2:03 AM Ryan Blue <b...@tabular.io> wrote:
>
>> I think that probably the best way to handle this use case is to have
>> people implement the Iceberg `ProcedureCatalog` API. That's what we want to
>> get upstream into Spark and is a really reasonable (and small) addition to
>> Spark.
>>
>> The problem with adding pluggable procedures to Iceberg is that it is
>> really working around the fact that Spark doesn't support plugging in
>> procedures yet. This is specific to Spark and we would have to keep it
>> alive well past the point when we get `ProcedureCatalog` upstream. It
>> doesn't seem worth the additional complexity in Iceberg when you can plug
>> in through the API that is intended to become Spark's own plugin API, if
>> that makes sense.
>>
>> Ryan
>>
>> On Wed, Nov 10, 2021 at 6:54 AM Ajantha Bhat <ajanthab...@gmail.com>
>> wrote:
>>
>>> Hi Community!
>>>
>>> If Iceberg provides a capability to plug in procedures, it will be really
>>> helpful for users to plug in their own Spark actions to handle their
>>> business logic around Iceberg tables.
>>> So, can we have a mechanism that allows plugging in additional
>>> implementations of *org.apache.spark.sql.connector.iceberg.catalog.Procedure*
>>> for all users of SparkCatalog and SparkSessionCatalog just by dropping in
>>> an additional jar?
>>>
>>> Without this feature, users can still add their custom procedures by
>>> extending *SparkCatalog* and/or *SparkSessionCatalog* and overriding
>>> *loadProcedure*, which requires users to configure the subclasses of
>>> Spark[Session]Catalog in their Spark configuration. That is a lot of
>>> work and not a clean way to handle this.
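>>>
>>> For example, the subclass would look roughly like this (untested; exact
>>> package names may differ by Iceberg version, and MyCustomProcedure stands
>>> in for whatever Procedure implementation we would write):
>>>
>>> import org.apache.iceberg.spark.SparkCatalog;
>>> import org.apache.spark.sql.catalyst.analysis.NoSuchProcedureException;
>>> import org.apache.spark.sql.connector.catalog.Identifier;
>>> import org.apache.spark.sql.connector.iceberg.catalog.Procedure;
>>>
>>> public class CustomSparkCatalog extends SparkCatalog {
>>>   @Override
>>>   public Procedure loadProcedure(Identifier ident) throws NoSuchProcedureException {
>>>     if (ident.namespace().length == 1
>>>         && "system".equalsIgnoreCase(ident.namespace()[0])
>>>         && "my_custom_proc".equalsIgnoreCase(ident.name())) {
>>>       return new MyCustomProcedure(this);  // our own Procedure implementation (not shown)
>>>     }
>>>     // fall back to Iceberg's built-in procedures
>>>     return super.loadProcedure(ident);
>>>   }
>>> }
>>>
>>> Users then have to set spark.sql.catalog.<name> to this subclass instead
>>> of the stock SparkCatalog, which is the configuration change I mentioned.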
>>>
>>> Another option is to add these custom procedures as UDFs, but UDFs are
>>> meant to be column-related. It doesn't make sense to use UDFs for Spark
>>> actions.
>>>
>>>
>>> *So, what do most of you think about having pluggable
>>> procedures in Iceberg? Does this feature solve your problems too?*
>>>
>>> Thanks,
>>> Ajantha
>>>
>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>

-- 
Ryan Blue
Tabular
