Thanks Ryan for the response. Maybe I am misunderstanding here, apologies for that. However, I don't see the code where the Spark extensions can find other procedure catalogs w/o the user having to configure and reference another catalog.
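
To make concrete what I mean by "configure and reference another catalog": as far as I can tell, an end user today would need extra Spark config along these lines (the catalog name "maintenance", the class com.example.CustomProcedureCatalog, and the procedure custom_compaction are made up for illustration):

    spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
    spark.sql.catalog.maintenance=com.example.CustomProcedureCatalog
    spark.sql.catalog.maintenance.type=hive

and would then have to call the custom procedure through that extra catalog name, e.g. CALL maintenance.system.custom_compaction(table => 'db.tbl'), rather than through the catalog they already use to read and write the table.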
Thinking about it more, I think the goal of this discussion is to find a way for operators and vendors to expose a set of procedures to end users w/o resorting to special end-user config or forks. Currently there is no way to turn off, replace, or add to the set of procedures shipped with the Iceberg runtime jar. Some of the use cases I envision here are: users who have permission to append to a table but shouldn't be running maintenance procedures, or running a custom compaction job rather than the one shipped in Iceberg. The only option as I see it is to add new ProcedureCatalogs and hope end users don't run the existing procedures that are shipped with the catalog they are already using to read/write data.
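
To sketch the kind of mechanism I have in mind (purely illustrative, and not what the ServiceLoader-based fork/PR referenced in my earlier mail below actually implement; ProcedureProvider and ProcedureRegistry are hypothetical names, and the real procedure map in Iceberg uses internal builder types), extra procedures could be discovered from jars on the classpath and merged over the built-in set:

    package com.example.procedures;

    import java.util.HashMap;
    import java.util.Map;
    import java.util.ServiceLoader;
    import java.util.function.Supplier;
    import org.apache.spark.sql.connector.iceberg.catalog.Procedure;

    public final class ProcedureRegistry {

      // Hypothetical extension point: a jar contributes named procedures by
      // implementing this interface and listing the implementation under
      // META-INF/services/com.example.procedures.ProcedureRegistry$ProcedureProvider.
      public interface ProcedureProvider {
        Map<String, Supplier<Procedure>> procedures();
      }

      private ProcedureRegistry() {
      }

      // Merge classpath-provided procedures over the built-in set. Providers can
      // add new names or replace built-in ones, which is how an operator could
      // swap in a custom compaction job or drop the maintenance procedures.
      public static Map<String, Supplier<Procedure>> load(Map<String, Supplier<Procedure>> builtIns) {
        Map<String, Supplier<Procedure>> merged = new HashMap<>(builtIns);
        for (ProcedureProvider provider : ServiceLoader.load(ProcedureProvider.class)) {
          merged.putAll(provider.procedures());
        }
        return merged;
      }
    }

Something like this would let an operator control which procedures are available just by shipping (or omitting) provider jars, with no per-user Spark config and no fork.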
Best,
Ryan

On Thu, Nov 11, 2021 at 9:10 PM Ryan Blue <b...@tabular.io> wrote:

> I think there's a bit of a misunderstanding here. You shouldn't need to extend Iceberg's SparkCatalog to plug in stored procedures. The Iceberg Spark extensions should support stored procedures exposed by any catalog plugin that implements `ProcedureCatalog` across the Spark versions where Iceberg has stored procedures. Since the API should be nearly the same, it will be easy to update when Spark supports `ProcedureCatalog` directly.
>
> That translates to less code for Iceberg to manage and no long-term debt supporting procedures plugged in through Iceberg instead of through a Spark interface.
>
> On Thu, Nov 11, 2021 at 8:08 AM Ryan Murray <rym...@gmail.com> wrote:
>
>> Hey Ryan,
>>
>> What is the timeline for ProcedureCatalog to be moved into Spark, and will it be backported? I agree 100% that it's the 'correct' way to go long term, but currently Iceberg has a `static final Map`[1] of valid procedures and no way for users to customize that. I personally don't love a static map regardless of wanting to register custom procedures; it's too error-prone for maintainers and procedure developers (especially now that there are 3 versions of it in the codebase). Additionally, asking teams to extend SparkCatalog, with all the associated Spark config changes for end users, just to add custom procedures seems a bit heavy-handed compared to the relatively small change of adding a registration mechanism to the existing ProcedureCatalog. This also unblocks teams using <= Spark 3.2 (and whatever future Spark versions ship before ProcedureCatalog is upstreamed).
>>
>> There already appears to be a private fork of Iceberg using ServiceLoader[2] and a draft PR using a similar approach[3]. I agree with your comments on [3], and I am wondering if there is a middle ground with a registration method in line with Spark and/or a way for Iceberg catalogs to specify their procedures (though I am not sure how to do so in a cross-engine way). My goal here is to avoid having custom implementations of SparkCatalog for everyone who may be interested in adding their own procedures. What do you think?
>>
>> Best,
>> Ryan
>>
>> [1] https://github.com/apache/iceberg/blob/master/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/procedures/SparkProcedures.java#L29
>> [2] https://github.com/apache/iceberg/issues/3254#issuecomment-943845848
>> [3] https://github.com/apache/iceberg/pull/3367
>>
>> On Thu, Nov 11, 2021 at 2:03 AM Ryan Blue <b...@tabular.io> wrote:
>>
>>> I think that probably the best way to handle this use case is to have people implement the Iceberg `ProcedureCatalog` API. That's what we want to get upstream into Spark and is a really reasonable (and small) addition to Spark.
>>>
>>> The problem with adding pluggable procedures to Iceberg is that it is really working around the fact that Spark doesn't support plugging in procedures yet. This is specific to Spark, and we would have to keep it alive well past when we get `ProcedureCatalog` upstream. It doesn't seem worth the additional complexity in Iceberg when you can plug in through the API intended to be Spark's own plugin API, if that makes sense.
>>>
>>> Ryan
>>>
>>> On Wed, Nov 10, 2021 at 6:54 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>>>
>>>> Hi Community!
>>>>
>>>> If Iceberg provides a capability to plug in procedures, it will be really helpful for users to plug in their own Spark actions to handle their business logic around Iceberg tables. So, can we have a mechanism that allows plugging in additional implementations of *org.apache.spark.sql.connector.iceberg.catalog.Procedure* for all users of SparkCatalog and SparkSessionCatalog by just dropping in an additional jar?
>>>>
>>>> Without this feature, users can still add their custom procedures by extending *SparkCatalog* and/or *SparkSessionCatalog* and overriding *loadProcedure*, which requires users to configure the subclasses of Spark[Session]Catalog in their Spark configuration. That is a lot of work and not a clean way to handle this.
>>>>
>>>> Another option is to add these custom procedures as UDFs, but UDFs are meant to be column-related. It doesn't make sense to have a UDF for Spark actions.
>>>>
>>>> *So, I want to know what most of you think about having pluggable procedures in Iceberg? Does this feature solve your problems too?*
>>>>
>>>> Thanks,
>>>> Ajantha
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>
> --
> Ryan Blue
> Tabular