I like the idea of this a lot. I’ve seen a bunch of hacks at companies to
make global functions available within the company, and this seems like a
much better way of doing it.

For the requirements option, would it make sense to try and install them
dynamically? (Fail fast seems like the way to start though).

Twitter: https://twitter.com/holdenkarau
Fight Health Insurance: https://www.fighthealthinsurance.com/
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Pronouns: she/her


On Wed, Jan 28, 2026 at 1:12 PM Szehon Ho <[email protected]> wrote:

> This sounds useful, especially with Iceberg proposals like versioned SQL
> UDFs.  On the surface it sounds like we could extend the DSv2 FunctionCatalog
> (which, as you point out, lacks dynamic create/drop function support today),
> but I may be missing some details.  I'd also like to hear the opinions of
> others who have worked more on functions/UDFs.
>
> Thanks!
> Szehon
>
> On Wed, Jan 7, 2026 at 9:32 PM huaxin gao <[email protected]> wrote:
>
>> Hi Wenchen,
>>
>> Great question. In the SPIP, the language runtime is carried in the
>> function spec (for python / python-pandas) so catalogs can optionally
>> declare constraints on the execution environment.
>>
>> Concretely, the spec can include optional fields like:
>>
>>    - pythonVersion (e.g., "3.10")
>>    - requirements (pip-style specs)
>>    - environmentUri (optional pointer to a pre-built / admin-approved
>>      environment)
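>>
>> To make the shape concrete, here is a minimal sketch of how those fields
>> could sit on the spec. CodeFunctionSpec is the name from the SPIP, but the
>> exact accessors below are illustrative assumptions rather than the proposed
>> API:
>>
>>     // Illustrative only; field names follow the SPIP discussion, the
>>     // accessor signatures are assumptions.
>>     import java.util.List;
>>     import java.util.Optional;
>>
>>     public interface CodeFunctionSpec {
>>       String implementation();               // "spark-sql", "python", or "python-pandas"
>>       String body();                          // the code literal itself
>>
>>       // Optional execution-environment constraints:
>>       Optional<String> pythonVersion();       // e.g. "3.10"
>>       Optional<List<String>> requirements();  // pip-style specs, e.g. "pandas>=2.0"
>>       Optional<String> environmentUri();      // pre-built / admin-approved environment
>>     }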
>>
>> For the initial stage, we assume execution uses the existing PySpark
>> worker environment (same as regular Python UDF / pandas UDF). If
>> pythonVersion / requirements are present, Spark can validate them
>> against the current worker env and fail fast (AnalysisException) if they’re
>> not satisfied.
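>>
>> As a minimal sketch of that fail-fast check: the helper below, and the way
>> the worker's Python version and package list are obtained, are assumptions
>> for illustration; Spark would raise an AnalysisException at analysis time
>> rather than the plain exception used here.
>>
>>     import java.util.List;
>>     import java.util.Optional;
>>
>>     final class RuntimeConstraintCheck {
>>       // declared* come from the function spec; worker* describe the current
>>       // PySpark worker environment (how they are collected is out of scope here).
>>       static void validate(Optional<String> declaredPythonVersion,
>>                            Optional<List<String>> declaredRequirements,
>>                            String workerPythonVersion,
>>                            List<String> workerPackages) {
>>         if (declaredPythonVersion.isPresent()
>>             && !workerPythonVersion.startsWith(declaredPythonVersion.get())) {
>>           // Spark would surface this as an AnalysisException.
>>           throw new IllegalStateException("Function requires Python "
>>               + declaredPythonVersion.get() + " but worker runs " + workerPythonVersion);
>>         }
>>         for (String req : declaredRequirements.orElse(List.of())) {
>>           // Naive name-only check for illustration; a real check would parse
>>           // pip-style version specifiers.
>>           String pkg = req.split("[<>=!~ ]")[0];
>>           if (workerPackages.stream().noneMatch(p -> p.startsWith(pkg))) {
>>             throw new IllegalStateException("Worker env is missing required package: " + req);
>>           }
>>         }
>>       }
>>     }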
>>
>> environmentUri is intended as an extension point for future integration
>> (or vendor plugins) to select a vetted environment, but we don’t assume
>> Spark will provision environments out-of-the-box in v1.
>>
>> Thanks,
>>
>> Huaxin
>>
>> On Wed, Jan 7, 2026 at 6:06 PM Wenchen Fan <[email protected]> wrote:
>>
>>> This is a great feature! How do we define the language runtime? e.g. the
>>> Python version and libraries. Do we assume the Python runtime is the same
>>> as the PySpark worker?
>>>
>>> On Thu, Jan 8, 2026 at 3:12 AM huaxin gao <[email protected]>
>>> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I’d like to start a discussion on a draft SPIP
>>>> <https://docs.google.com/document/d/186cTAZxoXp1p8vaSunIaJmVLXcPR-FxSiLiDUl8kK8A/edit?tab=t.0#heading=h.for1fb3tezo3>
>>>> :
>>>>
>>>> *SPIP: Catalog-backed Code-Literal Functions (SQL and Python) with
>>>> Catalog SPI and CRUD*
>>>>
>>>> *Problem:* Spark can’t load SQL/Python function bodies from external
>>>> catalogs in a standard way today, so users rely on session registration or
>>>> vendor extensions.
>>>>
>>>> *Proposal:*
>>>>
>>>>    - Add CodeLiteralFunctionCatalog (Java SPI) returning CodeFunctionSpec
>>>>      with implementations (spark-sql, python, python-pandas).
>>>>    - Resolution:
>>>>       - SQL: parse + inline (deterministic ⇒ foldable).
>>>>       - Python/pandas: run via existing Python UDF / pandas UDF runtime
>>>>         (opaque).
>>>>       - SQL TVF: parse to plan, substitute params, validate schema.
>>>>    - DDL: CREATE/REPLACE/DROP FUNCTION delegates to the catalog if it
>>>>      implements the SPI; otherwise fall back.
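>>>>
>>>> For a rough feel of the SPI surface, a simplified sketch follows. The
>>>> names CodeLiteralFunctionCatalog / CodeFunctionSpec are from the SPIP,
>>>> but the method signatures below are illustrative, not the proposed
>>>> interface; please see the doc for details.
>>>>
>>>>     import java.util.Optional;
>>>>     import org.apache.spark.sql.connector.catalog.Identifier;
>>>>
>>>>     // Simplified sketch only; method signatures are assumptions.
>>>>     public interface CodeLiteralFunctionCatalog {
>>>>
>>>>       // Placeholder for the spec described above (implementation, code
>>>>       // body, optional environment constraints); fields elided here.
>>>>       interface CodeFunctionSpec {}
>>>>
>>>>       // Resolution: return the code-literal definition of a function, if present.
>>>>       Optional<CodeFunctionSpec> loadCodeFunction(Identifier ident);
>>>>
>>>>       // CRUD used by CREATE / REPLACE / DROP FUNCTION when the catalog
>>>>       // implements this SPI; otherwise Spark falls back to existing paths.
>>>>       void createCodeFunction(Identifier ident, CodeFunctionSpec spec, boolean replace);
>>>>
>>>>       boolean dropCodeFunction(Identifier ident);
>>>>     }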
>>>>
>>>> *Precedence + defaults:*
>>>>
>>>>    - Unqualified: temp/session > built-in/DSv2 > code-literal (current
>>>>      catalog). Qualified names resolve only in the named catalog.
>>>>    - Defaults: feature on, SQL on, Python/pandas off; optional
>>>>      languagePreference.
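>>>>
>>>> As a minimal illustration of the unqualified lookup order only (the
>>>> resolver functions here are stand-ins for the three sources above, not
>>>> real Spark APIs):
>>>>
>>>>     import java.util.List;
>>>>     import java.util.Optional;
>>>>     import java.util.function.Function;
>>>>
>>>>     final class UnqualifiedLookupOrder {
>>>>       // Resolvers are tried in precedence order; the first hit wins, e.g.
>>>>       // resolve(name, List.of(tempOrSession, builtInOrDsv2, codeLiteralInCurrentCatalog))
>>>>       static <F> Optional<F> resolve(String name,
>>>>                                      List<Function<String, Optional<F>>> resolvers) {
>>>>         for (Function<String, Optional<F>> resolver : resolvers) {
>>>>           Optional<F> match = resolver.apply(name);
>>>>           if (match.isPresent()) {
>>>>             return match;
>>>>           }
>>>>         }
>>>>         return Optional.empty();  // no match: resolution fails
>>>>       }
>>>>     }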
>>>>
>>>> Feedback is welcome!
>>>>
>>>> Thanks,
>>>>
>>>> Huaxin
>>>>
>>>
