I like the idea of this a lot. I’ve seen a bunch of hacks at companies to make functions available company-wide, and this seems like a much better way of doing it.
For the requirements option, would it make sense to try and install them
dynamically? (Fail fast seems like the way to start, though.)

Twitter: https://twitter.com/holdenkarau
Fight Health Insurance: https://www.fighthealthinsurance.com/ <https://www.fighthealthinsurance.com/?q=hk_email>
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Pronouns: she/her

On Wed, Jan 28, 2026 at 1:12 PM Szehon Ho <[email protected]> wrote:

> This sounds useful, especially with Iceberg proposals like versioned SQL
> UDFs. On the surface it sounds like we could extend the DSv2 FunctionCatalog
> (which, as you point out, lacks dynamic create/drop function today), but I
> may not know some details. Would like to hear the opinions of others who
> have worked more on functions/UDFs, too.
>
> Thanks!
> Szehon
>
> On Wed, Jan 7, 2026 at 9:32 PM huaxin gao <[email protected]> wrote:
>
>> Hi Wenchen,
>>
>> Great question. In the SPIP, the language runtime is carried in the
>> function spec (for python / python-pandas) so catalogs can optionally
>> declare constraints on the execution environment.
>>
>> Concretely, the spec can include optional fields like:
>>
>>    - pythonVersion (e.g., "3.10")
>>    - requirements (pip-style specs)
>>    - environmentUri (optional pointer to a pre-built / admin-approved
>>    environment)
>>
>> For the initial stage, we assume execution uses the existing PySpark
>> worker environment (same as regular Python UDF / pandas UDF). If
>> pythonVersion / requirements are present, Spark can validate them
>> against the current worker env and fail fast (AnalysisException) if
>> they’re not satisfied.
>>
>> environmentUri is intended as an extension point for future integration
>> (or vendor plugins) to select a vetted environment, but we don’t assume
>> Spark will provision environments out-of-the-box in v1.
>>
>> Thanks,
>>
>> Huaxin
>>
>> On Wed, Jan 7, 2026 at 6:06 PM Wenchen Fan <[email protected]> wrote:
>>
>>> This is a great feature! How do we define the language runtime, e.g.,
>>> the Python version and libraries? Do we assume the Python runtime is the
>>> same as the PySpark worker?
>>>
>>> On Thu, Jan 8, 2026 at 3:12 AM huaxin gao <[email protected]> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I’d like to start a discussion on a draft SPIP
>>>> <https://docs.google.com/document/d/186cTAZxoXp1p8vaSunIaJmVLXcPR-FxSiLiDUl8kK8A/edit?tab=t.0#heading=h.for1fb3tezo3>:
>>>>
>>>> *SPIP: Catalog-backed Code-Literal Functions (SQL and Python) with
>>>> Catalog SPI and CRUD*
>>>>
>>>> *Problem:* Spark can’t load SQL/Python function bodies from external
>>>> catalogs in a standard way today, so users rely on session registration
>>>> or vendor extensions.
>>>>
>>>> *Proposal:*
>>>>
>>>>    - Add CodeLiteralFunctionCatalog (Java SPI) returning CodeFunctionSpec
>>>>    with implementations (spark-sql, python, python-pandas).
>>>>    - Resolution:
>>>>       - SQL: parse + inline (deterministic ⇒ foldable).
>>>>       - Python/pandas: run via the existing Python UDF / pandas UDF
>>>>       runtime (opaque).
>>>>       - SQL TVF: parse to plan, substitute params, validate schema.
>>>>    - DDL: CREATE/REPLACE/DROP FUNCTION delegates to the catalog if it
>>>>    implements the SPI; otherwise fall back.
>>>>
>>>> *Precedence + defaults:*
>>>>
>>>>    - Unqualified: temp/session > built-in/DSv2 > code-literal (current
>>>>    catalog). Qualified names resolve only in the named catalog.
>>>>    - Defaults: feature on, SQL on, Python/pandas off; optional
>>>>    languagePreference.
>>>>
>>>> Feedback is welcome!
>>>>
>>>> Thanks,
>>>>
>>>> Huaxin
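
To make the SPI shape concrete, here is a rough sketch of one way to read the
summary above. Only the two type names (CodeLiteralFunctionCatalog,
CodeFunctionSpec), the implementation languages, and the spec fields listed
upthread come from the SPIP; the method names and everything else are
illustrative guesses, not the proposed API:

    // Illustrative sketch only -- not the actual SPIP interfaces.
    import java.util.List;
    import java.util.Map;
    import java.util.Optional;

    // Catalog plugin that stores and serves function definitions as code literals.
    interface CodeLiteralFunctionCatalog {
      // Look up a function; empty if this catalog doesn't define it.
      Optional<CodeFunctionSpec> loadCodeFunction(String[] namespace, String name);

      // CREATE / REPLACE / DROP FUNCTION would delegate here when the catalog
      // implements the SPI (per the DDL bullet in the proposal).
      void createCodeFunction(String[] namespace, String name,
                              CodeFunctionSpec spec, boolean replace);
      void dropCodeFunction(String[] namespace, String name);
    }

    // One implementation of the function in a particular language/runtime.
    class CodeFunctionImpl {
      String language;            // "spark-sql", "python", or "python-pandas"
      String body;                // the code literal itself
      String pythonVersion;       // optional, e.g. "3.10"
      List<String> requirements;  // optional pip-style specs
      String environmentUri;      // optional pointer to a vetted environment
    }

    // What the catalog returns: a signature plus one or more implementations.
    class CodeFunctionSpec {
      String name;
      String parameterDdl;        // e.g. "x INT, y INT"
      String returnTypeDdl;       // e.g. "INT", or a schema for a SQL TVF
      boolean deterministic;      // SQL + deterministic => eligible for inlining/folding
      List<CodeFunctionImpl> implementations;
      Map<String, String> properties;
    }

A list of implementations (rather than a single body) would let a catalog
publish, say, both a spark-sql and a python body for the same function, with
the optional languagePreference deciding which one Spark resolves.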

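And on the fail-fast question, a minimal sketch of the kind of analysis-time
check I'd imagine, assuming the optional pythonVersion / requirements fields
Huaxin described. The class name, the exception type, and how Spark learns
what is installed on the worker are all assumptions on my part, not part of
the SPIP:

    import java.util.List;
    import java.util.Set;

    final class WorkerRuntimeCheck {
      // installedPackages would have to come from however Spark probes the worker
      // environment (hand-waved here); real matching would also need to parse
      // pip-style version specifiers rather than just checking distribution names.
      static void validate(String declaredPythonVersion,
                           List<String> requirements,
                           String workerPythonVersion,
                           Set<String> installedPackages) {
        if (declaredPythonVersion != null
            && !declaredPythonVersion.equals(workerPythonVersion)) {
          // Stand-in for the AnalysisException mentioned upthread.
          throw new IllegalStateException("Function requires Python "
              + declaredPythonVersion + " but the worker runs " + workerPythonVersion);
        }
        for (String req : requirements) {
          String distName = req.split("[<>=!~\\[; ]")[0];  // "pandas>=2.0" -> "pandas"
          if (!installedPackages.contains(distName)) {
            throw new IllegalStateException(
                "Function requires '" + req + "', which is not installed on the worker");
          }
        }
      }
    }

Dynamic installation could be layered on later behind environmentUri or a
vendor plugin, but starting with a check like this keeps v1 simple and
predictable.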