Re: [Discussion] Versioned SQL UDFs (Catalog routines) in Iceberg

Walaa Eldin Moustafa Tue, 28 May 2024 09:52:39 -0700

I think there is a disconnect about what is perceived as a "UDF". There are
2 flavors:


(1) Functions that are defined by the user whose definition is a
composition of other built-in functions/SQL expressions.
(2) Custom code written as imperative functions against a
Java/Scala/Python API, etc.

All the examples in Ajantha's references are pretty much from (1), and I
think those are more analogous to views because of their SQL nature. I agree
that (2) is not practical for Iceberg to maintain, but I think Ajantha's use
cases are around (1) and may be worth evaluating.
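
To make flavor (1) concrete, here is a minimal Python sketch of what a
view-like, versioned SQL UDF record could look like. This is only an
illustration under my own assumptions, not the proposed spec: the names
SqlUdf, SqlUdfVersion, add_version, and resolve, as well as the dialect
keys "spark" and "trino", are hypothetical. The point is that such a UDF
is stored as SQL text per dialect, with no executable code involved.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class SqlUdfVersion:
    """One version of a flavor-(1) UDF (hypothetical layout, not a spec)."""
    version_id: int
    parameters: Tuple[str, ...]
    representations: Dict[str, str]  # dialect name -> SQL expression text


@dataclass
class SqlUdf:
    """A named UDF whose versions are appended, never rewritten."""
    name: str
    versions: List[SqlUdfVersion] = field(default_factory=list)

    def add_version(self, parameters, representations) -> SqlUdfVersion:
        # Version ids are assigned sequentially, starting at 1.
        version = SqlUdfVersion(
            version_id=len(self.versions) + 1,
            parameters=tuple(parameters),
            representations=dict(representations),
        )
        self.versions.append(version)
        return version

    def resolve(self, dialect: str, version_id: Optional[int] = None) -> str:
        """Return the SQL text for a dialect; defaults to the latest version."""
        if not self.versions:
            raise LookupError(f"{self.name}: no versions defined")
        if version_id is None:
            version = self.versions[-1]
        else:
            matches = [v for v in self.versions if v.version_id == version_id]
            if not matches:
                raise LookupError(f"{self.name}: no version {version_id}")
            version = matches[0]
        if dialect not in version.representations:
            raise LookupError(
                f"{self.name} v{version.version_id}: no {dialect} representation"
            )
        return version.representations[dialect]
```

For example, after add_version(["x"], {"spark": "x + 1", "trino": "x + 1"})
an engine would call resolve("trino") and get SQL text back to inline, much
like it does with a view representation. Flavor (2) has no equivalent here,
since there is no compiled Java/Scala/Python code for the catalog to hold.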

Thanks,
Walaa.


On Tue, May 28, 2024 at 9:45 AM Ajantha Bhat <[email protected]> wrote:

>> I guess we'll know more when you post the proposal, but I think this would
>> be a very difficult area to tackle across engines, languages, and memory
>> models without having a huge performance penalty.
>
> Assuming Iceberg initially supports SQL representations of UDFs (similar
> to views as shared by the reference links above), the complexity involved
> will be similar to managing views.
>
> Thanks, Ryan, Robert, and Jack, for your input.
> We will work on publishing the draft spec (inspired by the view spec) this
> week to facilitate further discussions.
>
> - Ajantha
>
> On Tue, May 28, 2024 at 7:33 PM Jack Ye <[email protected]> wrote:
>
>> > While it would be great to have a common set of functions across
>> > engines, I don't see how that is practical when those engines are
>> > implemented so differently. Plugging in code -- and especially custom
>> > user-supplied code -- seems inherently specialized to me and should be
>> > part of the engines' design.
>>
>> How is this different from views? I feel we can say exactly the same
>> thing for Iceberg views, and yet we have Iceberg multi-dialect views
>> implemented. Maybe it sounds like we are trying to draw a line between SQL
>> and other programming languages as "code", but I think SQL is just another
>> type of code, and we are already talking about compiling all these
>> different code dialects to an intermediate representation (using projects
>> like Coral, Substrait), which will be stored as another type of
>> representation of an Iceberg view. I think the same functionality can be
>> used for UDFs if developed.
>>
>> I actually think adding UDF support is a good idea, even just a
>> multi-dialect one like views. That would allow engines to, for example,
>> parse a view's SQL and, when a referenced function cannot be resolved,
>> look for a multi-dialect UDF definition.
>>
>> I guess we can discuss more when we have the actual proposal published.
>>
>> Best,
>> Jack Ye
>>
>>
>>
>>
>> On Tue, May 28, 2024 at 1:32 AM Robert Stupp <[email protected]> wrote:
>>
>>> UDFs are as engine specific and portable and "non-centralized" as views
>>> are. The same performance concerns apply to views as well.
>>> Iceberg should define a common base upon which engines can build, so the
>>> argument that UDFs aren't practical, because engines are different, is
>>> probably only a temporary concern.
>>>
>>> In the long term, Iceberg should also try to tackle the idea of making
>>> views portable, which is conceptually not that much different from portable
>>> UDFs.
>>>
>>>
>>> PS: I'm not a fan of adding a negative touch to the idea of having UDFs
>>> in Iceberg, especially not in this early stage.
>>>
>>>
>>> On 24.05.24 20:53, Ryan Blue wrote:
>>>
>>> Thanks, Ajantha.
>>>
>>> I'm skeptical about whether it's a good idea to add UDFs tracked by
>>> Iceberg catalogs. I think that Iceberg primarily deals with things that are
>>> centralized, like tables of data. While it would be great to have a common
>>> set of functions across engines, I don't see how that is practical when
>>> those engines are implemented so differently. Plugging in code -- and
>>> especially custom user-supplied code -- seems inherently specialized to me
>>> and should be part of the engines' design.
>>>
>>> I guess we'll know more when you post the proposal, but I think this
>>> would be a very difficult area to tackle across engines, languages, and
>>> memory models without having a huge performance penalty.
>>>
>>> Ryan
>>>
>>> On Fri, May 24, 2024 at 8:10 AM Ajantha Bhat <[email protected]>
>>> wrote:
>>>
>>>> Hi Everyone,
>>>>
>>>> This is a discussion to gauge community interest in storing
>>>> versioned SQL UDFs in Iceberg.
>>>> We want to propose a spec addition for storing versioned UDFs in
>>>> Iceberg (inspired by the view spec).
>>>>
>>>> These UDFs can operate similarly to views in that they are associated
>>>> with tables, but they can accept arguments and produce return values, or
>>>> even function as inline expressions.
>>>> Many query engines, such as Dremio, Trino, Snowflake, and Databricks
>>>> Spark, support SQL UDFs at the catalog level [1].
>>>> But storing them in Iceberg can enable
>>>> - Versioning of these UDFs.
>>>> - Interoperability between the engines. Potentially, engines can
>>>> understand the UDFs written by other engines (with a translation layer).
>>>>
>>>> We believe that integrating this feature into Iceberg would be a
>>>> valuable addition, and we're eager to collaborate with the community to
>>>> develop a UDF specification.
>>>> Stephen <[email protected]> has already begun drafting a
>>>> specification to propose to the community.
>>>>
>>>> Let us know your thoughts on this.
>>>>
>>>> [1]
>>>> Dremio -
>>>> https://docs.dremio.com/current/reference/sql/commands/functions#creating-a-function
>>>> Trino - https://trino.io/docs/current/sql/create-function.html
>>>> Snowflake -
>>>> https://docs.snowflake.com/en/developer-guide/udf/sql/udf-sql-scalar-functions
>>>> Databricks -
>>>> https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-sql-function.html
>>>>
>>>> - Ajantha
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>> --
>>> Robert Stupp
>>> @snazy
>>>
>>>
