Re: [Discussion] Versioned SQL UDFs (Catalog routines) in Iceberg

Ajantha Bhat Mon, 03 Jun 2024 18:56:11 -0700

Hi All,
Please find the proposal here:
https://github.com/apache/iceberg/issues/10432


The Google doc link is attached in the proposal.
Thanks to Stephen Lin <https://github.com/sxlin> for working on it.

I hope it gives enough clarity to make decisions on how we want to
implement it.

- Ajantha

On Wed, May 29, 2024 at 4:01 AM Walaa Eldin Moustafa <wa.moust...@gmail.com>
wrote:

> Thanks Jack. I actually meant scalar/aggregate/table user defined
> functions. Here are some examples of what I meant in (2):
>
> Hive GenericUDF:
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDF.java
> Trino user defined functions:
> https://trino.io/docs/current/develop/functions.html
> Flink user defined functions:
> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/table/functions/udfs/
>
> What you referred to is probably a variation of (1) where the API is a
> data flow/data pipeline API instead of SQL (e.g., Spark Scala). Yes, that
> is also possible in the very long run :)
>
> Thanks,
> Walaa.
>
>
>
>
> On Tue, May 28, 2024 at 2:57 PM Jack Ye <yezhao...@gmail.com> wrote:
>
>> > (2) Custom code written in imperative function according to a
>> Java/Scala/Python API, etc.
>>
>> I think we could still explore some long-term opportunities in this case.
>> Suppose you register a Spark temp view as some sort of data frame read;
>> it could still be resolved to a Spark plan that is representable by an
>> intermediate representation. But I agree this gets very complicated very
>> quickly, and just having case (1) covered would already be a huge step
>> forward.
>>
>> -Jack
>>
>>
>> On Tue, May 28, 2024 at 1:40 PM Benny Chow <btc...@gmail.com> wrote:
>>
>>> It's interesting to note that a tabular SQL UDF can be used to build a
>>> *parameterized* view. So, there's definitely a lot in common between UDFs
>>> and views.
>>>
>>> Thanks
>>>
>>> On Tue, May 28, 2024 at 9:53 AM Walaa Eldin Moustafa <
>>> wa.moust...@gmail.com> wrote:
>>>
>>>> I think there is a disconnect about what is perceived as a "UDF". There
>>>> are 2 flavors:
>>>>
>>>> (1) Functions that are defined by the user whose definition is a
>>>> composition of other built-in functions/SQL expressions.
>>>> (2) Custom code written in imperative function according to a
>>>> Java/Scala/Python API, etc.
>>>>
>>>> All the examples in Ajantha's references are pretty much from (1), and I
>>>> think those are more analogous to views due to their SQL nature. I agree
>>>> (2) is not practical for Iceberg to maintain, but I think Ajantha's use
>>>> cases are around (1), and may be worth evaluating.
>>>>
>>>> Thanks,
>>>> Walaa.
>>>>
>>>>
>>>> On Tue, May 28, 2024 at 9:45 AM Ajantha Bhat <ajanthab...@gmail.com>
>>>> wrote:
>>>>
>>>>>> I guess we'll know more when you post the proposal, but I think this
>>>>>> would be a very difficult area to tackle across engines, languages, and
>>>>>> memory models without having a huge performance penalty.
>>>>>
>>>>> Assuming Iceberg initially supports SQL representations of UDFs
>>>>> (similar to views as shared by the reference links above), the complexity
>>>>> involved will be similar to managing views.
>>>>>
>>>>> Thanks, Ryan, Robert, and Jack, for your input.
>>>>> We will work on publishing the draft spec (inspired by the view spec)
>>>>> this week to facilitate further discussions.
>>>>>
>>>>> - Ajantha
>>>>>
>>>>> On Tue, May 28, 2024 at 7:33 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>
>>>>>> > While it would be great to have a common set of functions across
>>>>>> engines, I don't see how that is practical when those engines are
>>>>>> implemented so differently. Plugging in code -- and especially custom
>>>>>> user-supplied code -- seems inherently specialized to me and should be
>>>>>> part of the engines' design.
>>>>>>
>>>>>> How is this different from views? I feel we can say exactly the
>>>>>> same thing for Iceberg views, and yet we have Iceberg multi-dialect views
>>>>>> implemented. Maybe it sounds like we are trying to draw a line between
>>>>>> SQL and other programming languages as "code"? But I think SQL is just
>>>>>> another type of code, and we are already talking about compiling all these
>>>>>> different code dialects to an intermediate representation (using projects
>>>>>> like Coral, Substrait), which will be stored as another type of
>>>>>> representation of an Iceberg view. I think the same functionality can be
>>>>>> used for UDFs if developed.
>>>>>>
>>>>>> I actually think adding UDF support is a good idea, even just a
>>>>>> multi-dialect one like views. That would allow engines to, for example,
>>>>>> parse a view SQL and, when a referenced function cannot be resolved,
>>>>>> look for a multi-dialect UDF definition.
>>>>>>
>>>>>> I guess we can discuss more when we have the actual proposal
>>>>>> published.
>>>>>>
>>>>>> Best,
>>>>>> Jack Ye
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, May 28, 2024 at 1:32 AM Robert Stupp <sn...@snazy.de> wrote:
>>>>>>
>>>>>>> UDFs are as engine specific and portable and "non-centralized" as
>>>>>>> views are. The same performance concerns apply to views as well.
>>>>>>> Iceberg should define a common base upon which engines can build, so
>>>>>>> the argument that UDFs aren't practical because engines are different
>>>>>>> is probably only a temporary concern.
>>>>>>>
>>>>>>> In the long term, Iceberg should also try to tackle the idea of making
>>>>>>> views portable, which is conceptually not that much different from
>>>>>>> portable UDFs.
>>>>>>>
>>>>>>>
>>>>>>> PS: I'm not a fan of adding a negative touch to the idea of having
>>>>>>> UDFs in Iceberg, especially not in this early stage.
>>>>>>>
>>>>>>>
>>>>>>> On 24.05.24 20:53, Ryan Blue wrote:
>>>>>>>
>>>>>>> Thanks, Ajantha.
>>>>>>>
>>>>>>> I'm skeptical about whether it's a good idea to add UDFs tracked by
>>>>>>> Iceberg catalogs. I think that Iceberg primarily deals with things that
>>>>>>> are centralized, like tables of data. While it would be great to have a
>>>>>>> common set of functions across engines, I don't see how that is practical
>>>>>>> when those engines are implemented so differently. Plugging in code -- and
>>>>>>> especially custom user-supplied code -- seems inherently specialized to
>>>>>>> me and should be part of the engines' design.
>>>>>>>
>>>>>>> I guess we'll know more when you post the proposal, but I think this
>>>>>>> would be a very difficult area to tackle across engines, languages, and
>>>>>>> memory models without having a huge performance penalty.
>>>>>>>
>>>>>>> Ryan
>>>>>>>
>>>>>>> On Fri, May 24, 2024 at 8:10 AM Ajantha Bhat <ajanthab...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Everyone,
>>>>>>>>
>>>>>>>> This is a discussion to gauge community interest in storing
>>>>>>>> versioned SQL UDFs in Iceberg.
>>>>>>>> We want to propose a spec addition for storing versioned UDFs
>>>>>>>> in Iceberg (inspired by the view spec).
>>>>>>>>
>>>>>>>> These UDFs can operate similarly to views in that they are
>>>>>>>> associated with tables, but they can accept arguments and produce
>>>>>>>> return values, or even function as inline expressions.
>>>>>>>> Many query engines like Dremio, Trino, Snowflake, Databricks Spark
>>>>>>>> support SQL UDFs at the catalog level [1].
>>>>>>>> But storing them in Iceberg can enable:
>>>>>>>> - Versioning of these UDFs.
>>>>>>>> - Interoperability between engines. Potentially, engines can
>>>>>>>> understand UDFs written by other engines (with a translation layer).
>>>>>>>> We believe that integrating this feature into Iceberg would be a
>>>>>>>> valuable addition, and we're eager to collaborate with the community to
>>>>>>>> develop a UDF specification.
>>>>>>>> Stephen <stephen....@dremio.com> has already begun drafting a
>>>>>>>> specification to propose to the community.
>>>>>>>>
>>>>>>>> Let us know your thoughts on this.
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> Dremio -
>>>>>>>> https://docs.dremio.com/current/reference/sql/commands/functions#creating-a-function
>>>>>>>> Trino - https://trino.io/docs/current/sql/create-function.html
>>>>>>>> Snowflake -
>>>>>>>> https://docs.snowflake.com/en/developer-guide/udf/sql/udf-sql-scalar-functions
>>>>>>>> Databricks -
>>>>>>>> https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-sql-function.html
>>>>>>>>
>>>>>>>> - Ajantha
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Tabular
>>>>>>>
>>>>>>> --
>>>>>>> Robert Stupp
>>>>>>> @snazy
>>>>>>>
>>>>>>>
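
A minimal Python sketch of the fallback Jack describes in the thread above: an engine first tries its own built-ins and, when a referenced function cannot be resolved, looks for a multi-dialect UDF definition in the catalog. All names here (UdfVersion, UdfMetadata, resolve_function, the field names, and the sample area_sqft function) are hypothetical and are not taken from the Iceberg view spec or from the proposal. The versioned, per-dialect layout is only an assumption, modeled loosely on how the thread compares UDFs to multi-dialect views.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class UdfVersion:
    version_id: int
    # Hypothetical layout: one SQL body per dialect name.
    representations: Dict[str, str]


@dataclass
class UdfMetadata:
    name: str
    current_version_id: int
    versions: List[UdfVersion] = field(default_factory=list)

    def current(self) -> UdfVersion:
        # Pick the version the metadata marks as current.
        return next(v for v in self.versions
                    if v.version_id == self.current_version_id)


class FunctionNotFound(Exception):
    pass


def resolve_function(name: str, dialect: str, builtins: Set[str],
                     catalog: Dict[str, UdfMetadata]) -> Tuple[str, str]:
    """Resolve against engine built-ins first, then fall back to the catalog."""
    if name in builtins:
        return ("builtin", name)
    udf = catalog.get(name)
    if udf is None:
        raise FunctionNotFound(name)
    sql = udf.current().representations.get(dialect)
    if sql is None:
        # A translation layer could be tried here; this sketch just fails.
        raise FunctionNotFound(f"{name} has no {dialect} representation")
    return ("catalog_udf", sql)


# Illustrative catalog with one UDF that has two versions.
catalog = {
    "area_sqft": UdfMetadata(
        name="area_sqft",
        current_version_id=2,
        versions=[
            UdfVersion(1, {"trino": "RETURN w * h"}),
            UdfVersion(2, {"trino": "RETURN w * h * 10.7639",
                           "spark": "RETURN w * h * 10.7639"}),
        ],
    )
}
engine_builtins = {"upper", "lower", "abs"}

kind, body = resolve_function("area_sqft", "spark", engine_builtins, catalog)
```

Keeping older versions in the metadata is what would make rollback or time travel on a function definition possible; whether the actual spec does this is for the proposal to decide.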
