Re: [Discussion] Versioned SQL UDFs (Catalog routines) in Iceberg

Ajantha Bhat Mon, 24 Jun 2024 03:52:30 -0700

Hi everyone,

We've only received one review so far (from Benny).

We would appreciate more eyes on this.

- Ajantha

On Tue, Jun 4, 2024 at 7:25 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:

> Hi All,
> Please find the proposal here:
> https://github.com/apache/iceberg/issues/10432
>
> The Google doc is linked from the proposal.
> Thanks to Stephen Lin <https://github.com/sxlin> for working on it.
>
> I hope it gives enough clarity to decide on the approach and how we want to
> implement it.
>
> - Ajantha
>
> On Wed, May 29, 2024 at 4:01 AM Walaa Eldin Moustafa <
> wa.moust...@gmail.com> wrote:
>
>> Thanks Jack. I actually meant scalar/aggregate/table user defined
>> functions. Here are some examples of what I meant in (2):
>>
>> Hive GenericUDF:
>> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDF.java
>> Trino user defined functions:
>> https://trino.io/docs/current/develop/functions.html
>> Flink user defined functions:
>> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/table/functions/udfs/
>>
>> Probably what you referred to is a variation of (1) where the API is a data
>> flow/data pipeline API instead of SQL (e.g., Spark Scala). Yes, that is
>> also possible in the very long run :)
>>
>> Thanks,
>> Walaa.
>>
>>
>>
>>
>> On Tue, May 28, 2024 at 2:57 PM Jack Ye <yezhao...@gmail.com> wrote:
>>
>>> > (2) Custom code written as imperative functions against a
>>> Java/Scala/Python API, etc.
>>>
>>> I think we could still explore some long-term opportunities in this
>>> case. Suppose you register a Spark temp view backed by some sort of data
>>> frame read; it could still be resolved to a Spark plan that is representable
>>> by an intermediate representation. But I agree this gets very complicated
>>> very quickly, and just covering case (1) would already be a huge
>>> step forward.
>>>
>>> -Jack
>>>
>>>
>>> On Tue, May 28, 2024 at 1:40 PM Benny Chow <btc...@gmail.com> wrote:
>>>
>>>> It's interesting to note that a tabular SQL UDF can be used to build a
>>>> *parameterized* view. So, there's definitely a lot in common between UDFs
>>>> and views.
>>>>
>>>> Thanks
>>>>
>>>> On Tue, May 28, 2024 at 9:53 AM Walaa Eldin Moustafa <
>>>> wa.moust...@gmail.com> wrote:
>>>>
>>>>> I think there is a disconnect about what is perceived as a "UDF".
>>>>> There are 2 flavors:
>>>>>
>>>>> (1) Functions that are defined by the user whose definition is a
>>>>> composition of other built-in functions/SQL expressions.
>>>>> (2) Custom code written as imperative functions against a
>>>>> Java/Scala/Python API, etc.
>>>>>
>>>>> All the examples in Ajantha's references are pretty much from (1), and
>>>>> I think those are more analogous to views due to their SQL nature. I agree
>>>>> that (2) is not practical for Iceberg to maintain, but I think Ajantha's
>>>>> use cases are around (1) and may be worth evaluating.
>>>>>
>>>>> Thanks,
>>>>> Walaa.
>>>>>
>>>>>
>>>>> On Tue, May 28, 2024 at 9:45 AM Ajantha Bhat <ajanthab...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>>> I guess we'll know more when you post the proposal, but I think this
>>>>>>> would be a very difficult area to tackle across engines, languages, and
>>>>>>> memory models without having a huge performance penalty.
>>>>>>
>>>>>> Assuming Iceberg initially supports SQL representations of UDFs
>>>>>> (similar to views as shared by the reference links above), the complexity
>>>>>> involved will be similar to managing views.
>>>>>>
>>>>>> Thanks, Ryan, Robert, and Jack, for your input.
>>>>>> We will work on publishing the draft spec (inspired by the view spec)
>>>>>> this week to facilitate further discussions.
>>>>>>
>>>>>> - Ajantha
>>>>>>
>>>>>> On Tue, May 28, 2024 at 7:33 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>
>>>>>>> > While it would be great to have a common set of functions across
>>>>>>> engines, I don't see how that is practical when those engines are
>>>>>>> implemented so differently. Plugging in code -- and especially custom
>>>>>>> user-supplied code -- seems inherently specialized to me and should be
>>>>>>> part of the engines' design.
>>>>>>>
>>>>>>> How is this different from views? I feel we can say exactly the same
>>>>>>> thing about Iceberg views, and yet we have Iceberg multi-dialect views
>>>>>>> implemented. Maybe it sounds like we are trying to draw a line between
>>>>>>> SQL and other programming languages as "code"? But I think SQL is just
>>>>>>> another type of code, and we are already talking about compiling all
>>>>>>> these different code dialects to an intermediate representation (using
>>>>>>> projects like Coral and Substrait), which will be stored as another type
>>>>>>> of representation of an Iceberg view. I think the same functionality can
>>>>>>> be used for UDFs if developed.
>>>>>>>
>>>>>>> I actually think adding UDF support is a good idea, even just a
>>>>>>> multi-dialect one like views. That would allow engines to, for example,
>>>>>>> parse a view's SQL and, when a referenced function cannot be resolved,
>>>>>>> look for a multi-dialect UDF definition.
>>>>>>>
>>>>>>> I guess we can discuss more when we have the actual proposal
>>>>>>> published.
>>>>>>>
>>>>>>> Best,
>>>>>>> Jack Ye
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, May 28, 2024 at 1:32 AM Robert Stupp <sn...@snazy.de> wrote:
>>>>>>>
>>>>>>>> UDFs are as engine specific and portable and "non-centralized" as
>>>>>>>> views are. The same performance concerns apply to views as well.
>>>>>>>> Iceberg should define a common base upon which engines can build,
>>>>>>>> so the argument that UDFs aren't practical because engines are
>>>>>>>> different is probably only a temporary concern.
>>>>>>>>
>>>>>>>> In the long term, Iceberg should also try to tackle making views
>>>>>>>> portable, which is conceptually not that different from
>>>>>>>> portable UDFs.
>>>>>>>>
>>>>>>>>
>>>>>>>> PS: I'm not a fan of adding a negative touch to the idea of having
>>>>>>>> UDFs in Iceberg, especially not in this early stage.
>>>>>>>>
>>>>>>>>
>>>>>>>> On 24.05.24 20:53, Ryan Blue wrote:
>>>>>>>>
>>>>>>>> Thanks, Ajantha.
>>>>>>>>
>>>>>>>> I'm skeptical about whether it's a good idea to add UDFs tracked by
>>>>>>>> Iceberg catalogs. I think that Iceberg primarily deals with things that
>>>>>>>> are centralized, like tables of data. While it would be great to have a
>>>>>>>> common set of functions across engines, I don't see how that is practical
>>>>>>>> when those engines are implemented so differently. Plugging in code -- and
>>>>>>>> especially custom user-supplied code -- seems inherently specialized to me
>>>>>>>> and should be part of the engines' design.
>>>>>>>>
>>>>>>>> I guess we'll know more when you post the proposal, but I think
>>>>>>>> this would be a very difficult area to tackle across engines, languages,
>>>>>>>> and memory models without having a huge performance penalty.
>>>>>>>>
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>> On Fri, May 24, 2024 at 8:10 AM Ajantha Bhat <ajanthab...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Everyone,
>>>>>>>>>
>>>>>>>>> This is a discussion to gauge community interest in storing
>>>>>>>>> versioned SQL UDFs in Iceberg.
>>>>>>>>> We want to propose a spec addition for storing versioned
>>>>>>>>> UDFs in Iceberg (inspired by the view spec).
>>>>>>>>>
>>>>>>>>> These UDFs can operate similarly to views in that they are
>>>>>>>>> associated with tables, but they can accept arguments and produce return
>>>>>>>>> values, or even function as inline expressions.
>>>>>>>>> Many query engines, such as Dremio, Trino, Snowflake, and Databricks
>>>>>>>>> Spark, support SQL UDFs at the catalog level [1].
>>>>>>>>> But storing them in Iceberg can enable:
>>>>>>>>> - Versioning of these UDFs.
>>>>>>>>> - Interoperability between engines. Engines can potentially
>>>>>>>>> understand UDFs written by other engines (with a translation layer).
>>>>>>>>>
>>>>>>>>> We believe that integrating this feature into Iceberg would be a
>>>>>>>>> valuable addition, and we're eager to collaborate with the community to
>>>>>>>>> develop a UDF specification.
>>>>>>>>> Stephen <stephen....@dremio.com> has already begun drafting a
>>>>>>>>> specification to propose to the community.
>>>>>>>>>
>>>>>>>>> Let us know your thoughts on this.
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> Dremio -
>>>>>>>>> https://docs.dremio.com/current/reference/sql/commands/functions#creating-a-function
>>>>>>>>> Trino - https://trino.io/docs/current/sql/create-function.html
>>>>>>>>> Snowflake -
>>>>>>>>> https://docs.snowflake.com/en/developer-guide/udf/sql/udf-sql-scalar-functions
>>>>>>>>> Databricks -
>>>>>>>>> https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-sql-function.html
>>>>>>>>>
>>>>>>>>> - Ajantha
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Tabular
>>>>>>>>
>>>>>>>> --
>>>>>>>> Robert Stupp
>>>>>>>> @snazy
>>>>>>>>
>>>>>>>>
