Re: [Discussion] Versioned SQL UDFs (Catalog routines) in Iceberg

Ajantha Bhat Mon, 15 Jul 2024 23:09:38 -0700

Hi, just another reminder, since we haven't received any reviews on the proposal.
It was initially proposed on June 4.


- Ajantha

On Mon, Jun 24, 2024 at 4:21 PM Ajantha Bhat <ajanthab...@gmail.com> wrote:

> Hi everyone,
>
> We've only received one review so far (from Benny).
>
> We would appreciate more eyes on this.
>
> - Ajantha
>
> On Tue, Jun 4, 2024 at 7:25 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>
>> Hi All,
>> Please find the proposal link
>> https://github.com/apache/iceberg/issues/10432
>>
>> Google doc link is attached in the proposal.
>> And Thanks Stephen Lin <https://github.com/sxlin> for working on it.
>>
>> Hope it gives more clarity for making decisions and on how we want to
>> implement it.
>>
>> - Ajantha
>>
>> On Wed, May 29, 2024 at 4:01 AM Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>>
>>> Thanks Jack. I actually meant scalar/aggregate/table user defined
>>> functions. Here are some examples of what I meant in (2):
>>>
>>> Hive GenericUDF:
>>> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDF.java
>>> Trino user defined functions:
>>> https://trino.io/docs/current/develop/functions.html
>>> Flink user defined functions:
>>> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/table/functions/udfs/
>>>
>>> Probably what you referred to is a variation of (1) where the API is
>>> data flow/data pipeline API instead of SQL (e.g., Spark Scala). Yes, that
>>> is also possible in the very long run :)
>>>
>>> Thanks,
>>> Walaa.
>>>
>>>
>>>
>>>
>>> On Tue, May 28, 2024 at 2:57 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>>> > (2) Custom code written in imperative function according to a
>>>> Java/Scala/Python API, etc.
>>>>
>>>> I think we could still explore some long term opportunities in this
>>>> case. Consider you register a Spark temp view as some sort of data frame
>>>> read, then it could still be resolved to a Spark plan that is representable
>>>> by an intermediate representation. But I agree this gets very complicated
>>>> very soon, and just having the case (1) covered would already be a huge
>>>> step forward.
>>>>
>>>> -Jack
>>>>
>>>>
>>>> On Tue, May 28, 2024 at 1:40 PM Benny Chow <btc...@gmail.com> wrote:
>>>>
>>>>> It's interesting to note that a tabular SQL UDF can be used to build a
>>>>> *parameterized* view.  So, there's definitely a lot in common between UDFs
>>>>> and views.
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Tue, May 28, 2024 at 9:53 AM Walaa Eldin Moustafa <
>>>>> wa.moust...@gmail.com> wrote:
>>>>>
>>>>>> I think there is a disconnect about what is perceived as a "UDF".
>>>>>> There are 2 flavors:
>>>>>>
>>>>>> (1) Functions that are defined by the user whose definition is a
>>>>>> composition of other built-in functions/SQL expressions.
>>>>>> (2) Custom code written in imperative function according to a
>>>>>> Java/Scala/Python API, etc.
>>>>>>
>>>>>> All the examples in Ajantha's references are pretty much from (1), and
>>>>>> I think those are more analogous to views due to their SQL nature. I agree
>>>>>> that (2) is not practical for Iceberg to maintain, but I think Ajantha's
>>>>>> use cases are around (1), and may be worth evaluating.
>>>>>>
>>>>>> Thanks,
>>>>>> Walaa.
>>>>>>
>>>>>>
>>>>>> On Tue, May 28, 2024 at 9:45 AM Ajantha Bhat <ajanthab...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>>> I guess we'll know more when you post the proposal, but I think this
>>>>>>>> would be a very difficult area to tackle across engines, languages, and
>>>>>>>> memory models without having a huge performance penalty.
>>>>>>>
>>>>>>> Assuming Iceberg initially supports SQL representations of UDFs
>>>>>>> (similar to views as shared by the reference links above), the
>>>>>>> complexity involved will be similar to managing views.
>>>>>>>
>>>>>>> Thanks, Ryan, Robert, and Jack, for your input.
>>>>>>> We will work on publishing the draft spec (inspired by the view
>>>>>>> spec) this week to facilitate further discussions.
>>>>>>>
>>>>>>> - Ajantha
>>>>>>>
>>>>>>> On Tue, May 28, 2024 at 7:33 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>>
>>>>>>>> > While it would be great to have a common set of functions across
>>>>>>>> engines, I don't see how that is practical when those engines are
>>>>>>>> implemented so differently. Plugging in code -- and especially custom
>>>>>>>> user-supplied code -- seems inherently specialized to me and should be
>>>>>>>> part of the engines' design.
>>>>>>>>
>>>>>>>> How is this different from views? I feel we can say exactly the same
>>>>>>>> thing for Iceberg views, and yet we have Iceberg multi-dialect views
>>>>>>>> implemented. Maybe it sounds like we are trying to draw a line between
>>>>>>>> SQL and other programming languages as "code"? But I think SQL is just
>>>>>>>> another type of code, and we are already talking about compiling all
>>>>>>>> these different code dialects to an intermediate representation (using
>>>>>>>> projects like Coral, Substrait), which will be stored as another type of
>>>>>>>> representation of an Iceberg view. I think the same functionality can be
>>>>>>>> used for UDFs if developed.
>>>>>>>>
>>>>>>>> I actually think adding UDF support is a good idea, even just a
>>>>>>>> multi-dialect one like views, and that can allow engines to, for
>>>>>>>> example, parse a view SQL and, when a referenced function cannot be
>>>>>>>> resolved, try to look up a multi-dialect UDF definition.
>>>>>>>>
>>>>>>>> I guess we can discuss more when we have the actual proposal
>>>>>>>> published.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Jack Ye
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, May 28, 2024 at 1:32 AM Robert Stupp <sn...@snazy.de>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> UDFs are as engine specific and portable and "non-centralized" as
>>>>>>>>> views are. The same performance concerns apply to views as well.
>>>>>>>>> Iceberg should define a common base upon which engines can build,
>>>>>>>>> so the argument that UDFs aren't practical, because engines are
>>>>>>>>> different, is probably only a temporary concern.
>>>>>>>>>
>>>>>>>>> In the long term, Iceberg should also try to tackle the idea to
>>>>>>>>> make views portable, which is conceptually not that much different
>>>>>>>>> from portable UDFs.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> PS: I'm not a fan of adding a negative touch to the idea of having
>>>>>>>>> UDFs in Iceberg, especially not in this early stage.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 24.05.24 20:53, Ryan Blue wrote:
>>>>>>>>>
>>>>>>>>> Thanks, Ajantha.
>>>>>>>>>
>>>>>>>>> I'm skeptical about whether it's a good idea to add UDFs tracked
>>>>>>>>> by Iceberg catalogs. I think that Iceberg primarily deals with things
>>>>>>>>> that are centralized, like tables of data. While it would be great to
>>>>>>>>> have a common set of functions across engines, I don't see how that is
>>>>>>>>> practical when those engines are implemented so differently. Plugging
>>>>>>>>> in code -- and especially custom user-supplied code -- seems inherently
>>>>>>>>> specialized to me and should be part of the engines' design.
>>>>>>>>>
>>>>>>>>> I guess we'll know more when you post the proposal, but I think
>>>>>>>>> this would be a very difficult area to tackle across engines,
>>>>>>>>> languages, and memory models without having a huge performance penalty.
>>>>>>>>>
>>>>>>>>> Ryan
>>>>>>>>>
>>>>>>>>> On Fri, May 24, 2024 at 8:10 AM Ajantha Bhat <
>>>>>>>>> ajanthab...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Everyone,
>>>>>>>>>>
>>>>>>>>>> This is a discussion to gauge the community interest in storing
>>>>>>>>>> the Versioned SQL UDFs in Iceberg.
>>>>>>>>>> We want to propose the spec addition for storing the versioned
>>>>>>>>>> UDFs in Iceberg (inspired by view spec).
>>>>>>>>>>
>>>>>>>>>> These UDFs can operate similarly to views in that they are
>>>>>>>>>> associated with tables, but they can accept arguments and produce
>>>>>>>>>> return values, or even function as inline expressions.
>>>>>>>>>> Many query engines, such as Dremio, Trino, Snowflake, and Databricks
>>>>>>>>>> Spark, support SQL UDFs at the catalog level [1].
>>>>>>>>>> But storing them in Iceberg can enable
>>>>>>>>>> - Versioning of these UDFs.
>>>>>>>>>> - Interoperability between the engines. Potentially, engines can
>>>>>>>>>> understand the UDFs written by other engines (with a translation
>>>>>>>>>> layer).
>>>>>>>>>>
>>>>>>>>>> We believe that integrating this feature into Iceberg would be a
>>>>>>>>>> valuable addition, and we're eager to collaborate with the community
>>>>>>>>>> to develop a UDF specification.
>>>>>>>>>> Stephen <stephen....@dremio.com> has already begun drafting a
>>>>>>>>>> specification to propose to the community.
>>>>>>>>>>
>>>>>>>>>> Let us know your thoughts on this.
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>> Dremio -
>>>>>>>>>> https://docs.dremio.com/current/reference/sql/commands/functions#creating-a-function
>>>>>>>>>> Trino - https://trino.io/docs/current/sql/create-function.html
>>>>>>>>>> Snowflake -
>>>>>>>>>> https://docs.snowflake.com/en/developer-guide/udf/sql/udf-sql-scalar-functions
>>>>>>>>>> Databricks -
>>>>>>>>>> https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-sql-function.html
>>>>>>>>>>
>>>>>>>>>> - Ajantha
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Tabular
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Robert Stupp
>>>>>>>>> @snazy
>>>>>>>>>
>>>>>>>>>
