Re: [Discussion] Versioned SQL UDFs (Catalog routines) in Iceberg

Ajantha Bhat Thu, 01 Aug 2024 03:15:37 -0700

Thanks Walaa and Robert for the review on this.

We didn't find any blocker for the spec.
I will wait for a week, and if there are no more review comments, I will raise a PR
for the spec addition next week.


If anyone else is interested, please have a look at the proposal
https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit

- Ajantha

On Tue, Jul 16, 2024 at 1:27 PM Walaa Eldin Moustafa <[email protected]>
wrote:

> Hi Ajantha,
>
> I have left some comments. It is an interesting direction, but there might
> be some details that need to be fine tuned.
>
> The doc is here [1] for others who might be interested. Resharing since I
> do not think it was directly linked in the thread.
>
> [1]
> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit
>
> Thanks,
> Walaa.
>
> On Mon, Jul 15, 2024 at 11:09 PM Ajantha Bhat <[email protected]>
> wrote:
>
>> Hi, just another reminder since we didn't get any review on the proposal.
>> Initially proposed on June 4.
>>
>> - Ajantha
>>
>> On Mon, Jun 24, 2024 at 4:21 PM Ajantha Bhat <[email protected]>
>> wrote:
>>
>>> Hi everyone,
>>>
>>> We've only received one review so far (from Benny).
>>>
>>> We would appreciate more eyes on this.
>>>
>>> - Ajantha
>>>
>>> On Tue, Jun 4, 2024 at 7:25 AM Ajantha Bhat <[email protected]>
>>> wrote:
>>>
>>>> Hi All,
>>>> Please find the proposal link
>>>> https://github.com/apache/iceberg/issues/10432
>>>>
>>>> Google doc link is attached in the proposal.
>>>> And thanks to Stephen Lin <https://github.com/sxlin> for working on it.
>>>>
>>>> I hope it gives more clarity for making decisions and for how we want to
>>>> implement it.
>>>>
>>>> - Ajantha
>>>>
>>>> On Wed, May 29, 2024 at 4:01 AM Walaa Eldin Moustafa <
>>>> [email protected]> wrote:
>>>>
>>>>> Thanks Jack. I actually meant scalar/aggregate/table user defined
>>>>> functions. Here are some examples of what I meant in (2):
>>>>>
>>>>> Hive GenericUDF:
>>>>> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDF.java
>>>>> Trino user defined functions:
>>>>> https://trino.io/docs/current/develop/functions.html
>>>>> Flink user defined functions:
>>>>> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/table/functions/udfs/
>>>>>
>>>>> Probably what you referred to is a variation of (1) where the API is
>>>>> data flow/data pipeline API instead of SQL (e.g., Spark Scala). Yes, that
>>>>> is also possible in the very long run :)
>>>>>
>>>>> Thanks,
>>>>> Walaa.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, May 28, 2024 at 2:57 PM Jack Ye <[email protected]> wrote:
>>>>>
>>>>>> > (2) Custom code written in imperative function according to a
>>>>>> Java/Scala/Python API, etc.
>>>>>>
>>>>>> I think we could still explore some long term opportunities in this
>>>>>> case. Consider you register a Spark temp view as some sort of data frame
>>>>>> read, then it could still be resolved to a Spark plan that is
>>>>>> representable by an intermediate representation. But I agree this gets
>>>>>> very complicated very soon, and just having the case (1) covered would
>>>>>> already be a huge step forward.
>>>>>>
>>>>>> -Jack
>>>>>>
>>>>>>
>>>>>> On Tue, May 28, 2024 at 1:40 PM Benny Chow <[email protected]> wrote:
>>>>>>
>>>>>>> It's interesting to note that a tabular SQL UDF can be used to build
>>>>>>> a *parameterized* view. So, there's definitely a lot in common
>>>>>>> between UDFs and views.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> On Tue, May 28, 2024 at 9:53 AM Walaa Eldin Moustafa <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> I think there is a disconnect about what is perceived as a "UDF".
>>>>>>>> There are 2 flavors:
>>>>>>>>
>>>>>>>> (1) Functions that are defined by the user whose definition is a
>>>>>>>> composition of other built-in functions/SQL expressions.
>>>>>>>> (2) Custom code written in imperative function according to a
>>>>>>>> Java/Scala/Python API, etc.
>>>>>>>>
>>>>>>>> All the examples in Ajantha's references are pretty much from (1),
>>>>>>>> and I think those are more analogous to views due to their SQL nature.
>>>>>>>> I agree (2) is not practical for Iceberg to maintain, but I think
>>>>>>>> Ajantha's use cases are around (1), and those may be worth evaluating.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Walaa.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, May 28, 2024 at 9:45 AM Ajantha Bhat <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>> I guess we'll know more when you post the proposal, but I think
>>>>>>>>>> this would be a very difficult area to tackle across engines,
>>>>>>>>>> languages, and memory models without having a huge performance
>>>>>>>>>> penalty.
>>>>>>>>>
>>>>>>>>> Assuming Iceberg initially supports SQL representations of UDFs
>>>>>>>>> (similar to views as shared by the reference links above), the
>>>>>>>>> complexity involved will be similar to managing views.
>>>>>>>>>
>>>>>>>>> Thanks, Ryan, Robert, and Jack, for your input.
>>>>>>>>> We will work on publishing the draft spec (inspired by the view
>>>>>>>>> spec) this week to facilitate further discussions.
>>>>>>>>>
>>>>>>>>> - Ajantha
>>>>>>>>>
>>>>>>>>> On Tue, May 28, 2024 at 7:33 PM Jack Ye <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> > While it would be great to have a common set of functions
>>>>>>>>>> across engines, I don't see how that is practical when those engines
>>>>>>>>>> are implemented so differently. Plugging in code -- and especially
>>>>>>>>>> custom user-supplied code -- seems inherently specialized to me and
>>>>>>>>>> should be part of the engines' design.
>>>>>>>>>>
>>>>>>>>>> How is this different from the views? I feel we can say exactly
>>>>>>>>>> the same thing for Iceberg views, and yet we have Iceberg
>>>>>>>>>> multi-dialect views implemented. Maybe it sounds like we are trying
>>>>>>>>>> to draw a line between SQL and other programming languages as "code"?
>>>>>>>>>> But I think SQL is just another type of code, and we are already
>>>>>>>>>> talking about compiling all these different code dialects to an
>>>>>>>>>> intermediate representation (using projects like Coral, Substrait),
>>>>>>>>>> which will be stored as another type of representation of an Iceberg
>>>>>>>>>> view. I think the same functionality can be used for UDFs if developed.
>>>>>>>>>>
>>>>>>>>>> I actually think adding UDF support is a good idea, even just a
>>>>>>>>>> multi-dialect one like views, and that can allow engines to, for
>>>>>>>>>> example, parse a view SQL and, when a referenced function cannot be
>>>>>>>>>> resolved, try to look for a multi-dialect UDF definition.
>>>>>>>>>>
>>>>>>>>>> I guess we can discuss more when we have the actual proposal
>>>>>>>>>> published.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Jack Ye
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, May 28, 2024 at 1:32 AM Robert Stupp <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> UDFs are as engine specific and portable and "non-centralized"
>>>>>>>>>>> as views are. The same performance concerns apply to views as well.
>>>>>>>>>>> Iceberg should define a common base upon which engines can
>>>>>>>>>>> build, so the argument that UDFs aren't practical, because engines
>>>>>>>>>>> are different, is probably only a temporary concern.
>>>>>>>>>>>
>>>>>>>>>>> In the long term, Iceberg should also try to tackle the idea of
>>>>>>>>>>> making views portable, which is conceptually not that much different
>>>>>>>>>>> from portable UDFs.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> PS: I'm not a fan of adding a negative touch to the idea of
>>>>>>>>>>> having UDFs in Iceberg, especially not in this early stage.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 24.05.24 20:53, Ryan Blue wrote:
>>>>>>>>>>>
>>>>>>>>>>> Thanks, Ajantha.
>>>>>>>>>>>
>>>>>>>>>>> I'm skeptical about whether it's a good idea to add UDFs tracked
>>>>>>>>>>> by Iceberg catalogs. I think that Iceberg primarily deals with
>>>>>>>>>>> things that are centralized, like tables of data. While it would be
>>>>>>>>>>> great to have a common set of functions across engines, I don't see
>>>>>>>>>>> how that is practical when those engines are implemented so
>>>>>>>>>>> differently. Plugging in code -- and especially custom user-supplied
>>>>>>>>>>> code -- seems inherently specialized to me and should be part of the
>>>>>>>>>>> engines' design.
>>>>>>>>>>>
>>>>>>>>>>> I guess we'll know more when you post the proposal, but I think
>>>>>>>>>>> this would be a very difficult area to tackle across engines,
>>>>>>>>>>> languages, and memory models without having a huge performance
>>>>>>>>>>> penalty.
>>>>>>>>>>>
>>>>>>>>>>> Ryan
>>>>>>>>>>>
>>>>>>>>>>> On Fri, May 24, 2024 at 8:10 AM Ajantha Bhat <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Everyone,
>>>>>>>>>>>>
>>>>>>>>>>>> This is a discussion to gauge the community interest in storing
>>>>>>>>>>>> versioned SQL UDFs in Iceberg.
>>>>>>>>>>>> We want to propose a spec addition for storing versioned
>>>>>>>>>>>> UDFs in Iceberg (inspired by the view spec).
>>>>>>>>>>>>
>>>>>>>>>>>> These UDFs can operate similarly to views in that they are
>>>>>>>>>>>> associated with tables, but they can accept arguments and produce
>>>>>>>>>>>> return values, or even function as inline expressions.
>>>>>>>>>>>> Many query engines, such as Dremio, Trino, Snowflake, and Databricks
>>>>>>>>>>>> Spark, support SQL UDFs at the catalog level [1].
>>>>>>>>>>>> But storing them in Iceberg can enable:
>>>>>>>>>>>> - Versioning of these UDFs.
>>>>>>>>>>>> - Interoperability between the engines. Potentially, engines can
>>>>>>>>>>>> understand the UDFs written by other engines (with a translation
>>>>>>>>>>>> layer).
>>>>>>>>>>>>
>>>>>>>>>>>> We believe that integrating this feature into Iceberg would be
>>>>>>>>>>>> a valuable addition, and we're eager to collaborate with the
>>>>>>>>>>>> community to develop a UDF specification.
>>>>>>>>>>>> Stephen <[email protected]> has already begun drafting a
>>>>>>>>>>>> specification to propose to the community.
>>>>>>>>>>>>
>>>>>>>>>>>> Let us know your thoughts on this.
>>>>>>>>>>>>
>>>>>>>>>>>> [1]
>>>>>>>>>>>> Dremio -
>>>>>>>>>>>> https://docs.dremio.com/current/reference/sql/commands/functions#creating-a-function
>>>>>>>>>>>> Trino - https://trino.io/docs/current/sql/create-function.html
>>>>>>>>>>>> Snowflake -
>>>>>>>>>>>> https://docs.snowflake.com/en/developer-guide/udf/sql/udf-sql-scalar-functions
>>>>>>>>>>>> Databricks -
>>>>>>>>>>>> https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-sql-function.html
>>>>>>>>>>>>
>>>>>>>>>>>> - Ajantha
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Ryan Blue
>>>>>>>>>>> Tabular
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Robert Stupp
>>>>>>>>>>> @snazy
>>>>>>>>>>>
>>>>>>>>>>>
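
For readers following the thread, here is a minimal Python sketch of two ideas
raised above: a catalog entry that keeps versioned, multi-dialect SQL
representations of a UDF (by analogy with multi-dialect views), and the fallback
Jack describes, where an engine looks for a catalog UDF only when it cannot
resolve a function itself. Every name here (SqlRepresentation, CatalogUdf,
resolve_function, the dialect strings) is hypothetical and chosen for
illustration; none of it is taken from the proposal document or from any
Iceberg API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class SqlRepresentation:
    """One SQL body for a UDF, written in a specific engine dialect (hypothetical)."""
    dialect: str  # e.g. "trino" or "spark"; illustrative values only
    body: str     # SQL expression text


@dataclass
class CatalogUdf:
    """A UDF tracked by a catalog, with an append-only version history (hypothetical)."""
    name: str
    versions: Dict[int, List[SqlRepresentation]] = field(default_factory=dict)
    current_version_id: Optional[int] = None

    def add_version(self, representations: List[SqlRepresentation]) -> int:
        # Each change creates a new version; older versions stay readable.
        next_id = max(self.versions, default=0) + 1
        self.versions[next_id] = list(representations)
        self.current_version_id = next_id
        return next_id

    def resolve(self, dialect: str, version_id: Optional[int] = None) -> Optional[str]:
        # Pick the requested version (default: current), then match the dialect.
        vid = self.current_version_id if version_id is None else version_id
        for rep in self.versions.get(vid, []):
            if rep.dialect == dialect:
                return rep.body
        return None


def resolve_function(name: str, dialect: str,
                     engine_functions: Dict[str, str],
                     catalog: Dict[str, CatalogUdf]) -> Optional[str]:
    """Engine-first lookup with a fallback to a catalog UDF definition."""
    if name in engine_functions:
        return engine_functions[name]
    udf = catalog.get(name)
    return udf.resolve(dialect) if udf is not None else None
```

In this sketch a dialect with no stored representation simply resolves to
nothing; a translation layer of the kind mentioned in the thread would plug in
at that point instead of returning None.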
