Re: [Discussion] Versioned SQL UDFs (Catalog routines) in Iceberg

Walaa Eldin Moustafa Wed, 07 Aug 2024 12:14:17 -0700

Piotr, what do you mean by making user-created functions shareable
between engines? Do you mean UDFs written in imperative code?


On Wed, Aug 7, 2024 at 12:00 PM Piotr Findeisen
<[email protected]> wrote:
>
> Hi,
>
> Thank you Ajantha for creating this thread. The Iceberg UDFs are an 
> interesting idea!
> Is there a plan to make the user-created functions sharable between the 
> engines?
> If so, how would a CREATE FUNCTION statement look like in e..g Spark or Trino?
>
> Meanwhile, added a few comments in the doc.
>
> Best
> Piotr
>
>
> On Thu, 1 Aug 2024 at 20:50, Ryan Blue <[email protected]> wrote:
>>
>> I just looked through the proposal and added comments. I think it would be 
>> helpful to also have a design doc that covers the choices from the draft 
>> spec. For instance, the choice to enumerate all possible function input 
>> struts rather than allowing generics and varargs.
>>
>> Here’s a quick summary of my feedback:
>>
>> I think that the choice to enumerate function signatures is limiting. It 
>> would be nice to see a discussion of the trade-offs and a rationale for the 
>> choice. I think it would also be very helpful to have a few representative 
>> use cases for this included in the doc. That way the proposal can 
>> demonstrate that it solves those use cases with reasonable trade-offs.
>> There are a few instances where this is inconsistent with conventions in 
>> other specs. For example, using string IDs rather than an integer.
>> This uses a very different model for spec versioning than the Iceberg view 
>> and table specs. It requires readers to fail if there are any unknown 
>> fields, which prevents the spec from adding things that are fully 
>> backward-compatible. Other Iceberg specs only require a version change to 
>> introduce forward-incompatible changes and I think that this should do the 
>> same to avoid confusion.
>> It looks like the intent is to allow multiple function signatures per 
>> verison, but it is unclear how to encode them because a version is 
>> associated with a single function signature.
>> There is no review of SQL syntax for creating functions across engines, so 
>> this doesn’t show that the metadata proposed is sufficient for cross-engine 
>> use cases.
>> The example for a table-valued function shows a SELECT statement and it 
>> isn’t clear how this is distinct from a view
>>
>>
>> On Thu, Aug 1, 2024 at 3:15 AM Ajantha Bhat <[email protected]> wrote:
>>>
>>> Thanks Walaa and Robert for the review on this.
>>>
>>> We didn't find any blocker for the spec.
>>> I will wait for a week and If no more review comments, I will raise a PR 
>>> for spec addition next week.
>>>
>>> If anyone else is interested, please have a look at the proposal 
>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit
>>>
>>> - Ajantha
>>>
>>> On Tue, Jul 16, 2024 at 1:27 PM Walaa Eldin Moustafa 
>>> <[email protected]> wrote:
>>>>
>>>> Hi Ajantha,
>>>>
>>>> I have left some comments. It is an interesting direction, but there might 
>>>> be some details that need to be fine tuned.
>>>>
>>>> The doc is here [1] for others who might be interested. Resharing since I 
>>>> do not think it was directly linked in the thread.
>>>>
>>>> [1] 
>>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit
>>>>
>>>> Thanks,
>>>> Walaa.
>>>>
>>>> On Mon, Jul 15, 2024 at 11:09 PM Ajantha Bhat <[email protected]> 
>>>> wrote:
>>>>>
>>>>> Hi, just another reminder since we didn't get any review on the proposal.
>>>>> Initially proposed on June 4.
>>>>>
>>>>> - Ajantha
>>>>>
>>>>> On Mon, Jun 24, 2024 at 4:21 PM Ajantha Bhat <[email protected]> 
>>>>> wrote:
>>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> We've only received one review so far (from Benny).
>>>>>>
>>>>>> We would appreciate more eyes on this.
>>>>>>
>>>>>> - Ajantha
>>>>>>
>>>>>> On Tue, Jun 4, 2024 at 7:25 AM Ajantha Bhat <[email protected]> 
>>>>>> wrote:
>>>>>>>
>>>>>>> Hi All,
>>>>>>> Please find the proposal link
>>>>>>> https://github.com/apache/iceberg/issues/10432
>>>>>>>
>>>>>>> Google doc link is attached in the proposal.
>>>>>>> And Thanks Stephen Lin for working on it.
>>>>>>>
>>>>>>> Hope it gives more clarity to take the decisions and how we want to 
>>>>>>> implement it.
>>>>>>>
>>>>>>> - Ajantha
>>>>>>>
>>>>>>> On Wed, May 29, 2024 at 4:01 AM Walaa Eldin Moustafa 
>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Thanks Jack. I actually meant scalar/aggregate/table user defined 
>>>>>>>> functions. Here are some examples of what I meant in (2):
>>>>>>>>
>>>>>>>> Hive GenericUDF: 
>>>>>>>> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDF.java
>>>>>>>> Trino user defined functions: 
>>>>>>>> https://trino.io/docs/current/develop/functions.html
>>>>>>>> Flink user defined functions: 
>>>>>>>> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/table/functions/udfs/
>>>>>>>>
>>>>>>>> Probably what you referred to is a variation of (1) where the API is 
>>>>>>>> data flow/data pipeline API instead of SQL (e.g., Spark Scala). Yes, 
>>>>>>>> that is also possible in the very long run :)
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Walaa.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, May 28, 2024 at 2:57 PM Jack Ye <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> > (2) Custom code written in imperative function according to a 
>>>>>>>>> > Java/Scala/Python API, etc.
>>>>>>>>>
>>>>>>>>> I think we could still explore some long term opportunities in this 
>>>>>>>>> case. Consider you register a Spark temp view as some sort of data 
>>>>>>>>> frame read, then it could still be resolved to a Spark plan that is 
>>>>>>>>> representable by an intermediate representation. But I agree this 
>>>>>>>>> gets very complicated very soon, and just having the case (1) covered 
>>>>>>>>> would already be a huge step forward.
>>>>>>>>>
>>>>>>>>> -Jack
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, May 28, 2024 at 1:40 PM Benny Chow <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> It's interesting to note that a tabular SQL UDF can be used to build 
>>>>>>>>>> a parameterized view.  So, there's definitely a lot in common 
>>>>>>>>>> between UDFs and views.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>> On Tue, May 28, 2024 at 9:53 AM Walaa Eldin Moustafa 
>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> I think there is a disconnect about what is perceived as a "UDF". 
>>>>>>>>>>> There are 2 flavors:
>>>>>>>>>>>
>>>>>>>>>>> (1) Functions that are defined by the user whose definition is a 
>>>>>>>>>>> composition of other built-in functions/SQL expressions.
>>>>>>>>>>> (2) Custom code written in imperative function according to a 
>>>>>>>>>>> Java/Scala/Python API, etc.
>>>>>>>>>>>
>>>>>>>>>>> All the examples in Ajantha's references are pretty much from (1) 
>>>>>>>>>>> and I think those have more analogy to views due to their SQL 
>>>>>>>>>>> nature. Agree (2) is not practical to maintain by Iceberg, but I 
>>>>>>>>>>> think Ajantha's use cases are around (1), and may be worth 
>>>>>>>>>>> evaluating.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Walaa.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, May 28, 2024 at 9:45 AM Ajantha Bhat 
>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> I guess we'll know more when you post the proposal, but I think 
>>>>>>>>>>>>> this would be a very difficult area to tackle across engines, 
>>>>>>>>>>>>> languages, and memory models without having a huge performance 
>>>>>>>>>>>>> penalty.
>>>>>>>>>>>>
>>>>>>>>>>>> Assuming Iceberg initially supports SQL representations of UDFs 
>>>>>>>>>>>> (similar to views as shared by the reference links above), the 
>>>>>>>>>>>> complexity involved will be similar to managing views.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks, Ryan, Robert, and Jack, for your input.
>>>>>>>>>>>> We will work on publishing the draft spec (inspired by the view 
>>>>>>>>>>>> spec) this week to facilitate further discussions.
>>>>>>>>>>>>
>>>>>>>>>>>> - Ajantha
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, May 28, 2024 at 7:33 PM Jack Ye <[email protected]> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> > While it would be great to have a common set of functions 
>>>>>>>>>>>>> > across engines, I don't see how that is practical when those 
>>>>>>>>>>>>> > engines are implemented so differently. Plugging in code -- and 
>>>>>>>>>>>>> > especially custom user-supplied code -- seems inherently 
>>>>>>>>>>>>> > specialized to me and should be part of the engines' design.
>>>>>>>>>>>>>
>>>>>>>>>>>>> How is this different from the views? I feel we can say exactly 
>>>>>>>>>>>>> the same thing for Iceberg views, but yet we have Iceberg 
>>>>>>>>>>>>> multi-dialect views implemented. Maybe it sounds like we are 
>>>>>>>>>>>>> trying to draw a line between SQL vs other programming language 
>>>>>>>>>>>>> as "code"? but I think SQL is just another type of code, and we 
>>>>>>>>>>>>> are already talking about compiling all these different code 
>>>>>>>>>>>>> dialects to an intermediate representation (using projects like 
>>>>>>>>>>>>> Coral, Substrait), which will be stored as another type of 
>>>>>>>>>>>>> representation of Iceberg view. I think the same functionality 
>>>>>>>>>>>>> can be used for UDFs if developed.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I actually hink adding UDF support is a good idea, even just a 
>>>>>>>>>>>>> multi-dialect one like view, and that can allow engines to for 
>>>>>>>>>>>>> example parse a view SQL, and when a function referenced cannot 
>>>>>>>>>>>>> be resolved, try to seek for a multi-dialect UDF definition.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I guess we can discuss more when we have the actual proposal 
>>>>>>>>>>>>> published.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Jack Ye
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, May 28, 2024 at 1:32 AM Robert Stupp <[email protected]> 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> UDFs are as engine specific and portable and "non-centralized" 
>>>>>>>>>>>>>> as views are. The same performance concerns apply to views as 
>>>>>>>>>>>>>> well.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Iceberg should define a common base upon which engines can 
>>>>>>>>>>>>>> build, so the argument that UDFs aren't practical, because 
>>>>>>>>>>>>>> engines are different, is probably only a temporary concern.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In the long term, Iceberg should also try to tackle the idea to 
>>>>>>>>>>>>>> make views portable, which is conceptually not that much 
>>>>>>>>>>>>>> different from portable UDFs.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> PS: I'm not a fan of adding a negative touch to the idea of 
>>>>>>>>>>>>>> having UDFs in Iceberg, especially not in this early stage.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 24.05.24 20:53, Ryan Blue wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks, Ajantha.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm skeptical about whether it's a good idea to add UDFs tracked 
>>>>>>>>>>>>>> by Iceberg catalogs. I think that Iceberg primarily deals with 
>>>>>>>>>>>>>> things that are centralized, like tables of data. While it would 
>>>>>>>>>>>>>> be great to have a common set of functions across engines, I 
>>>>>>>>>>>>>> don't see how that is practical when those engines are 
>>>>>>>>>>>>>> implemented so differently. Plugging in code -- and especially 
>>>>>>>>>>>>>> custom user-supplied code -- seems inherently specialized to me 
>>>>>>>>>>>>>> and should be part of the engines' design.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I guess we'll know more when you post the proposal, but I think 
>>>>>>>>>>>>>> this would be a very difficult area to tackle across engines, 
>>>>>>>>>>>>>> languages, and memory models without having a huge performance 
>>>>>>>>>>>>>> penalty.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, May 24, 2024 at 8:10 AM Ajantha Bhat 
>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Everyone,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This is a discussion to gauge the community interest in storing 
>>>>>>>>>>>>>>> the Versioned SQL UDFs in Iceberg.
>>>>>>>>>>>>>>> We want to propose the spec addition for storing the versioned 
>>>>>>>>>>>>>>> UDFs in Iceberg (inspired by view spec).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> These UDFs can operate similarly to views in that they are 
>>>>>>>>>>>>>>> associated with tables, but they can accept arguments and 
>>>>>>>>>>>>>>> produce return values, or even function as inline expressions.
>>>>>>>>>>>>>>> Many Query engines like Dremio, Trino, Snowflake, Databricks 
>>>>>>>>>>>>>>> Spark supports SQL UDFs at catalog level [1].
>>>>>>>>>>>>>>> But storing them in Iceberg can enable
>>>>>>>>>>>>>>> - Versioning of these UDFs.
>>>>>>>>>>>>>>> - Interoperability between the engines. Potentially engines can 
>>>>>>>>>>>>>>> understand the UDFs written by other engines (with the 
>>>>>>>>>>>>>>> translate layer).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We believe that integrating this feature into Iceberg would be 
>>>>>>>>>>>>>>> a valuable addition, and we're eager to collaborate with the 
>>>>>>>>>>>>>>> community to develop a UDF specification.
>>>>>>>>>>>>>>> Stephen has already begun drafting a specification to propose 
>>>>>>>>>>>>>>> to the community.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Let us know your thoughts on this.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>> Dremio - 
>>>>>>>>>>>>>>> https://docs.dremio.com/current/reference/sql/commands/functions#creating-a-function
>>>>>>>>>>>>>>> Trino - https://trino.io/docs/current/sql/create-function.html
>>>>>>>>>>>>>>> Snowflake - 
>>>>>>>>>>>>>>> https://docs.snowflake.com/en/developer-guide/udf/sql/udf-sql-scalar-functions
>>>>>>>>>>>>>>> Databricks - 
>>>>>>>>>>>>>>> https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-sql-function.html
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - Ajantha
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>> Tabular
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Robert Stupp
>>>>>>>>>>>>>> @snazy
>>
>>
>>
>> --
>> Ryan Blue
>> Databricks

Re: [Discussion] Versioned SQL UDFs (Catalog routines) in Iceberg

Reply via email to