Thanks, Walaa and Robert, for the review on this. We didn't find any blockers for the spec. I will wait a week, and if there are no more review comments, I will raise a PR for the spec addition next week.
If anyone else is interested, please have a look at the proposal:
https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit

- Ajantha

On Tue, Jul 16, 2024 at 1:27 PM Walaa Eldin Moustafa <[email protected]> wrote:

> Hi Ajantha,
>
> I have left some comments. It is an interesting direction, but there might
> be some details that need to be fine tuned.
>
> The doc is here [1] for others who might be interested. Resharing since I
> do not think it was directly linked in the thread.
>
> [1]
> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit
>
> Thanks,
> Walaa.
>
> On Mon, Jul 15, 2024 at 11:09 PM Ajantha Bhat <[email protected]> wrote:
>
>> Hi, just another reminder since we didn't get any review on the proposal.
>> Initially proposed on June 4.
>>
>> - Ajantha
>>
>> On Mon, Jun 24, 2024 at 4:21 PM Ajantha Bhat <[email protected]> wrote:
>>
>>> Hi everyone,
>>>
>>> We've only received one review so far (from Benny).
>>>
>>> We would appreciate more eyes on this.
>>>
>>> - Ajantha
>>>
>>> On Tue, Jun 4, 2024 at 7:25 AM Ajantha Bhat <[email protected]> wrote:
>>>
>>>> Hi All,
>>>> Please find the proposal link
>>>> https://github.com/apache/iceberg/issues/10432
>>>>
>>>> Google doc link is attached in the proposal.
>>>> And Thanks Stephen Lin <https://github.com/sxlin> for working on it.
>>>>
>>>> Hope it gives more clarity to take the decisions and how we want to
>>>> implement it.
>>>>
>>>> - Ajantha
>>>>
>>>> On Wed, May 29, 2024 at 4:01 AM Walaa Eldin Moustafa <[email protected]> wrote:
>>>>
>>>>> Thanks Jack. I actually meant scalar/aggregate/table user defined
>>>>> functions.
>>>>> Here are some examples of what I meant in (2):
>>>>>
>>>>> Hive GenericUDF:
>>>>> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDF.java
>>>>> Trino user defined functions:
>>>>> https://trino.io/docs/current/develop/functions.html
>>>>> Flink user defined functions:
>>>>> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/table/functions/udfs/
>>>>>
>>>>> Probably what you referred to is a variation of (1) where the API is
>>>>> data flow/data pipeline API instead of SQL (e.g., Spark Scala). Yes, that
>>>>> is also possible in the very long run :)
>>>>>
>>>>> Thanks,
>>>>> Walaa.
>>>>>
>>>>> On Tue, May 28, 2024 at 2:57 PM Jack Ye <[email protected]> wrote:
>>>>>
>>>>>> > (2) Custom code written in imperative function according to a
>>>>>> > Java/Scala/Python API, etc.
>>>>>>
>>>>>> I think we could still explore some long term opportunities in this
>>>>>> case. Consider you register a Spark temp view as some sort of data frame
>>>>>> read, then it could still be resolved to a Spark plan that is representable
>>>>>> by an intermediate representation. But I agree this gets very complicated
>>>>>> very soon, and just having the case (1) covered would already be a huge
>>>>>> step forward.
>>>>>>
>>>>>> -Jack
>>>>>>
>>>>>> On Tue, May 28, 2024 at 1:40 PM Benny Chow <[email protected]> wrote:
>>>>>>
>>>>>>> It's interesting to note that a tabular SQL UDF can be used to build
>>>>>>> a *parameterized* view. So, there's definitely a lot in common
>>>>>>> between UDFs and views.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> On Tue, May 28, 2024 at 9:53 AM Walaa Eldin Moustafa <[email protected]> wrote:
>>>>>>>
>>>>>>>> I think there is a disconnect about what is perceived as a "UDF".
>>>>>>>> There are 2 flavors:
>>>>>>>>
>>>>>>>> (1) Functions that are defined by the user whose definition is a
>>>>>>>> composition of other built-in functions/SQL expressions.
>>>>>>>> (2) Custom code written in imperative function according to a
>>>>>>>> Java/Scala/Python API, etc.
>>>>>>>>
>>>>>>>> All the examples in Ajantha's references are pretty much from (1)
>>>>>>>> and I think those have more analogy to views due to their SQL nature. Agree
>>>>>>>> (2) is not practical to maintain by Iceberg, but I think Ajantha's use
>>>>>>>> cases are around (1), and may be worth evaluating.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Walaa.
>>>>>>>>
>>>>>>>> On Tue, May 28, 2024 at 9:45 AM Ajantha Bhat <[email protected]> wrote:
>>>>>>>>
>>>>>>>>>> I guess we'll know more when you post the proposal, but I think
>>>>>>>>>> this would be a very difficult area to tackle across engines, languages,
>>>>>>>>>> and memory models without having a huge performance penalty.
>>>>>>>>>
>>>>>>>>> Assuming Iceberg initially supports SQL representations of UDFs
>>>>>>>>> (similar to views as shared by the reference links above), the complexity
>>>>>>>>> involved will be similar to managing views.
>>>>>>>>>
>>>>>>>>> Thanks, Ryan, Robert, and Jack, for your input.
>>>>>>>>> We will work on publishing the draft spec (inspired by the view
>>>>>>>>> spec) this week to facilitate further discussions.
>>>>>>>>>
>>>>>>>>> - Ajantha
>>>>>>>>>
>>>>>>>>> On Tue, May 28, 2024 at 7:33 PM Jack Ye <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> > While it would be great to have a common set of functions
>>>>>>>>>> > across engines, I don't see how that is practical when those engines are
>>>>>>>>>> > implemented so differently. Plugging in code -- and especially custom
>>>>>>>>>> > user-supplied code -- seems inherently specialized to me and should be part
>>>>>>>>>> > of the engines' design.
>>>>>>>>>>
>>>>>>>>>> How is this different from the views? I feel we can say exactly
>>>>>>>>>> the same thing for Iceberg views, but yet we have Iceberg multi-dialect
>>>>>>>>>> views implemented.
>>>>>>>>>> Maybe it sounds like we are trying to draw a line
>>>>>>>>>> between SQL vs other programming language as "code"? but I think SQL is
>>>>>>>>>> just another type of code, and we are already talking about compiling all
>>>>>>>>>> these different code dialects to an intermediate representation (using
>>>>>>>>>> projects like Coral, Substrait), which will be stored as another type of
>>>>>>>>>> representation of Iceberg view. I think the same functionality can be used
>>>>>>>>>> for UDFs if developed.
>>>>>>>>>>
>>>>>>>>>> I actually think adding UDF support is a good idea, even just a
>>>>>>>>>> multi-dialect one like view, and that can allow engines to for example
>>>>>>>>>> parse a view SQL, and when a function referenced cannot be resolved, try to
>>>>>>>>>> seek for a multi-dialect UDF definition.
>>>>>>>>>>
>>>>>>>>>> I guess we can discuss more when we have the actual proposal
>>>>>>>>>> published.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Jack Ye
>>>>>>>>>>
>>>>>>>>>> On Tue, May 28, 2024 at 1:32 AM Robert Stupp <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> UDFs are as engine specific and portable and "non-centralized"
>>>>>>>>>>> as views are. The same performance concerns apply to views as well.
>>>>>>>>>>> Iceberg should define a common base upon which engines can
>>>>>>>>>>> build, so the argument that UDFs aren't practical, because engines are
>>>>>>>>>>> different, is probably only a temporary concern.
>>>>>>>>>>>
>>>>>>>>>>> In the long term, Iceberg should also try to tackle the idea to
>>>>>>>>>>> make views portable, which is conceptually not that much different from
>>>>>>>>>>> portable UDFs.
>>>>>>>>>>>
>>>>>>>>>>> PS: I'm not a fan of adding a negative touch to the idea of
>>>>>>>>>>> having UDFs in Iceberg, especially not in this early stage.
>>>>>>>>>>>
>>>>>>>>>>> On 24.05.24 20:53, Ryan Blue wrote:
>>>>>>>>>>>
>>>>>>>>>>> Thanks, Ajantha.
>>>>>>>>>>>
>>>>>>>>>>> I'm skeptical about whether it's a good idea to add UDFs tracked
>>>>>>>>>>> by Iceberg catalogs. I think that Iceberg primarily deals with things that
>>>>>>>>>>> are centralized, like tables of data. While it would be great to have a
>>>>>>>>>>> common set of functions across engines, I don't see how that is practical
>>>>>>>>>>> when those engines are implemented so differently. Plugging in code -- and
>>>>>>>>>>> especially custom user-supplied code -- seems inherently specialized to me
>>>>>>>>>>> and should be part of the engines' design.
>>>>>>>>>>>
>>>>>>>>>>> I guess we'll know more when you post the proposal, but I think
>>>>>>>>>>> this would be a very difficult area to tackle across engines, languages,
>>>>>>>>>>> and memory models without having a huge performance penalty.
>>>>>>>>>>>
>>>>>>>>>>> Ryan
>>>>>>>>>>>
>>>>>>>>>>> On Fri, May 24, 2024 at 8:10 AM Ajantha Bhat <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Everyone,
>>>>>>>>>>>>
>>>>>>>>>>>> This is a discussion to gauge the community interest in storing
>>>>>>>>>>>> the Versioned SQL UDFs in Iceberg.
>>>>>>>>>>>> We want to propose the spec addition for storing the versioned
>>>>>>>>>>>> UDFs in Iceberg (inspired by view spec).
>>>>>>>>>>>>
>>>>>>>>>>>> These UDFs can operate similarly to views in that they are
>>>>>>>>>>>> associated with tables, but they can accept arguments and produce return
>>>>>>>>>>>> values, or even function as inline expressions.
>>>>>>>>>>>> Many query engines like Dremio, Trino, Snowflake, and Databricks
>>>>>>>>>>>> Spark support SQL UDFs at catalog level [1].
>>>>>>>>>>>> But storing them in Iceberg can enable
>>>>>>>>>>>> - Versioning of these UDFs.
>>>>>>>>>>>> - Interoperability between the engines.
>>>>>>>>>>>> Potentially engines can
>>>>>>>>>>>> understand the UDFs written by other engines (with the translate
>>>>>>>>>>>> layer).
>>>>>>>>>>>>
>>>>>>>>>>>> We believe that integrating this feature into Iceberg would be
>>>>>>>>>>>> a valuable addition, and we're eager to collaborate with the community to
>>>>>>>>>>>> develop a UDF specification.
>>>>>>>>>>>> Stephen <[email protected]> has already begun drafting a
>>>>>>>>>>>> specification to propose to the community.
>>>>>>>>>>>>
>>>>>>>>>>>> Let us know your thoughts on this.
>>>>>>>>>>>>
>>>>>>>>>>>> [1]
>>>>>>>>>>>> Dremio -
>>>>>>>>>>>> https://docs.dremio.com/current/reference/sql/commands/functions#creating-a-function
>>>>>>>>>>>> Trino - https://trino.io/docs/current/sql/create-function.html
>>>>>>>>>>>> Snowflake -
>>>>>>>>>>>> https://docs.snowflake.com/en/developer-guide/udf/sql/udf-sql-scalar-functions
>>>>>>>>>>>> Databricks -
>>>>>>>>>>>> https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-sql-function.html
>>>>>>>>>>>>
>>>>>>>>>>>> - Ajantha
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Ryan Blue
>>>>>>>>>>> Tabular
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Robert Stupp
>>>>>>>>>>> @snazy
