Hi All,

Please find the proposal here: https://github.com/apache/iceberg/issues/10432
The Google doc link is attached in the proposal. Thanks to Stephen Lin <https://github.com/sxlin> for working on it.
I hope it gives more clarity for deciding how we want to implement this.

- Ajantha

On Wed, May 29, 2024 at 4:01 AM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:

> Thanks Jack. I actually meant scalar/aggregate/table user defined
> functions. Here are some examples of what I meant in (2):
>
> Hive GenericUDF:
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDF.java
> Trino user defined functions:
> https://trino.io/docs/current/develop/functions.html
> Flink user defined functions:
> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/table/functions/udfs/
>
> Probably what you referred to is a variation of (1) where the API is a data
> flow/data pipeline API instead of SQL (e.g., Spark Scala). Yes, that is
> also possible in the very long run :)
>
> Thanks,
> Walaa.
>
> On Tue, May 28, 2024 at 2:57 PM Jack Ye <yezhao...@gmail.com> wrote:
>
>> > (2) Custom code written in imperative function according to a
>> Java/Scala/Python API, etc.
>>
>> I think we could still explore some long-term opportunities in this case.
>> Suppose you register a Spark temp view as some sort of data frame read;
>> it could still be resolved to a Spark plan that is representable by an
>> intermediate representation. But I agree this gets very complicated very
>> quickly, and just having case (1) covered would already be a huge step
>> forward.
>>
>> -Jack
>>
>> On Tue, May 28, 2024 at 1:40 PM Benny Chow <btc...@gmail.com> wrote:
>>
>>> It's interesting to note that a tabular SQL UDF can be used to build a
>>> *parameterized* view. So, there's definitely a lot in common between
>>> UDFs and views.
>>>
>>> Thanks
>>>
>>> On Tue, May 28, 2024 at 9:53 AM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>>>
>>>> I think there is a disconnect about what is perceived as a "UDF".
>>>> There are 2 flavors:
>>>>
>>>> (1) Functions that are defined by the user whose definition is a
>>>> composition of other built-in functions/SQL expressions.
>>>> (2) Custom code written in imperative function according to a
>>>> Java/Scala/Python API, etc.
>>>>
>>>> All the examples in Ajantha's references are pretty much from (1), and I
>>>> think those are more analogous to views due to their SQL nature. I agree (2)
>>>> is not practical for Iceberg to maintain, but I think Ajantha's use cases
>>>> are around (1) and may be worth evaluating.
>>>>
>>>> Thanks,
>>>> Walaa.
>>>>
>>>> On Tue, May 28, 2024 at 9:45 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>>>>
>>>>>> I guess we'll know more when you post the proposal, but I think this
>>>>>> would be a very difficult area to tackle across engines, languages, and
>>>>>> memory models without having a huge performance penalty.
>>>>>
>>>>> Assuming Iceberg initially supports SQL representations of UDFs
>>>>> (similar to views, as shared in the reference links above), the complexity
>>>>> involved will be similar to managing views.
>>>>>
>>>>> Thanks, Ryan, Robert, and Jack, for your input.
>>>>> We will work on publishing the draft spec (inspired by the view spec)
>>>>> this week to facilitate further discussions.
>>>>>
>>>>> - Ajantha
>>>>>
>>>>> On Tue, May 28, 2024 at 7:33 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>
>>>>>> > While it would be great to have a common set of functions across
>>>>>> engines, I don't see how that is practical when those engines are
>>>>>> implemented so differently. Plugging in code -- and especially custom
>>>>>> user-supplied code -- seems inherently specialized to me and should be
>>>>>> part of the engines' design.
>>>>>>
>>>>>> How is this different from the views? I feel we can say exactly the
>>>>>> same thing for Iceberg views, and yet we have Iceberg multi-dialect views
>>>>>> implemented.
>>>>>> Maybe it sounds like we are trying to draw a line between SQL
>>>>>> and other programming languages as "code"? But I think SQL is just another
>>>>>> type of code, and we are already talking about compiling all these
>>>>>> different code dialects to an intermediate representation (using projects
>>>>>> like Coral, Substrait), which will be stored as another type of
>>>>>> representation of an Iceberg view. I think the same functionality can be
>>>>>> used for UDFs if developed.
>>>>>>
>>>>>> I actually think adding UDF support is a good idea, even just a
>>>>>> multi-dialect one like views. That would allow engines to, for example,
>>>>>> parse a view SQL and, when a referenced function cannot be resolved,
>>>>>> try to look up a multi-dialect UDF definition.
>>>>>>
>>>>>> I guess we can discuss more when we have the actual proposal
>>>>>> published.
>>>>>>
>>>>>> Best,
>>>>>> Jack Ye
>>>>>>
>>>>>> On Tue, May 28, 2024 at 1:32 AM Robert Stupp <sn...@snazy.de> wrote:
>>>>>>
>>>>>>> UDFs are as engine specific and portable and "non-centralized" as
>>>>>>> views are. The same performance concerns apply to views as well.
>>>>>>> Iceberg should define a common base upon which engines can build, so
>>>>>>> the argument that UDFs aren't practical because engines are different
>>>>>>> is probably only a temporary concern.
>>>>>>>
>>>>>>> In the long term, Iceberg should also try to tackle the idea of making
>>>>>>> views portable, which is conceptually not that much different from
>>>>>>> portable UDFs.
>>>>>>>
>>>>>>> PS: I'm not a fan of adding a negative touch to the idea of having
>>>>>>> UDFs in Iceberg, especially not at this early stage.
>>>>>>>
>>>>>>> On 24.05.24 20:53, Ryan Blue wrote:
>>>>>>>
>>>>>>> Thanks, Ajantha.
>>>>>>>
>>>>>>> I'm skeptical about whether it's a good idea to add UDFs tracked by
>>>>>>> Iceberg catalogs.
>>>>>>> I think that Iceberg primarily deals with things that are
>>>>>>> centralized, like tables of data. While it would be great to have a
>>>>>>> common set of functions across engines, I don't see how that is practical
>>>>>>> when those engines are implemented so differently. Plugging in code -- and
>>>>>>> especially custom user-supplied code -- seems inherently specialized to
>>>>>>> me and should be part of the engines' design.
>>>>>>>
>>>>>>> I guess we'll know more when you post the proposal, but I think this
>>>>>>> would be a very difficult area to tackle across engines, languages, and
>>>>>>> memory models without having a huge performance penalty.
>>>>>>>
>>>>>>> Ryan
>>>>>>>
>>>>>>> On Fri, May 24, 2024 at 8:10 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Everyone,
>>>>>>>>
>>>>>>>> This is a discussion to gauge the community's interest in storing
>>>>>>>> versioned SQL UDFs in Iceberg.
>>>>>>>> We want to propose a spec addition for storing versioned UDFs
>>>>>>>> in Iceberg (inspired by the view spec).
>>>>>>>>
>>>>>>>> These UDFs can operate similarly to views in that they are
>>>>>>>> associated with tables, but they can accept arguments and produce
>>>>>>>> return values, or even function as inline expressions.
>>>>>>>> Many query engines such as Dremio, Trino, Snowflake, and Databricks Spark
>>>>>>>> support SQL UDFs at the catalog level [1].
>>>>>>>> But storing them in Iceberg can enable:
>>>>>>>> - Versioning of these UDFs.
>>>>>>>> - Interoperability between the engines. Potentially, engines can
>>>>>>>> understand the UDFs written by other engines (with a translation
>>>>>>>> layer).
>>>>>>>>
>>>>>>>> We believe that integrating this feature into Iceberg would be a
>>>>>>>> valuable addition, and we're eager to collaborate with the community to
>>>>>>>> develop a UDF specification.
>>>>>>>> Stephen <stephen....@dremio.com> has already begun drafting a
>>>>>>>> specification to propose to the community.
>>>>>>>>
>>>>>>>> Let us know your thoughts on this.
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> Dremio -
>>>>>>>> https://docs.dremio.com/current/reference/sql/commands/functions#creating-a-function
>>>>>>>> Trino - https://trino.io/docs/current/sql/create-function.html
>>>>>>>> Snowflake -
>>>>>>>> https://docs.snowflake.com/en/developer-guide/udf/sql/udf-sql-scalar-functions
>>>>>>>> Databricks -
>>>>>>>> https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-sql-function.html
>>>>>>>>
>>>>>>>> - Ajantha
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Tabular
>>>>>>>
>>>>>>> --
>>>>>>> Robert Stupp
>>>>>>> @snazy