Hi everyone, We've only received one review so far (from Benny).
We would appreciate more eyes on this. - Ajantha On Tue, Jun 4, 2024 at 7:25 AM Ajantha Bhat <ajanthab...@gmail.com> wrote: > Hi All, > Please find the proposal link > https://github.com/apache/iceberg/issues/10432 > > Google doc link is attached in the proposal. > And Thanks Stephen Lin <https://github.com/sxlin> for working on it. > > Hope it gives more clarity to take the decisions and how we want to > implement it. > > - Ajantha > > On Wed, May 29, 2024 at 4:01 AM Walaa Eldin Moustafa < > wa.moust...@gmail.com> wrote: > >> Thanks Jack. I actually meant scalar/aggregate/table user defined >> functions. Here are some examples of what I meant in (2): >> >> Hive GenericUDF: >> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDF.java >> Trino user defined functions: >> https://trino.io/docs/current/develop/functions.html >> Flink user defined functions: >> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/table/functions/udfs/ >> >> Probably what you referred to is a variation of (1) where the API is data >> flow/data pipeline API instead of SQL (e.g., Spark Scala). Yes, that is >> also possible in the very long run :) >> >> Thanks, >> Walaa. >> >> >> >> >> On Tue, May 28, 2024 at 2:57 PM Jack Ye <yezhao...@gmail.com> wrote: >> >>> > (2) Custom code written in imperative function according to a >>> Java/Scala/Python API, etc. >>> >>> I think we could still explore some long term opportunities in this >>> case. Consider you register a Spark temp view as some sort of data frame >>> read, then it could still be resolved to a Spark plan that is representable >>> by an intermediate representation. But I agree this gets very complicated >>> very soon, and just having the case (1) covered would already be a huge >>> step forward. >>> >>> -Jack >>> >>> >>> On Tue, May 28, 2024 at 1:40 PM Benny Chow <btc...@gmail.com> wrote: >>> >>>> It's interesting to note that a tabular SQL UDF can be used to build a >>>> *parameterized >>>> *view. So, there's definitely a lot in common between UDFs and views. >>>> >>>> Thanks >>>> >>>> On Tue, May 28, 2024 at 9:53 AM Walaa Eldin Moustafa < >>>> wa.moust...@gmail.com> wrote: >>>> >>>>> I think there is a disconnect about what is perceived as a "UDF". >>>>> There are 2 flavors: >>>>> >>>>> (1) Functions that are defined by the user whose definition is a >>>>> composition of other built-in functions/SQL expressions. >>>>> (2) Custom code written in imperative function according to a >>>>> Java/Scala/Python API, etc. >>>>> >>>>> All the examples in Ajantha's references are pretty much from (1) and >>>>> I think those have more analogy to views due to their SQL nature. Agree >>>>> (2) >>>>> is not practical to maintain by Iceberg, but I think Ajantha's use cases >>>>> are around (1), and may be worth evaluating. >>>>> >>>>> Thanks, >>>>> Walaa. >>>>> >>>>> >>>>> On Tue, May 28, 2024 at 9:45 AM Ajantha Bhat <ajanthab...@gmail.com> >>>>> wrote: >>>>> >>>>>> I guess we'll know more when you post the proposal, but I think this >>>>>>> would be a very difficult area to tackle across engines, languages, and >>>>>>> memory models without having a huge performance penalty. >>>>>> >>>>>> Assuming Iceberg initially supports SQL representations of UDFs >>>>>> (similar to views as shared by the reference links above), the complexity >>>>>> involved will be similar to managing views. >>>>>> >>>>>> Thanks, Ryan, Robert, and Jack, for your input. >>>>>> We will work on publishing the draft spec (inspired by the view spec) >>>>>> this week to facilitate further discussions. >>>>>> >>>>>> - Ajantha >>>>>> >>>>>> On Tue, May 28, 2024 at 7:33 PM Jack Ye <yezhao...@gmail.com> wrote: >>>>>> >>>>>>> > While it would be great to have a common set of functions across >>>>>>> engines, I don't see how that is practical when those engines are >>>>>>> implemented so differently. Plugging in code -- and especially custom >>>>>>> user-supplied code -- seems inherently specialized to me and should be >>>>>>> part >>>>>>> of the engines' design. >>>>>>> >>>>>>> How is this different from the views? I feel we can say exactly the >>>>>>> same thing for Iceberg views, but yet we have Iceberg multi-dialect >>>>>>> views >>>>>>> implemented. Maybe it sounds like we are trying to draw a line between >>>>>>> SQL >>>>>>> vs other programming language as "code"? but I think SQL is just another >>>>>>> type of code, and we are already talking about compiling all these >>>>>>> different code dialects to an intermediate representation (using >>>>>>> projects >>>>>>> like Coral, Substrait), which will be stored as another type of >>>>>>> representation of Iceberg view. I think the same functionality can be >>>>>>> used >>>>>>> for UDFs if developed. >>>>>>> >>>>>>> I actually hink adding UDF support is a good idea, even just a >>>>>>> multi-dialect one like view, and that can allow engines to for example >>>>>>> parse a view SQL, and when a function referenced cannot be resolved, >>>>>>> try to >>>>>>> seek for a multi-dialect UDF definition. >>>>>>> >>>>>>> I guess we can discuss more when we have the actual proposal >>>>>>> published. >>>>>>> >>>>>>> Best, >>>>>>> Jack Ye >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Tue, May 28, 2024 at 1:32 AM Robert Stupp <sn...@snazy.de> wrote: >>>>>>> >>>>>>>> UDFs are as engine specific and portable and "non-centralized" as >>>>>>>> views are. The same performance concerns apply to views as well. >>>>>>>> Iceberg should define a common base upon which engines can build, >>>>>>>> so the argument that UDFs aren't practical, because engines are >>>>>>>> different, >>>>>>>> is probably only a temporary concern. >>>>>>>> >>>>>>>> In the long term, Iceberg should also try to tackle the idea to >>>>>>>> make views portable, which is conceptually not that much different from >>>>>>>> portable UDFs. >>>>>>>> >>>>>>>> >>>>>>>> PS: I'm not a fan of adding a negative touch to the idea of having >>>>>>>> UDFs in Iceberg, especially not in this early stage. >>>>>>>> >>>>>>>> >>>>>>>> On 24.05.24 20:53, Ryan Blue wrote: >>>>>>>> >>>>>>>> Thanks, Ajantha. >>>>>>>> >>>>>>>> I'm skeptical about whether it's a good idea to add UDFs tracked by >>>>>>>> Iceberg catalogs. I think that Iceberg primarily deals with things >>>>>>>> that are >>>>>>>> centralized, like tables of data. While it would be great to have a >>>>>>>> common >>>>>>>> set of functions across engines, I don't see how that is practical when >>>>>>>> those engines are implemented so differently. Plugging in code -- and >>>>>>>> especially custom user-supplied code -- seems inherently specialized >>>>>>>> to me >>>>>>>> and should be part of the engines' design. >>>>>>>> >>>>>>>> I guess we'll know more when you post the proposal, but I think >>>>>>>> this would be a very difficult area to tackle across engines, >>>>>>>> languages, >>>>>>>> and memory models without having a huge performance penalty. >>>>>>>> >>>>>>>> Ryan >>>>>>>> >>>>>>>> On Fri, May 24, 2024 at 8:10 AM Ajantha Bhat <ajanthab...@gmail.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Hi Everyone, >>>>>>>>> >>>>>>>>> This is a discussion to gauge the community interest in storing >>>>>>>>> the Versioned SQL UDFs in Iceberg. >>>>>>>>> We want to propose the spec addition for storing the versioned >>>>>>>>> UDFs in Iceberg (inspired by view spec). >>>>>>>>> >>>>>>>>> These UDFs can operate similarly to views in that they are >>>>>>>>> associated with tables, but they can accept arguments and produce >>>>>>>>> return >>>>>>>>> values, or even function as inline expressions. >>>>>>>>> Many Query engines like Dremio, Trino, Snowflake, Databricks Spark >>>>>>>>> supports SQL UDFs at catalog level [1]. >>>>>>>>> But storing them in Iceberg can enable >>>>>>>>> - Versioning of these UDFs. >>>>>>>>> - Interoperability between the engines. Potentially engines can >>>>>>>>> understand the UDFs written by other engines (with the translate >>>>>>>>> layer). >>>>>>>>> >>>>>>>>> We believe that integrating this feature into Iceberg would be a >>>>>>>>> valuable addition, and we're eager to collaborate with the community >>>>>>>>> to >>>>>>>>> develop a UDF specification. >>>>>>>>> Stephen <stephen....@dremio.com> has already begun drafting a >>>>>>>>> specification to propose to the community. >>>>>>>>> >>>>>>>>> Let us know your thoughts on this. >>>>>>>>> >>>>>>>>> [1] >>>>>>>>> Dremio - >>>>>>>>> https://docs.dremio.com/current/reference/sql/commands/functions#creating-a-function >>>>>>>>> Trino - https://trino.io/docs/current/sql/create-function.html >>>>>>>>> Snowflake - >>>>>>>>> https://docs.snowflake.com/en/developer-guide/udf/sql/udf-sql-scalar-functions >>>>>>>>> Databricks - >>>>>>>>> https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-sql-function.html >>>>>>>>> >>>>>>>>> - Ajantha >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> Ryan Blue >>>>>>>> Tabular >>>>>>>> >>>>>>>> -- >>>>>>>> Robert Stupp >>>>>>>> @snazy >>>>>>>> >>>>>>>>