Hi Ajantha, I have left some comments. It is an interesting direction, but there might be some details that need to be fine tuned.
The doc is here [1] for others who might be interested. Resharing since I do not think it was directly linked in the thread. [1] https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit Thanks, Walaa. On Mon, Jul 15, 2024 at 11:09 PM Ajantha Bhat <ajanthab...@gmail.com> wrote: > Hi, just another reminder since we didn't get any review on the proposal. > Initially proposed on June 4. > > - Ajantha > > On Mon, Jun 24, 2024 at 4:21 PM Ajantha Bhat <ajanthab...@gmail.com> > wrote: > >> Hi everyone, >> >> We've only received one review so far (from Benny). >> >> We would appreciate more eyes on this. >> >> - Ajantha >> >> On Tue, Jun 4, 2024 at 7:25 AM Ajantha Bhat <ajanthab...@gmail.com> >> wrote: >> >>> Hi All, >>> Please find the proposal link >>> https://github.com/apache/iceberg/issues/10432 >>> >>> Google doc link is attached in the proposal. >>> And Thanks Stephen Lin <https://github.com/sxlin> for working on it. >>> >>> Hope it gives more clarity to take the decisions and how we want to >>> implement it. >>> >>> - Ajantha >>> >>> On Wed, May 29, 2024 at 4:01 AM Walaa Eldin Moustafa < >>> wa.moust...@gmail.com> wrote: >>> >>>> Thanks Jack. I actually meant scalar/aggregate/table user defined >>>> functions. Here are some examples of what I meant in (2): >>>> >>>> Hive GenericUDF: >>>> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDF.java >>>> Trino user defined functions: >>>> https://trino.io/docs/current/develop/functions.html >>>> Flink user defined functions: >>>> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/table/functions/udfs/ >>>> >>>> Probably what you referred to is a variation of (1) where the API is >>>> data flow/data pipeline API instead of SQL (e.g., Spark Scala). Yes, that >>>> is also possible in the very long run :) >>>> >>>> Thanks, >>>> Walaa. >>>> >>>> >>>> >>>> >>>> On Tue, May 28, 2024 at 2:57 PM Jack Ye <yezhao...@gmail.com> wrote: >>>> >>>>> > (2) Custom code written in imperative function according to a >>>>> Java/Scala/Python API, etc. >>>>> >>>>> I think we could still explore some long term opportunities in this >>>>> case. Consider you register a Spark temp view as some sort of data frame >>>>> read, then it could still be resolved to a Spark plan that is >>>>> representable >>>>> by an intermediate representation. But I agree this gets very complicated >>>>> very soon, and just having the case (1) covered would already be a huge >>>>> step forward. >>>>> >>>>> -Jack >>>>> >>>>> >>>>> On Tue, May 28, 2024 at 1:40 PM Benny Chow <btc...@gmail.com> wrote: >>>>> >>>>>> It's interesting to note that a tabular SQL UDF can be used to build >>>>>> a *parameterized *view. So, there's definitely a lot in common >>>>>> between UDFs and views. >>>>>> >>>>>> Thanks >>>>>> >>>>>> On Tue, May 28, 2024 at 9:53 AM Walaa Eldin Moustafa < >>>>>> wa.moust...@gmail.com> wrote: >>>>>> >>>>>>> I think there is a disconnect about what is perceived as a "UDF". >>>>>>> There are 2 flavors: >>>>>>> >>>>>>> (1) Functions that are defined by the user whose definition is a >>>>>>> composition of other built-in functions/SQL expressions. >>>>>>> (2) Custom code written in imperative function according to a >>>>>>> Java/Scala/Python API, etc. >>>>>>> >>>>>>> All the examples in Ajantha's references are pretty much from (1) >>>>>>> and I think those have more analogy to views due to their SQL nature. >>>>>>> Agree >>>>>>> (2) is not practical to maintain by Iceberg, but I think Ajantha's use >>>>>>> cases are around (1), and may be worth evaluating. >>>>>>> >>>>>>> Thanks, >>>>>>> Walaa. >>>>>>> >>>>>>> >>>>>>> On Tue, May 28, 2024 at 9:45 AM Ajantha Bhat <ajanthab...@gmail.com> >>>>>>> wrote: >>>>>>> >>>>>>>> I guess we'll know more when you post the proposal, but I think >>>>>>>>> this would be a very difficult area to tackle across engines, >>>>>>>>> languages, >>>>>>>>> and memory models without having a huge performance penalty. >>>>>>>> >>>>>>>> Assuming Iceberg initially supports SQL representations of UDFs >>>>>>>> (similar to views as shared by the reference links above), the >>>>>>>> complexity >>>>>>>> involved will be similar to managing views. >>>>>>>> >>>>>>>> Thanks, Ryan, Robert, and Jack, for your input. >>>>>>>> We will work on publishing the draft spec (inspired by the view >>>>>>>> spec) this week to facilitate further discussions. >>>>>>>> >>>>>>>> - Ajantha >>>>>>>> >>>>>>>> On Tue, May 28, 2024 at 7:33 PM Jack Ye <yezhao...@gmail.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> > While it would be great to have a common set of functions across >>>>>>>>> engines, I don't see how that is practical when those engines are >>>>>>>>> implemented so differently. Plugging in code -- and especially custom >>>>>>>>> user-supplied code -- seems inherently specialized to me and should >>>>>>>>> be part >>>>>>>>> of the engines' design. >>>>>>>>> >>>>>>>>> How is this different from the views? I feel we can say exactly >>>>>>>>> the same thing for Iceberg views, but yet we have Iceberg >>>>>>>>> multi-dialect >>>>>>>>> views implemented. Maybe it sounds like we are trying to draw a line >>>>>>>>> between SQL vs other programming language as "code"? but I think SQL >>>>>>>>> is >>>>>>>>> just another type of code, and we are already talking about compiling >>>>>>>>> all >>>>>>>>> these different code dialects to an intermediate representation (using >>>>>>>>> projects like Coral, Substrait), which will be stored as another type >>>>>>>>> of >>>>>>>>> representation of Iceberg view. I think the same functionality can be >>>>>>>>> used >>>>>>>>> for UDFs if developed. >>>>>>>>> >>>>>>>>> I actually hink adding UDF support is a good idea, even just a >>>>>>>>> multi-dialect one like view, and that can allow engines to for example >>>>>>>>> parse a view SQL, and when a function referenced cannot be resolved, >>>>>>>>> try to >>>>>>>>> seek for a multi-dialect UDF definition. >>>>>>>>> >>>>>>>>> I guess we can discuss more when we have the actual proposal >>>>>>>>> published. >>>>>>>>> >>>>>>>>> Best, >>>>>>>>> Jack Ye >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Tue, May 28, 2024 at 1:32 AM Robert Stupp <sn...@snazy.de> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> UDFs are as engine specific and portable and "non-centralized" as >>>>>>>>>> views are. The same performance concerns apply to views as well. >>>>>>>>>> Iceberg should define a common base upon which engines can build, >>>>>>>>>> so the argument that UDFs aren't practical, because engines are >>>>>>>>>> different, >>>>>>>>>> is probably only a temporary concern. >>>>>>>>>> >>>>>>>>>> In the long term, Iceberg should also try to tackle the idea to >>>>>>>>>> make views portable, which is conceptually not that much different >>>>>>>>>> from >>>>>>>>>> portable UDFs. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> PS: I'm not a fan of adding a negative touch to the idea of >>>>>>>>>> having UDFs in Iceberg, especially not in this early stage. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On 24.05.24 20:53, Ryan Blue wrote: >>>>>>>>>> >>>>>>>>>> Thanks, Ajantha. >>>>>>>>>> >>>>>>>>>> I'm skeptical about whether it's a good idea to add UDFs tracked >>>>>>>>>> by Iceberg catalogs. I think that Iceberg primarily deals with >>>>>>>>>> things that >>>>>>>>>> are centralized, like tables of data. While it would be great to >>>>>>>>>> have a >>>>>>>>>> common set of functions across engines, I don't see how that is >>>>>>>>>> practical >>>>>>>>>> when those engines are implemented so differently. Plugging in code >>>>>>>>>> -- and >>>>>>>>>> especially custom user-supplied code -- seems inherently specialized >>>>>>>>>> to me >>>>>>>>>> and should be part of the engines' design. >>>>>>>>>> >>>>>>>>>> I guess we'll know more when you post the proposal, but I think >>>>>>>>>> this would be a very difficult area to tackle across engines, >>>>>>>>>> languages, >>>>>>>>>> and memory models without having a huge performance penalty. >>>>>>>>>> >>>>>>>>>> Ryan >>>>>>>>>> >>>>>>>>>> On Fri, May 24, 2024 at 8:10 AM Ajantha Bhat < >>>>>>>>>> ajanthab...@gmail.com> wrote: >>>>>>>>>> >>>>>>>>>>> Hi Everyone, >>>>>>>>>>> >>>>>>>>>>> This is a discussion to gauge the community interest in storing >>>>>>>>>>> the Versioned SQL UDFs in Iceberg. >>>>>>>>>>> We want to propose the spec addition for storing the versioned >>>>>>>>>>> UDFs in Iceberg (inspired by view spec). >>>>>>>>>>> >>>>>>>>>>> These UDFs can operate similarly to views in that they are >>>>>>>>>>> associated with tables, but they can accept arguments and produce >>>>>>>>>>> return >>>>>>>>>>> values, or even function as inline expressions. >>>>>>>>>>> Many Query engines like Dremio, Trino, Snowflake, Databricks >>>>>>>>>>> Spark supports SQL UDFs at catalog level [1]. >>>>>>>>>>> But storing them in Iceberg can enable >>>>>>>>>>> - Versioning of these UDFs. >>>>>>>>>>> - Interoperability between the engines. Potentially engines can >>>>>>>>>>> understand the UDFs written by other engines (with the translate >>>>>>>>>>> layer). >>>>>>>>>>> >>>>>>>>>>> We believe that integrating this feature into Iceberg would be a >>>>>>>>>>> valuable addition, and we're eager to collaborate with the >>>>>>>>>>> community to >>>>>>>>>>> develop a UDF specification. >>>>>>>>>>> Stephen <stephen....@dremio.com> has already begun drafting a >>>>>>>>>>> specification to propose to the community. >>>>>>>>>>> >>>>>>>>>>> Let us know your thoughts on this. >>>>>>>>>>> >>>>>>>>>>> [1] >>>>>>>>>>> Dremio - >>>>>>>>>>> https://docs.dremio.com/current/reference/sql/commands/functions#creating-a-function >>>>>>>>>>> Trino - https://trino.io/docs/current/sql/create-function.html >>>>>>>>>>> Snowflake - >>>>>>>>>>> https://docs.snowflake.com/en/developer-guide/udf/sql/udf-sql-scalar-functions >>>>>>>>>>> Databricks - >>>>>>>>>>> https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-sql-function.html >>>>>>>>>>> >>>>>>>>>>> - Ajantha >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Ryan Blue >>>>>>>>>> Tabular >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Robert Stupp >>>>>>>>>> @snazy >>>>>>>>>> >>>>>>>>>>