Hi, just another reminder, since we haven't received any reviews on the proposal. It was initially proposed on June 4.
- Ajantha

On Mon, Jun 24, 2024 at 4:21 PM Ajantha Bhat <ajanthab...@gmail.com> wrote:

> Hi everyone,
>
> We've only received one review so far (from Benny).
>
> We would appreciate more eyes on this.
>
> - Ajantha
>
> On Tue, Jun 4, 2024 at 7:25 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>
>> Hi All,
>> Please find the proposal link
>> https://github.com/apache/iceberg/issues/10432
>>
>> Google doc link is attached in the proposal.
>> And Thanks Stephen Lin <https://github.com/sxlin> for working on it.
>>
>> Hope it gives more clarity to take the decisions and how we want to
>> implement it.
>>
>> - Ajantha
>>
>> On Wed, May 29, 2024 at 4:01 AM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>>
>>> Thanks Jack. I actually meant scalar/aggregate/table user defined
>>> functions. Here are some examples of what I meant in (2):
>>>
>>> Hive GenericUDF:
>>> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDF.java
>>> Trino user defined functions:
>>> https://trino.io/docs/current/develop/functions.html
>>> Flink user defined functions:
>>> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/table/functions/udfs/
>>>
>>> Probably what you referred to is a variation of (1) where the API is
>>> data flow/data pipeline API instead of SQL (e.g., Spark Scala). Yes, that
>>> is also possible in the very long run :)
>>>
>>> Thanks,
>>> Walaa.
>>>
>>> On Tue, May 28, 2024 at 2:57 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>>> > (2) Custom code written in imperative function according to a
>>>> Java/Scala/Python API, etc.
>>>>
>>>> I think we could still explore some long term opportunities in this
>>>> case. Consider you register a Spark temp view as some sort of data frame
>>>> read, then it could still be resolved to a Spark plan that is representable
>>>> by an intermediate representation.
>>>> But I agree this gets very complicated
>>>> very soon, and just having the case (1) covered would already be a huge
>>>> step forward.
>>>>
>>>> -Jack
>>>>
>>>> On Tue, May 28, 2024 at 1:40 PM Benny Chow <btc...@gmail.com> wrote:
>>>>
>>>>> It's interesting to note that a tabular SQL UDF can be used to build a
>>>>> *parameterized* view. So, there's definitely a lot in common between
>>>>> UDFs and views.
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Tue, May 28, 2024 at 9:53 AM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>>>>>
>>>>>> I think there is a disconnect about what is perceived as a "UDF".
>>>>>> There are 2 flavors:
>>>>>>
>>>>>> (1) Functions that are defined by the user whose definition is a
>>>>>> composition of other built-in functions/SQL expressions.
>>>>>> (2) Custom code written in imperative function according to a
>>>>>> Java/Scala/Python API, etc.
>>>>>>
>>>>>> All the examples in Ajantha's references are pretty much from (1) and
>>>>>> I think those have more analogy to views due to their SQL nature. Agree
>>>>>> (2) is not practical to maintain by Iceberg, but I think Ajantha's use
>>>>>> cases are around (1), and may be worth evaluating.
>>>>>>
>>>>>> Thanks,
>>>>>> Walaa.
>>>>>>
>>>>>> On Tue, May 28, 2024 at 9:45 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>>>>>>
>>>>>>>> I guess we'll know more when you post the proposal, but I think this
>>>>>>>> would be a very difficult area to tackle across engines, languages, and
>>>>>>>> memory models without having a huge performance penalty.
>>>>>>>
>>>>>>> Assuming Iceberg initially supports SQL representations of UDFs
>>>>>>> (similar to views as shared by the reference links above), the
>>>>>>> complexity involved will be similar to managing views.
>>>>>>>
>>>>>>> Thanks, Ryan, Robert, and Jack, for your input.
>>>>>>> We will work on publishing the draft spec (inspired by the view
>>>>>>> spec) this week to facilitate further discussions.
>>>>>>>
>>>>>>> - Ajantha
>>>>>>>
>>>>>>> On Tue, May 28, 2024 at 7:33 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>>
>>>>>>>> > While it would be great to have a common set of functions across
>>>>>>>> engines, I don't see how that is practical when those engines are
>>>>>>>> implemented so differently. Plugging in code -- and especially custom
>>>>>>>> user-supplied code -- seems inherently specialized to me and should be
>>>>>>>> part of the engines' design.
>>>>>>>>
>>>>>>>> How is this different from the views? I feel we can say exactly the
>>>>>>>> same thing for Iceberg views, but yet we have Iceberg multi-dialect
>>>>>>>> views implemented. Maybe it sounds like we are trying to draw a line
>>>>>>>> between SQL vs other programming language as "code"? but I think SQL is
>>>>>>>> just another type of code, and we are already talking about compiling
>>>>>>>> all these different code dialects to an intermediate representation
>>>>>>>> (using projects like Coral, Substrait), which will be stored as another
>>>>>>>> type of representation of Iceberg view. I think the same functionality
>>>>>>>> can be used for UDFs if developed.
>>>>>>>>
>>>>>>>> I actually think adding UDF support is a good idea, even just a
>>>>>>>> multi-dialect one like view, and that can allow engines to for example
>>>>>>>> parse a view SQL, and when a function referenced cannot be resolved,
>>>>>>>> try to seek for a multi-dialect UDF definition.
>>>>>>>>
>>>>>>>> I guess we can discuss more when we have the actual proposal
>>>>>>>> published.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Jack Ye
>>>>>>>>
>>>>>>>> On Tue, May 28, 2024 at 1:32 AM Robert Stupp <sn...@snazy.de> wrote:
>>>>>>>>
>>>>>>>>> UDFs are as engine specific and portable and "non-centralized" as
>>>>>>>>> views are. The same performance concerns apply to views as well.
>>>>>>>>> Iceberg should define a common base upon which engines can build,
>>>>>>>>> so the argument that UDFs aren't practical, because engines are
>>>>>>>>> different, is probably only a temporary concern.
>>>>>>>>>
>>>>>>>>> In the long term, Iceberg should also try to tackle the idea to
>>>>>>>>> make views portable, which is conceptually not that much different
>>>>>>>>> from portable UDFs.
>>>>>>>>>
>>>>>>>>> PS: I'm not a fan of adding a negative touch to the idea of having
>>>>>>>>> UDFs in Iceberg, especially not in this early stage.
>>>>>>>>>
>>>>>>>>> On 24.05.24 20:53, Ryan Blue wrote:
>>>>>>>>>
>>>>>>>>> Thanks, Ajantha.
>>>>>>>>>
>>>>>>>>> I'm skeptical about whether it's a good idea to add UDFs tracked
>>>>>>>>> by Iceberg catalogs. I think that Iceberg primarily deals with things
>>>>>>>>> that are centralized, like tables of data. While it would be great to
>>>>>>>>> have a common set of functions across engines, I don't see how that is
>>>>>>>>> practical when those engines are implemented so differently. Plugging
>>>>>>>>> in code -- and especially custom user-supplied code -- seems inherently
>>>>>>>>> specialized to me and should be part of the engines' design.
>>>>>>>>>
>>>>>>>>> I guess we'll know more when you post the proposal, but I think
>>>>>>>>> this would be a very difficult area to tackle across engines,
>>>>>>>>> languages, and memory models without having a huge performance penalty.
>>>>>>>>>
>>>>>>>>> Ryan
>>>>>>>>>
>>>>>>>>> On Fri, May 24, 2024 at 8:10 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Everyone,
>>>>>>>>>>
>>>>>>>>>> This is a discussion to gauge the community interest in storing
>>>>>>>>>> the Versioned SQL UDFs in Iceberg.
>>>>>>>>>> We want to propose the spec addition for storing the versioned
>>>>>>>>>> UDFs in Iceberg (inspired by view spec).
>>>>>>>>>>
>>>>>>>>>> These UDFs can operate similarly to views in that they are
>>>>>>>>>> associated with tables, but they can accept arguments and produce
>>>>>>>>>> return values, or even function as inline expressions.
>>>>>>>>>> Many query engines, like Dremio, Trino, Snowflake, and Databricks
>>>>>>>>>> Spark, support SQL UDFs at the catalog level [1].
>>>>>>>>>> But storing them in Iceberg can enable
>>>>>>>>>> - Versioning of these UDFs.
>>>>>>>>>> - Interoperability between the engines. Potentially engines can
>>>>>>>>>> understand the UDFs written by other engines (with the translate
>>>>>>>>>> layer).
>>>>>>>>>>
>>>>>>>>>> We believe that integrating this feature into Iceberg would be a
>>>>>>>>>> valuable addition, and we're eager to collaborate with the community
>>>>>>>>>> to develop a UDF specification.
>>>>>>>>>> Stephen <stephen....@dremio.com> has already begun drafting a
>>>>>>>>>> specification to propose to the community.
>>>>>>>>>>
>>>>>>>>>> Let us know your thoughts on this.
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>> Dremio -
>>>>>>>>>> https://docs.dremio.com/current/reference/sql/commands/functions#creating-a-function
>>>>>>>>>> Trino - https://trino.io/docs/current/sql/create-function.html
>>>>>>>>>> Snowflake -
>>>>>>>>>> https://docs.snowflake.com/en/developer-guide/udf/sql/udf-sql-scalar-functions
>>>>>>>>>> Databricks -
>>>>>>>>>> https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-sql-function.html
>>>>>>>>>>
>>>>>>>>>> - Ajantha
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Tabular
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Robert Stupp
>>>>>>>>> @snazy
>>>>>>>>>