Coming from PyIceberg, I have a concern: this proposal focuses on SQL-based engines, while Python-based systems often work with data frames. Covering imperative languages like Python as well would make this proposal more inclusive.
Kind regards,
Fokko

On Thu, 8 Aug 2024 at 10:27, Piotr Findeisen <piotr.findei...@gmail.com> wrote:

> Hi,
>
> Walaa, thanks for asking!
> In the design doc linked before in this thread [1] I read
> "Without a common standard, the UDFs are hard to share among different engines."
> ("Background and Motivation" section).
> I agree with this statement. I don't fully understand yet how the proposed design addresses shareability between the engines though.
> I could use some help to understand this better.
>
> Best
> Piotr
>
> [1] SQL User-Defined Function Spec
> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc
>
> On Wed, 7 Aug 2024 at 21:14, Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>
>> Piotr, what do you mean by making user-created functions shareable between engines? Do you mean UDFs written in imperative code?
>>
>> On Wed, Aug 7, 2024 at 12:00 PM Piotr Findeisen <piotr.findei...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > Thank you Ajantha for creating this thread. The Iceberg UDFs are an interesting idea!
>> > Is there a plan to make the user-created functions shareable between the engines?
>> > If so, what would a CREATE FUNCTION statement look like in e.g. Spark or Trino?
>> >
>> > Meanwhile, added a few comments in the doc.
>> >
>> > Best
>> > Piotr
>> >
>> > On Thu, 1 Aug 2024 at 20:50, Ryan Blue <b...@databricks.com.invalid> wrote:
>> >>
>> >> I just looked through the proposal and added comments. I think it would be helpful to also have a design doc that covers the choices from the draft spec. For instance, the choice to enumerate all possible function input structs rather than allowing generics and varargs.
>> >>
>> >> Here’s a quick summary of my feedback:
>> >>
>> >> - I think that the choice to enumerate function signatures is limiting. It would be nice to see a discussion of the trade-offs and a rationale for the choice. I think it would also be very helpful to have a few representative use cases for this included in the doc. That way the proposal can demonstrate that it solves those use cases with reasonable trade-offs.
>> >> - There are a few instances where this is inconsistent with conventions in other specs. For example, using string IDs rather than an integer.
>> >> - This uses a very different model for spec versioning than the Iceberg view and table specs. It requires readers to fail if there are any unknown fields, which prevents the spec from adding things that are fully backward-compatible. Other Iceberg specs only require a version change to introduce forward-incompatible changes and I think that this should do the same to avoid confusion.
>> >> - It looks like the intent is to allow multiple function signatures per version, but it is unclear how to encode them because a version is associated with a single function signature.
>> >> - There is no review of SQL syntax for creating functions across engines, so this doesn’t show that the metadata proposed is sufficient for cross-engine use cases.
>> >> - The example for a table-valued function shows a SELECT statement and it isn’t clear how this is distinct from a view.
>> >>
>> >> On Thu, Aug 1, 2024 at 3:15 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>> >>>
>> >>> Thanks Walaa and Robert for the review on this.
>> >>>
>> >>> We didn't find any blocker for the spec.
>> >>> I will wait for a week and, if there are no more review comments, I will raise a PR for the spec addition next week.
>> >>>
>> >>> If anyone else is interested, please have a look at the proposal https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit
>> >>>
>> >>> - Ajantha
>> >>>
>> >>> On Tue, Jul 16, 2024 at 1:27 PM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>> >>>>
>> >>>> Hi Ajantha,
>> >>>>
>> >>>> I have left some comments. It is an interesting direction, but there might be some details that need to be fine-tuned.
>> >>>>
>> >>>> The doc is here [1] for others who might be interested. Resharing since I do not think it was directly linked in the thread.
>> >>>>
>> >>>> [1] https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit
>> >>>>
>> >>>> Thanks,
>> >>>> Walaa.
>> >>>>
>> >>>> On Mon, Jul 15, 2024 at 11:09 PM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>> >>>>>
>> >>>>> Hi, just another reminder since we didn't get any review on the proposal.
>> >>>>> Initially proposed on June 4.
>> >>>>>
>> >>>>> - Ajantha
>> >>>>>
>> >>>>> On Mon, Jun 24, 2024 at 4:21 PM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>> >>>>>>
>> >>>>>> Hi everyone,
>> >>>>>>
>> >>>>>> We've only received one review so far (from Benny).
>> >>>>>>
>> >>>>>> We would appreciate more eyes on this.
>> >>>>>>
>> >>>>>> - Ajantha
>> >>>>>>
>> >>>>>> On Tue, Jun 4, 2024 at 7:25 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>> >>>>>>>
>> >>>>>>> Hi All,
>> >>>>>>> Please find the proposal link
>> >>>>>>> https://github.com/apache/iceberg/issues/10432
>> >>>>>>>
>> >>>>>>> Google doc link is attached in the proposal.
>> >>>>>>> And thanks to Stephen Lin for working on it.
>> >>>>>>>
>> >>>>>>> Hope it gives more clarity for making the decisions and on how we want to implement it.
>> >>>>>>>
>> >>>>>>> - Ajantha
>> >>>>>>>
>> >>>>>>> On Wed, May 29, 2024 at 4:01 AM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>> >>>>>>>>
>> >>>>>>>> Thanks Jack. I actually meant scalar/aggregate/table user defined functions. Here are some examples of what I meant in (2):
>> >>>>>>>>
>> >>>>>>>> Hive GenericUDF: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDF.java
>> >>>>>>>> Trino user defined functions: https://trino.io/docs/current/develop/functions.html
>> >>>>>>>> Flink user defined functions: https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/table/functions/udfs/
>> >>>>>>>>
>> >>>>>>>> Probably what you referred to is a variation of (1) where the API is data flow/data pipeline API instead of SQL (e.g., Spark Scala). Yes, that is also possible in the very long run :)
>> >>>>>>>>
>> >>>>>>>> Thanks,
>> >>>>>>>> Walaa.
>> >>>>>>>>
>> >>>>>>>> On Tue, May 28, 2024 at 2:57 PM Jack Ye <yezhao...@gmail.com> wrote:
>> >>>>>>>>>
>> >>>>>>>>> > (2) Custom code written in imperative function according to a Java/Scala/Python API, etc.
>> >>>>>>>>>
>> >>>>>>>>> I think we could still explore some long term opportunities in this case. Consider you register a Spark temp view as some sort of data frame read, then it could still be resolved to a Spark plan that is representable by an intermediate representation. But I agree this gets very complicated very soon, and just having the case (1) covered would already be a huge step forward.
>> >>>>>>>>>
>> >>>>>>>>> -Jack
>> >>>>>>>>>
>> >>>>>>>>> On Tue, May 28, 2024 at 1:40 PM Benny Chow <btc...@gmail.com> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>> It's interesting to note that a tabular SQL UDF can be used to build a parameterized view. So, there's definitely a lot in common between UDFs and views.
>> >>>>>>>>>>
>> >>>>>>>>>> Thanks
>> >>>>>>>>>>
>> >>>>>>>>>> On Tue, May 28, 2024 at 9:53 AM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>> I think there is a disconnect about what is perceived as a "UDF". There are 2 flavors:
>> >>>>>>>>>>>
>> >>>>>>>>>>> (1) Functions that are defined by the user whose definition is a composition of other built-in functions/SQL expressions.
>> >>>>>>>>>>> (2) Custom code written in imperative function according to a Java/Scala/Python API, etc.
>> >>>>>>>>>>>
>> >>>>>>>>>>> All the examples in Ajantha's references are pretty much from (1) and I think those have more analogy to views due to their SQL nature. Agree (2) is not practical to maintain by Iceberg, but I think Ajantha's use cases are around (1), and may be worth evaluating.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Thanks,
>> >>>>>>>>>>> Walaa.
>> >>>>>>>>>>>
>> >>>>>>>>>>> On Tue, May 28, 2024 at 9:45 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> I guess we'll know more when you post the proposal, but I think this would be a very difficult area to tackle across engines, languages, and memory models without having a huge performance penalty.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Assuming Iceberg initially supports SQL representations of UDFs (similar to views as shared by the reference links above), the complexity involved will be similar to managing views.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Thanks, Ryan, Robert, and Jack, for your input.
>> >>>>>>>>>>>> We will work on publishing the draft spec (inspired by the view spec) this week to facilitate further discussions.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> - Ajantha
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> On Tue, May 28, 2024 at 7:33 PM Jack Ye <yezhao...@gmail.com> wrote:
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> > While it would be great to have a common set of functions across engines, I don't see how that is practical when those engines are implemented so differently. Plugging in code -- and especially custom user-supplied code -- seems inherently specialized to me and should be part of the engines' design.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> How is this different from the views? I feel we can say exactly the same thing for Iceberg views, and yet we have Iceberg multi-dialect views implemented. Maybe it sounds like we are trying to draw a line between SQL vs other programming languages as "code"? But I think SQL is just another type of code, and we are already talking about compiling all these different code dialects to an intermediate representation (using projects like Coral, Substrait), which will be stored as another type of representation of Iceberg view. I think the same functionality can be used for UDFs if developed.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> I actually think adding UDF support is a good idea, even just a multi-dialect one like view, and that can allow engines to, for example, parse a view SQL and, when a referenced function cannot be resolved, try to look for a multi-dialect UDF definition.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> I guess we can discuss more when we have the actual proposal published.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> Best,
>> >>>>>>>>>>>>> Jack Ye
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> On Tue, May 28, 2024 at 1:32 AM Robert Stupp <sn...@snazy.de> wrote:
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> UDFs are as engine specific and portable and "non-centralized" as views are. The same performance concerns apply to views as well.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Iceberg should define a common base upon which engines can build, so the argument that UDFs aren't practical, because engines are different, is probably only a temporary concern.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> In the long term, Iceberg should also try to tackle the idea to make views portable, which is conceptually not that much different from portable UDFs.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> PS: I'm not a fan of adding a negative touch to the idea of having UDFs in Iceberg, especially not in this early stage.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> On 24.05.24 20:53, Ryan Blue wrote:
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Thanks, Ajantha.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> I'm skeptical about whether it's a good idea to add UDFs tracked by Iceberg catalogs. I think that Iceberg primarily deals with things that are centralized, like tables of data. While it would be great to have a common set of functions across engines, I don't see how that is practical when those engines are implemented so differently. Plugging in code -- and especially custom user-supplied code -- seems inherently specialized to me and should be part of the engines' design.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> I guess we'll know more when you post the proposal, but I think this would be a very difficult area to tackle across engines, languages, and memory models without having a huge performance penalty.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Ryan
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> On Fri, May 24, 2024 at 8:10 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Hi Everyone,
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> This is a discussion to gauge the community interest in storing the Versioned SQL UDFs in Iceberg.
>> >>>>>>>>>>>>>>> We want to propose the spec addition for storing the versioned UDFs in Iceberg (inspired by view spec).
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> These UDFs can operate similarly to views in that they are associated with tables, but they can accept arguments and produce return values, or even function as inline expressions.
>> >>>>>>>>>>>>>>> Many query engines like Dremio, Trino, Snowflake, and Databricks Spark support SQL UDFs at catalog level [1].
>> >>>>>>>>>>>>>>> But storing them in Iceberg can enable
>> >>>>>>>>>>>>>>> - Versioning of these UDFs.
>> >>>>>>>>>>>>>>> - Interoperability between the engines. Potentially engines can understand the UDFs written by other engines (with the translate layer).
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> We believe that integrating this feature into Iceberg would be a valuable addition, and we're eager to collaborate with the community to develop a UDF specification.
>> >>>>>>>>>>>>>>> Stephen has already begun drafting a specification to propose to the community.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Let us know your thoughts on this.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> [1]
>> >>>>>>>>>>>>>>> Dremio - https://docs.dremio.com/current/reference/sql/commands/functions#creating-a-function
>> >>>>>>>>>>>>>>> Trino - https://trino.io/docs/current/sql/create-function.html
>> >>>>>>>>>>>>>>> Snowflake - https://docs.snowflake.com/en/developer-guide/udf/sql/udf-sql-scalar-functions
>> >>>>>>>>>>>>>>> Databricks - https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-sql-function.html
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> - Ajantha
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> --
>> >>>>>>>>>>>>>> Ryan Blue
>> >>>>>>>>>>>>>> Tabular
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> --
>> >>>>>>>>>>>>>> Robert Stupp
>> >>>>>>>>>>>>>> @snazy
>> >>
>> >> --
>> >> Ryan Blue
>> >> Databricks
>> >