Hi Walaa, thanks for asking! In the design doc linked earlier in this thread [1], I read: "Without a common standard, the UDFs are hard to share among different engines." ("Background and Motivation" section). I agree with this statement. However, I don't yet fully understand how the proposed design addresses shareability between engines. I could use some help understanding this better.
Best
Piotr

[1] SQL User-Defined Function Spec
https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc

On Wed, 7 Aug 2024 at 21:14, Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
> Piotr, what do you mean by making user-created functions shareable between engines? Do you mean UDFs written in imperative code?
>
> On Wed, Aug 7, 2024 at 12:00 PM Piotr Findeisen <piotr.findei...@gmail.com> wrote:
> >
> > Hi,
> >
> > Thank you Ajantha for creating this thread. The Iceberg UDFs are an interesting idea!
> > Is there a plan to make the user-created functions shareable between the engines?
> > If so, what would a CREATE FUNCTION statement look like in e.g. Spark or Trino?
> >
> > Meanwhile, added a few comments in the doc.
> >
> > Best
> > Piotr
> >
> > On Thu, 1 Aug 2024 at 20:50, Ryan Blue <b...@databricks.com.invalid> wrote:
> >>
> >> I just looked through the proposal and added comments. I think it would be helpful to also have a design doc that covers the choices from the draft spec. For instance, the choice to enumerate all possible function input structs rather than allowing generics and varargs.
> >>
> >> Here’s a quick summary of my feedback:
> >>
> >> I think that the choice to enumerate function signatures is limiting. It would be nice to see a discussion of the trade-offs and a rationale for the choice. I think it would also be very helpful to have a few representative use cases for this included in the doc. That way the proposal can demonstrate that it solves those use cases with reasonable trade-offs.
> >> There are a few instances where this is inconsistent with conventions in other specs. For example, using string IDs rather than an integer.
> >> This uses a very different model for spec versioning than the Iceberg view and table specs. It requires readers to fail if there are any unknown fields, which prevents the spec from adding things that are fully backward-compatible. Other Iceberg specs only require a version change to introduce forward-incompatible changes and I think that this should do the same to avoid confusion.
> >> It looks like the intent is to allow multiple function signatures per version, but it is unclear how to encode them because a version is associated with a single function signature.
> >> There is no review of SQL syntax for creating functions across engines, so this doesn’t show that the metadata proposed is sufficient for cross-engine use cases.
> >> The example for a table-valued function shows a SELECT statement and it isn’t clear how this is distinct from a view.
> >>
> >> On Thu, Aug 1, 2024 at 3:15 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
> >>>
> >>> Thanks Walaa and Robert for the review on this.
> >>>
> >>> We didn't find any blocker for the spec.
> >>> I will wait for a week and, if there are no more review comments, I will raise a PR for the spec addition next week.
> >>>
> >>> If anyone else is interested, please have a look at the proposal https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit
> >>>
> >>> - Ajantha
> >>>
> >>> On Tue, Jul 16, 2024 at 1:27 PM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
> >>>>
> >>>> Hi Ajantha,
> >>>>
> >>>> I have left some comments. It is an interesting direction, but there might be some details that need to be fine tuned.
> >>>>
> >>>> The doc is here [1] for others who might be interested. Resharing since I do not think it was directly linked in the thread.
> >>>>
> >>>> [1] https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit
> >>>>
> >>>> Thanks,
> >>>> Walaa.
> >>>>
> >>>> On Mon, Jul 15, 2024 at 11:09 PM Ajantha Bhat <ajanthab...@gmail.com> wrote:
> >>>>>
> >>>>> Hi, just another reminder since we didn't get any review on the proposal.
> >>>>> Initially proposed on June 4.
> >>>>>
> >>>>> - Ajantha
> >>>>>
> >>>>> On Mon, Jun 24, 2024 at 4:21 PM Ajantha Bhat <ajanthab...@gmail.com> wrote:
> >>>>>>
> >>>>>> Hi everyone,
> >>>>>>
> >>>>>> We've only received one review so far (from Benny).
> >>>>>>
> >>>>>> We would appreciate more eyes on this.
> >>>>>>
> >>>>>> - Ajantha
> >>>>>>
> >>>>>> On Tue, Jun 4, 2024 at 7:25 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Hi All,
> >>>>>>> Please find the proposal link
> >>>>>>> https://github.com/apache/iceberg/issues/10432
> >>>>>>>
> >>>>>>> Google doc link is attached in the proposal.
> >>>>>>> And Thanks Stephen Lin for working on it.
> >>>>>>>
> >>>>>>> Hope it gives more clarity to take the decisions and how we want to implement it.
> >>>>>>>
> >>>>>>> - Ajantha
> >>>>>>>
> >>>>>>> On Wed, May 29, 2024 at 4:01 AM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> Thanks Jack. I actually meant scalar/aggregate/table user defined functions. Here are some examples of what I meant in (2):
> >>>>>>>>
> >>>>>>>> Hive GenericUDF: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDF.java
> >>>>>>>> Trino user defined functions: https://trino.io/docs/current/develop/functions.html
> >>>>>>>> Flink user defined functions: https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/table/functions/udfs/
> >>>>>>>>
> >>>>>>>> Probably what you referred to is a variation of (1) where the API is data flow/data pipeline API instead of SQL (e.g., Spark Scala). Yes, that is also possible in the very long run :)
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Walaa.
> >>>>>>>>
> >>>>>>>> On Tue, May 28, 2024 at 2:57 PM Jack Ye <yezhao...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>> > (2) Custom code written in imperative function according to a Java/Scala/Python API, etc.
> >>>>>>>>>
> >>>>>>>>> I think we could still explore some long term opportunities in this case. Consider you register a Spark temp view as some sort of data frame read, then it could still be resolved to a Spark plan that is representable by an intermediate representation. But I agree this gets very complicated very soon, and just having the case (1) covered would already be a huge step forward.
> >>>>>>>>>
> >>>>>>>>> -Jack
> >>>>>>>>>
> >>>>>>>>> On Tue, May 28, 2024 at 1:40 PM Benny Chow <btc...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> It's interesting to note that a tabular SQL UDF can be used to build a parameterized view. So, there's definitely a lot in common between UDFs and views.
> >>>>>>>>>>
> >>>>>>>>>> Thanks
> >>>>>>>>>>
> >>>>>>>>>> On Tue, May 28, 2024 at 9:53 AM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> I think there is a disconnect about what is perceived as a "UDF". There are 2 flavors:
> >>>>>>>>>>>
> >>>>>>>>>>> (1) Functions that are defined by the user whose definition is a composition of other built-in functions/SQL expressions.
> >>>>>>>>>>> (2) Custom code written in imperative function according to a Java/Scala/Python API, etc.
> >>>>>>>>>>>
> >>>>>>>>>>> All the examples in Ajantha's references are pretty much from (1) and I think those have more analogy to views due to their SQL nature. Agree (2) is not practical to maintain by Iceberg, but I think Ajantha's use cases are around (1), and may be worth evaluating.
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> Walaa.
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, May 28, 2024 at 9:45 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I guess we'll know more when you post the proposal, but I think this would be a very difficult area to tackle across engines, languages, and memory models without having a huge performance penalty.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Assuming Iceberg initially supports SQL representations of UDFs (similar to views as shared by the reference links above), the complexity involved will be similar to managing views.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks, Ryan, Robert, and Jack, for your input.
> >>>>>>>>>>>> We will work on publishing the draft spec (inspired by the view spec) this week to facilitate further discussions.
> >>>>>>>>>>>>
> >>>>>>>>>>>> - Ajantha
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Tue, May 28, 2024 at 7:33 PM Jack Ye <yezhao...@gmail.com> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> > While it would be great to have a common set of functions across engines, I don't see how that is practical when those engines are implemented so differently. Plugging in code -- and especially custom user-supplied code -- seems inherently specialized to me and should be part of the engines' design.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> How is this different from the views? I feel we can say exactly the same thing for Iceberg views, but yet we have Iceberg multi-dialect views implemented. Maybe it sounds like we are trying to draw a line between SQL vs other programming language as "code"? But I think SQL is just another type of code, and we are already talking about compiling all these different code dialects to an intermediate representation (using projects like Coral, Substrait), which will be stored as another type of representation of Iceberg view. I think the same functionality can be used for UDFs if developed.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I actually think adding UDF support is a good idea, even just a multi-dialect one like view, and that can allow engines to for example parse a view SQL, and when a function referenced cannot be resolved, try to seek for a multi-dialect UDF definition.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I guess we can discuss more when we have the actual proposal published.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Best,
> >>>>>>>>>>>>> Jack Ye
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Tue, May 28, 2024 at 1:32 AM Robert Stupp <sn...@snazy.de> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> UDFs are as engine specific and portable and "non-centralized" as views are. The same performance concerns apply to views as well.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Iceberg should define a common base upon which engines can build, so the argument that UDFs aren't practical, because engines are different, is probably only a temporary concern.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> In the long term, Iceberg should also try to tackle the idea to make views portable, which is conceptually not that much different from portable UDFs.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> PS: I'm not a fan of adding a negative touch to the idea of having UDFs in Iceberg, especially not in this early stage.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 24.05.24 20:53, Ryan Blue wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks, Ajantha.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I'm skeptical about whether it's a good idea to add UDFs tracked by Iceberg catalogs. I think that Iceberg primarily deals with things that are centralized, like tables of data. While it would be great to have a common set of functions across engines, I don't see how that is practical when those engines are implemented so differently. Plugging in code -- and especially custom user-supplied code -- seems inherently specialized to me and should be part of the engines' design.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I guess we'll know more when you post the proposal, but I think this would be a very difficult area to tackle across engines, languages, and memory models without having a huge performance penalty.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Ryan
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Fri, May 24, 2024 at 8:10 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi Everyone,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> This is a discussion to gauge the community interest in storing the Versioned SQL UDFs in Iceberg.
> >>>>>>>>>>>>>>> We want to propose the spec addition for storing the versioned UDFs in Iceberg (inspired by view spec).
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> These UDFs can operate similarly to views in that they are associated with tables, but they can accept arguments and produce return values, or even function as inline expressions.
> >>>>>>>>>>>>>>> Many query engines like Dremio, Trino, Snowflake, and Databricks Spark support SQL UDFs at the catalog level [1].
> >>>>>>>>>>>>>>> But storing them in Iceberg can enable
> >>>>>>>>>>>>>>> - Versioning of these UDFs.
> >>>>>>>>>>>>>>> - Interoperability between the engines. Potentially engines can understand the UDFs written by other engines (with the translate layer).
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> We believe that integrating this feature into Iceberg would be a valuable addition, and we're eager to collaborate with the community to develop a UDF specification.
> >>>>>>>>>>>>>>> Stephen has already begun drafting a specification to propose to the community.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Let us know your thoughts on this.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> [1]
> >>>>>>>>>>>>>>> Dremio - https://docs.dremio.com/current/reference/sql/commands/functions#creating-a-function
> >>>>>>>>>>>>>>> Trino - https://trino.io/docs/current/sql/create-function.html
> >>>>>>>>>>>>>>> Snowflake - https://docs.snowflake.com/en/developer-guide/udf/sql/udf-sql-scalar-functions
> >>>>>>>>>>>>>>> Databricks - https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-sql-function.html
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> - Ajantha
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> --
> >>>>>>>>>>>>>> Ryan Blue
> >>>>>>>>>>>>>> Tabular
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> --
> >>>>>>>>>>>>>> Robert Stupp
> >>>>>>>>>>>>>> @snazy
> >>
> >> --
> >> Ryan Blue
> >> Databricks
>