Piotr, what do you mean by making user-created functions shareable between engines? Do you mean UDFs written in imperative code?
On Wed, Aug 7, 2024 at 12:00 PM Piotr Findeisen <piotr.findei...@gmail.com> wrote: > > Hi, > > Thank you Ajantha for creating this thread. The Iceberg UDFs are an > interesting idea! > Is there a plan to make the user-created functions sharable between the > engines? > If so, how would a CREATE FUNCTION statement look like in e..g Spark or Trino? > > Meanwhile, added a few comments in the doc. > > Best > Piotr > > > On Thu, 1 Aug 2024 at 20:50, Ryan Blue <b...@databricks.com.invalid> wrote: >> >> I just looked through the proposal and added comments. I think it would be >> helpful to also have a design doc that covers the choices from the draft >> spec. For instance, the choice to enumerate all possible function input >> struts rather than allowing generics and varargs. >> >> Here’s a quick summary of my feedback: >> >> I think that the choice to enumerate function signatures is limiting. It >> would be nice to see a discussion of the trade-offs and a rationale for the >> choice. I think it would also be very helpful to have a few representative >> use cases for this included in the doc. That way the proposal can >> demonstrate that it solves those use cases with reasonable trade-offs. >> There are a few instances where this is inconsistent with conventions in >> other specs. For example, using string IDs rather than an integer. >> This uses a very different model for spec versioning than the Iceberg view >> and table specs. It requires readers to fail if there are any unknown >> fields, which prevents the spec from adding things that are fully >> backward-compatible. Other Iceberg specs only require a version change to >> introduce forward-incompatible changes and I think that this should do the >> same to avoid confusion. >> It looks like the intent is to allow multiple function signatures per >> verison, but it is unclear how to encode them because a version is >> associated with a single function signature. >> There is no review of SQL syntax for creating functions across engines, so >> this doesn’t show that the metadata proposed is sufficient for cross-engine >> use cases. >> The example for a table-valued function shows a SELECT statement and it >> isn’t clear how this is distinct from a view >> >> >> On Thu, Aug 1, 2024 at 3:15 AM Ajantha Bhat <ajanthab...@gmail.com> wrote: >>> >>> Thanks Walaa and Robert for the review on this. >>> >>> We didn't find any blocker for the spec. >>> I will wait for a week and If no more review comments, I will raise a PR >>> for spec addition next week. >>> >>> If anyone else is interested, please have a look at the proposal >>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit >>> >>> - Ajantha >>> >>> On Tue, Jul 16, 2024 at 1:27 PM Walaa Eldin Moustafa >>> <wa.moust...@gmail.com> wrote: >>>> >>>> Hi Ajantha, >>>> >>>> I have left some comments. It is an interesting direction, but there might >>>> be some details that need to be fine tuned. >>>> >>>> The doc is here [1] for others who might be interested. Resharing since I >>>> do not think it was directly linked in the thread. >>>> >>>> [1] >>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit >>>> >>>> Thanks, >>>> Walaa. >>>> >>>> On Mon, Jul 15, 2024 at 11:09 PM Ajantha Bhat <ajanthab...@gmail.com> >>>> wrote: >>>>> >>>>> Hi, just another reminder since we didn't get any review on the proposal. >>>>> Initially proposed on June 4. >>>>> >>>>> - Ajantha >>>>> >>>>> On Mon, Jun 24, 2024 at 4:21 PM Ajantha Bhat <ajanthab...@gmail.com> >>>>> wrote: >>>>>> >>>>>> Hi everyone, >>>>>> >>>>>> We've only received one review so far (from Benny). >>>>>> >>>>>> We would appreciate more eyes on this. >>>>>> >>>>>> - Ajantha >>>>>> >>>>>> On Tue, Jun 4, 2024 at 7:25 AM Ajantha Bhat <ajanthab...@gmail.com> >>>>>> wrote: >>>>>>> >>>>>>> Hi All, >>>>>>> Please find the proposal link >>>>>>> https://github.com/apache/iceberg/issues/10432 >>>>>>> >>>>>>> Google doc link is attached in the proposal. >>>>>>> And Thanks Stephen Lin for working on it. >>>>>>> >>>>>>> Hope it gives more clarity to take the decisions and how we want to >>>>>>> implement it. >>>>>>> >>>>>>> - Ajantha >>>>>>> >>>>>>> On Wed, May 29, 2024 at 4:01 AM Walaa Eldin Moustafa >>>>>>> <wa.moust...@gmail.com> wrote: >>>>>>>> >>>>>>>> Thanks Jack. I actually meant scalar/aggregate/table user defined >>>>>>>> functions. Here are some examples of what I meant in (2): >>>>>>>> >>>>>>>> Hive GenericUDF: >>>>>>>> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDF.java >>>>>>>> Trino user defined functions: >>>>>>>> https://trino.io/docs/current/develop/functions.html >>>>>>>> Flink user defined functions: >>>>>>>> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/table/functions/udfs/ >>>>>>>> >>>>>>>> Probably what you referred to is a variation of (1) where the API is >>>>>>>> data flow/data pipeline API instead of SQL (e.g., Spark Scala). Yes, >>>>>>>> that is also possible in the very long run :) >>>>>>>> >>>>>>>> Thanks, >>>>>>>> Walaa. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Tue, May 28, 2024 at 2:57 PM Jack Ye <yezhao...@gmail.com> wrote: >>>>>>>>> >>>>>>>>> > (2) Custom code written in imperative function according to a >>>>>>>>> > Java/Scala/Python API, etc. >>>>>>>>> >>>>>>>>> I think we could still explore some long term opportunities in this >>>>>>>>> case. Consider you register a Spark temp view as some sort of data >>>>>>>>> frame read, then it could still be resolved to a Spark plan that is >>>>>>>>> representable by an intermediate representation. But I agree this >>>>>>>>> gets very complicated very soon, and just having the case (1) covered >>>>>>>>> would already be a huge step forward. >>>>>>>>> >>>>>>>>> -Jack >>>>>>>>> >>>>>>>>> >>>>>>>>> On Tue, May 28, 2024 at 1:40 PM Benny Chow <btc...@gmail.com> wrote: >>>>>>>>>> >>>>>>>>>> It's interesting to note that a tabular SQL UDF can be used to build >>>>>>>>>> a parameterized view. So, there's definitely a lot in common >>>>>>>>>> between UDFs and views. >>>>>>>>>> >>>>>>>>>> Thanks >>>>>>>>>> >>>>>>>>>> On Tue, May 28, 2024 at 9:53 AM Walaa Eldin Moustafa >>>>>>>>>> <wa.moust...@gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>> I think there is a disconnect about what is perceived as a "UDF". >>>>>>>>>>> There are 2 flavors: >>>>>>>>>>> >>>>>>>>>>> (1) Functions that are defined by the user whose definition is a >>>>>>>>>>> composition of other built-in functions/SQL expressions. >>>>>>>>>>> (2) Custom code written in imperative function according to a >>>>>>>>>>> Java/Scala/Python API, etc. >>>>>>>>>>> >>>>>>>>>>> All the examples in Ajantha's references are pretty much from (1) >>>>>>>>>>> and I think those have more analogy to views due to their SQL >>>>>>>>>>> nature. Agree (2) is not practical to maintain by Iceberg, but I >>>>>>>>>>> think Ajantha's use cases are around (1), and may be worth >>>>>>>>>>> evaluating. >>>>>>>>>>> >>>>>>>>>>> Thanks, >>>>>>>>>>> Walaa. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Tue, May 28, 2024 at 9:45 AM Ajantha Bhat >>>>>>>>>>> <ajanthab...@gmail.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> I guess we'll know more when you post the proposal, but I think >>>>>>>>>>>>> this would be a very difficult area to tackle across engines, >>>>>>>>>>>>> languages, and memory models without having a huge performance >>>>>>>>>>>>> penalty. >>>>>>>>>>>> >>>>>>>>>>>> Assuming Iceberg initially supports SQL representations of UDFs >>>>>>>>>>>> (similar to views as shared by the reference links above), the >>>>>>>>>>>> complexity involved will be similar to managing views. >>>>>>>>>>>> >>>>>>>>>>>> Thanks, Ryan, Robert, and Jack, for your input. >>>>>>>>>>>> We will work on publishing the draft spec (inspired by the view >>>>>>>>>>>> spec) this week to facilitate further discussions. >>>>>>>>>>>> >>>>>>>>>>>> - Ajantha >>>>>>>>>>>> >>>>>>>>>>>> On Tue, May 28, 2024 at 7:33 PM Jack Ye <yezhao...@gmail.com> >>>>>>>>>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> > While it would be great to have a common set of functions >>>>>>>>>>>>> > across engines, I don't see how that is practical when those >>>>>>>>>>>>> > engines are implemented so differently. Plugging in code -- and >>>>>>>>>>>>> > especially custom user-supplied code -- seems inherently >>>>>>>>>>>>> > specialized to me and should be part of the engines' design. >>>>>>>>>>>>> >>>>>>>>>>>>> How is this different from the views? I feel we can say exactly >>>>>>>>>>>>> the same thing for Iceberg views, but yet we have Iceberg >>>>>>>>>>>>> multi-dialect views implemented. Maybe it sounds like we are >>>>>>>>>>>>> trying to draw a line between SQL vs other programming language >>>>>>>>>>>>> as "code"? but I think SQL is just another type of code, and we >>>>>>>>>>>>> are already talking about compiling all these different code >>>>>>>>>>>>> dialects to an intermediate representation (using projects like >>>>>>>>>>>>> Coral, Substrait), which will be stored as another type of >>>>>>>>>>>>> representation of Iceberg view. I think the same functionality >>>>>>>>>>>>> can be used for UDFs if developed. >>>>>>>>>>>>> >>>>>>>>>>>>> I actually hink adding UDF support is a good idea, even just a >>>>>>>>>>>>> multi-dialect one like view, and that can allow engines to for >>>>>>>>>>>>> example parse a view SQL, and when a function referenced cannot >>>>>>>>>>>>> be resolved, try to seek for a multi-dialect UDF definition. >>>>>>>>>>>>> >>>>>>>>>>>>> I guess we can discuss more when we have the actual proposal >>>>>>>>>>>>> published. >>>>>>>>>>>>> >>>>>>>>>>>>> Best, >>>>>>>>>>>>> Jack Ye >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Tue, May 28, 2024 at 1:32 AM Robert Stupp <sn...@snazy.de> >>>>>>>>>>>>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> UDFs are as engine specific and portable and "non-centralized" >>>>>>>>>>>>>> as views are. The same performance concerns apply to views as >>>>>>>>>>>>>> well. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Iceberg should define a common base upon which engines can >>>>>>>>>>>>>> build, so the argument that UDFs aren't practical, because >>>>>>>>>>>>>> engines are different, is probably only a temporary concern. >>>>>>>>>>>>>> >>>>>>>>>>>>>> In the long term, Iceberg should also try to tackle the idea to >>>>>>>>>>>>>> make views portable, which is conceptually not that much >>>>>>>>>>>>>> different from portable UDFs. >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> PS: I'm not a fan of adding a negative touch to the idea of >>>>>>>>>>>>>> having UDFs in Iceberg, especially not in this early stage. >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On 24.05.24 20:53, Ryan Blue wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks, Ajantha. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I'm skeptical about whether it's a good idea to add UDFs tracked >>>>>>>>>>>>>> by Iceberg catalogs. I think that Iceberg primarily deals with >>>>>>>>>>>>>> things that are centralized, like tables of data. While it would >>>>>>>>>>>>>> be great to have a common set of functions across engines, I >>>>>>>>>>>>>> don't see how that is practical when those engines are >>>>>>>>>>>>>> implemented so differently. Plugging in code -- and especially >>>>>>>>>>>>>> custom user-supplied code -- seems inherently specialized to me >>>>>>>>>>>>>> and should be part of the engines' design. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I guess we'll know more when you post the proposal, but I think >>>>>>>>>>>>>> this would be a very difficult area to tackle across engines, >>>>>>>>>>>>>> languages, and memory models without having a huge performance >>>>>>>>>>>>>> penalty. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Ryan >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Fri, May 24, 2024 at 8:10 AM Ajantha Bhat >>>>>>>>>>>>>> <ajanthab...@gmail.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi Everyone, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> This is a discussion to gauge the community interest in storing >>>>>>>>>>>>>>> the Versioned SQL UDFs in Iceberg. >>>>>>>>>>>>>>> We want to propose the spec addition for storing the versioned >>>>>>>>>>>>>>> UDFs in Iceberg (inspired by view spec). >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> These UDFs can operate similarly to views in that they are >>>>>>>>>>>>>>> associated with tables, but they can accept arguments and >>>>>>>>>>>>>>> produce return values, or even function as inline expressions. >>>>>>>>>>>>>>> Many Query engines like Dremio, Trino, Snowflake, Databricks >>>>>>>>>>>>>>> Spark supports SQL UDFs at catalog level [1]. >>>>>>>>>>>>>>> But storing them in Iceberg can enable >>>>>>>>>>>>>>> - Versioning of these UDFs. >>>>>>>>>>>>>>> - Interoperability between the engines. Potentially engines can >>>>>>>>>>>>>>> understand the UDFs written by other engines (with the >>>>>>>>>>>>>>> translate layer). >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> We believe that integrating this feature into Iceberg would be >>>>>>>>>>>>>>> a valuable addition, and we're eager to collaborate with the >>>>>>>>>>>>>>> community to develop a UDF specification. >>>>>>>>>>>>>>> Stephen has already begun drafting a specification to propose >>>>>>>>>>>>>>> to the community. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Let us know your thoughts on this. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> [1] >>>>>>>>>>>>>>> Dremio - >>>>>>>>>>>>>>> https://docs.dremio.com/current/reference/sql/commands/functions#creating-a-function >>>>>>>>>>>>>>> Trino - https://trino.io/docs/current/sql/create-function.html >>>>>>>>>>>>>>> Snowflake - >>>>>>>>>>>>>>> https://docs.snowflake.com/en/developer-guide/udf/sql/udf-sql-scalar-functions >>>>>>>>>>>>>>> Databricks - >>>>>>>>>>>>>>> https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-sql-function.html >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - Ajantha >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> -- >>>>>>>>>>>>>> Ryan Blue >>>>>>>>>>>>>> Tabular >>>>>>>>>>>>>> >>>>>>>>>>>>>> -- >>>>>>>>>>>>>> Robert Stupp >>>>>>>>>>>>>> @snazy >> >> >> >> -- >> Ryan Blue >> Databricks