Hi,

Thank you, Ajantha, for creating this thread. The Iceberg UDFs are an interesting idea! Is there a plan to make the user-created functions sharable between the engines? If so, what would a CREATE FUNCTION statement look like in, e.g., Spark or Trino?
Meanwhile, I added a few comments in the doc.

Best,
Piotr

On Thu, 1 Aug 2024 at 20:50, Ryan Blue <b...@databricks.com.invalid> wrote:

> I just looked through the proposal and added comments. I think it would be helpful to also have a design doc that covers the choices from the draft spec. For instance, the choice to enumerate all possible function input structs rather than allowing generics and varargs.
>
> Here’s a quick summary of my feedback:
>
> - I think that the choice to enumerate function signatures is limiting. It would be nice to see a discussion of the trade-offs and a rationale for the choice. I think it would also be very helpful to have a few representative use cases for this included in the doc. That way the proposal can demonstrate that it solves those use cases with reasonable trade-offs.
> - There are a few instances where this is inconsistent with conventions in other specs. For example, using string IDs rather than an integer.
> - This uses a very different model for spec versioning than the Iceberg view and table specs. It requires readers to fail if there are any unknown fields, which prevents the spec from adding things that are fully backward-compatible. Other Iceberg specs only require a version change to introduce forward-incompatible changes and I think that this should do the same to avoid confusion.
> - It looks like the intent is to allow multiple function signatures per version, but it is unclear how to encode them because a version is associated with a single function signature.
> - There is no review of SQL syntax for creating functions across engines, so this doesn’t show that the metadata proposed is sufficient for cross-engine use cases.
> - The example for a table-valued function shows a SELECT statement and it isn’t clear how this is distinct from a view.
>
> On Thu, Aug 1, 2024 at 3:15 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>
>> Thanks Walaa and Robert for the review on this.
>>
>> We didn't find any blocker for the spec.
>> I will wait for a week and, if there are no more review comments, I will raise a PR for the spec addition next week.
>>
>> If anyone else is interested, please have a look at the proposal:
>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit
>>
>> - Ajantha
>>
>> On Tue, Jul 16, 2024 at 1:27 PM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>>
>>> Hi Ajantha,
>>>
>>> I have left some comments. It is an interesting direction, but there might be some details that need to be fine-tuned.
>>>
>>> The doc is here [1] for others who might be interested. Resharing since I do not think it was directly linked in the thread.
>>>
>>> [1] https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit
>>>
>>> Thanks,
>>> Walaa.
>>>
>>> On Mon, Jul 15, 2024 at 11:09 PM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>>>
>>>> Hi, just another reminder since we didn't get any review on the proposal.
>>>> Initially proposed on June 4.
>>>>
>>>> - Ajantha
>>>>
>>>> On Mon, Jun 24, 2024 at 4:21 PM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> We've only received one review so far (from Benny).
>>>>>
>>>>> We would appreciate more eyes on this.
>>>>>
>>>>> - Ajantha
>>>>>
>>>>> On Tue, Jun 4, 2024 at 7:25 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>>>>>
>>>>>> Hi All,
>>>>>> Please find the proposal link:
>>>>>> https://github.com/apache/iceberg/issues/10432
>>>>>>
>>>>>> The Google doc link is attached in the proposal.
>>>>>> And thanks to Stephen Lin <https://github.com/sxlin> for working on it.
>>>>>>
>>>>>> Hope it gives more clarity for making the decisions and on how we want to implement it.
>>>>>>
>>>>>> - Ajantha
>>>>>>
>>>>>> On Wed, May 29, 2024 at 4:01 AM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks Jack. I actually meant scalar/aggregate/table user defined functions.
>>>>>>> Here are some examples of what I meant in (2):
>>>>>>>
>>>>>>> Hive GenericUDF:
>>>>>>> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDF.java
>>>>>>> Trino user defined functions:
>>>>>>> https://trino.io/docs/current/develop/functions.html
>>>>>>> Flink user defined functions:
>>>>>>> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/table/functions/udfs/
>>>>>>>
>>>>>>> Probably what you referred to is a variation of (1) where the API is data flow/data pipeline API instead of SQL (e.g., Spark Scala). Yes, that is also possible in the very long run :)
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Walaa.
>>>>>>>
>>>>>>> On Tue, May 28, 2024 at 2:57 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>>
>>>>>>>> > (2) Custom code written in imperative function according to a Java/Scala/Python API, etc.
>>>>>>>>
>>>>>>>> I think we could still explore some long term opportunities in this case. Consider you register a Spark temp view as some sort of data frame read, then it could still be resolved to a Spark plan that is representable by an intermediate representation. But I agree this gets very complicated very soon, and just having the case (1) covered would already be a huge step forward.
>>>>>>>>
>>>>>>>> -Jack
>>>>>>>>
>>>>>>>> On Tue, May 28, 2024 at 1:40 PM Benny Chow <btc...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> It's interesting to note that a tabular SQL UDF can be used to build a *parameterized* view. So, there's definitely a lot in common between UDFs and views.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> On Tue, May 28, 2024 at 9:53 AM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I think there is a disconnect about what is perceived as a "UDF".
>>>>>>>>>> There are 2 flavors:
>>>>>>>>>>
>>>>>>>>>> (1) Functions that are defined by the user whose definition is a composition of other built-in functions/SQL expressions.
>>>>>>>>>> (2) Custom code written in imperative function according to a Java/Scala/Python API, etc.
>>>>>>>>>>
>>>>>>>>>> All the examples in Ajantha's references are pretty much from (1) and I think those have more analogy to views due to their SQL nature. Agree (2) is not practical to maintain by Iceberg, but I think Ajantha's use cases are around (1), and may be worth evaluating.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Walaa.
>>>>>>>>>>
>>>>>>>>>> On Tue, May 28, 2024 at 9:45 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>>> I guess we'll know more when you post the proposal, but I think this would be a very difficult area to tackle across engines, languages, and memory models without having a huge performance penalty.
>>>>>>>>>>>
>>>>>>>>>>> Assuming Iceberg initially supports SQL representations of UDFs (similar to views as shared by the reference links above), the complexity involved will be similar to managing views.
>>>>>>>>>>>
>>>>>>>>>>> Thanks, Ryan, Robert, and Jack, for your input.
>>>>>>>>>>> We will work on publishing the draft spec (inspired by the view spec) this week to facilitate further discussions.
>>>>>>>>>>>
>>>>>>>>>>> - Ajantha
>>>>>>>>>>>
>>>>>>>>>>> On Tue, May 28, 2024 at 7:33 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> > While it would be great to have a common set of functions across engines, I don't see how that is practical when those engines are implemented so differently.
>>>>>>>>>>>> > Plugging in code -- and especially custom user-supplied code -- seems inherently specialized to me and should be part of the engines' design.
>>>>>>>>>>>>
>>>>>>>>>>>> How is this different from the views? I feel we can say exactly the same thing for Iceberg views, but yet we have Iceberg multi-dialect views implemented. Maybe it sounds like we are trying to draw a line between SQL and other programming languages as "code"? But I think SQL is just another type of code, and we are already talking about compiling all these different code dialects to an intermediate representation (using projects like Coral, Substrait), which will be stored as another type of representation of an Iceberg view. I think the same functionality can be used for UDFs if developed.
>>>>>>>>>>>>
>>>>>>>>>>>> I actually think adding UDF support is a good idea, even just a multi-dialect one like views. That would allow engines to, for example, parse a view's SQL and, when a referenced function cannot be resolved, look for a multi-dialect UDF definition.
>>>>>>>>>>>>
>>>>>>>>>>>> I guess we can discuss more when we have the actual proposal published.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Jack Ye
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, May 28, 2024 at 1:32 AM Robert Stupp <sn...@snazy.de> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> UDFs are as engine specific and portable and "non-centralized" as views are. The same performance concerns apply to views as well.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Iceberg should define a common base upon which engines can build, so the argument that UDFs aren't practical, because engines are different, is probably only a temporary concern.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In the long term, Iceberg should also try to tackle the idea to make views portable, which is conceptually not that much different from portable UDFs.
>>>>>>>>>>>>>
>>>>>>>>>>>>> PS: I'm not a fan of adding a negative touch to the idea of having UDFs in Iceberg, especially not in this early stage.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 24.05.24 20:53, Ryan Blue wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks, Ajantha.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm skeptical about whether it's a good idea to add UDFs tracked by Iceberg catalogs. I think that Iceberg primarily deals with things that are centralized, like tables of data. While it would be great to have a common set of functions across engines, I don't see how that is practical when those engines are implemented so differently. Plugging in code -- and especially custom user-supplied code -- seems inherently specialized to me and should be part of the engines' design.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I guess we'll know more when you post the proposal, but I think this would be a very difficult area to tackle across engines, languages, and memory models without having a huge performance penalty.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, May 24, 2024 at 8:10 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Everyone,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This is a discussion to gauge the community's interest in storing versioned SQL UDFs in Iceberg.
>>>>>>>>>>>>>> We want to propose a spec addition for storing versioned UDFs in Iceberg (inspired by the view spec).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> These UDFs can operate similarly to views in that they are associated with tables, but they can accept arguments and produce return values, or even function as inline expressions.
>>>>>>>>>>>>>> Many query engines, such as Dremio, Trino, Snowflake, and Databricks Spark, support SQL UDFs at the catalog level [1].
>>>>>>>>>>>>>> But storing them in Iceberg can enable:
>>>>>>>>>>>>>> - Versioning of these UDFs.
>>>>>>>>>>>>>> - Interoperability between the engines. Potentially, engines can understand the UDFs written by other engines (with a translation layer).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We believe that integrating this feature into Iceberg would be a valuable addition, and we're eager to collaborate with the community to develop a UDF specification.
>>>>>>>>>>>>>> Stephen <stephen....@dremio.com> has already begun drafting a specification to propose to the community.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Let us know your thoughts on this.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>> Dremio - https://docs.dremio.com/current/reference/sql/commands/functions#creating-a-function
>>>>>>>>>>>>>> Trino - https://trino.io/docs/current/sql/create-function.html
>>>>>>>>>>>>>> Snowflake - https://docs.snowflake.com/en/developer-guide/udf/sql/udf-sql-scalar-functions
>>>>>>>>>>>>>> Databricks - https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-sql-function.html
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - Ajantha
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>> Tabular
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Robert Stupp
>>>>>>>>>>>>> @snazy
>
> --
> Ryan Blue
> Databricks