I do not think the spec is meant to allow only SQL representations, although it is certainly favouring SQL in examples... It would be nice to add a non-SQL example, indeed.
Cheers, Dmitri. On Thu, Aug 8, 2024 at 9:00 AM Fokko Driesprong <fo...@apache.org> wrote: > Coming from PyIceberg, I have concerns as this proposal focuses on > SQL-based engines, while Python-based systems often work with data frames. > Adding imperative languages like Python would make this proposal more > inclusive. > > Kind regards, > Fokko > > > > On Thu, 8 Aug 2024 at 10:27, Piotr Findeisen < > piotr.findei...@gmail.com> wrote: > >> Hi, >> >> Walaa, thanks for asking! >> In the design doc linked before in this thread [1] I read >> "Without a common standard, the UDFs are hard to share among different >> engines." >> ("Background and Motivation" section). >> I agree with this statement. I don't fully understand yet how the >> proposed design addresses shareability between the engines though. >> I could use some help to understand this better. >> >> Best >> Piotr >> >> >> >> [1] SQL User-Defined Function Spec >> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc >> >> On Wed, 7 Aug 2024 at 21:14, Walaa Eldin Moustafa <wa.moust...@gmail.com> >> wrote: >> >>> Piotr, what do you mean by making user-created functions shareable >>> between engines? Do you mean UDFs written in imperative code? >>> >>> On Wed, Aug 7, 2024 at 12:00 PM Piotr Findeisen >>> <piotr.findei...@gmail.com> wrote: >>> > >>> > Hi, >>> > >>> > Thank you Ajantha for creating this thread. The Iceberg UDFs are an >>> interesting idea! >>> > Is there a plan to make the user-created functions shareable between >>> the engines? >>> > If so, what would a CREATE FUNCTION statement look like in e.g. Spark >>> or Trino? >>> > >>> > Meanwhile, added a few comments in the doc. >>> > >>> > Best >>> > Piotr >>> > >>> > >>> > On Thu, 1 Aug 2024 at 20:50, Ryan Blue <b...@databricks.com.invalid> >>> wrote: >>> >> >>> >> I just looked through the proposal and added comments. I think it >>> would be helpful to also have a design doc that covers the choices from the >>> draft spec. 
For instance, the choice to enumerate all possible function >>> input structs rather than allowing generics and varargs. >>> >> >>> >> Here’s a quick summary of my feedback: >>> >> >>> >> I think that the choice to enumerate function signatures is limiting. >>> It would be nice to see a discussion of the trade-offs and a rationale for >>> the choice. I think it would also be very helpful to have a few >>> representative use cases for this included in the doc. That way the >>> proposal can demonstrate that it solves those use cases with reasonable >>> trade-offs. >>> >> There are a few instances where this is inconsistent with conventions >>> in other specs. For example, using string IDs rather than an integer. >>> >> This uses a very different model for spec versioning than the Iceberg >>> view and table specs. It requires readers to fail if there are any unknown >>> fields, which prevents the spec from adding things that are fully >>> backward-compatible. Other Iceberg specs only require a version change to >>> introduce forward-incompatible changes and I think that this should do the >>> same to avoid confusion. >>> >> It looks like the intent is to allow multiple function signatures per >>> version, but it is unclear how to encode them because a version is >>> associated with a single function signature. >>> >> There is no review of SQL syntax for creating functions across >>> engines, so this doesn’t show that the metadata proposed is sufficient for >>> cross-engine use cases. >>> >> The example for a table-valued function shows a SELECT statement and >>> it isn’t clear how this is distinct from a view. >>> >> >>> >> >>> >> On Thu, Aug 1, 2024 at 3:15 AM Ajantha Bhat <ajanthab...@gmail.com> >>> wrote: >>> >>> >>> >>> Thanks Walaa and Robert for the review on this. >>> >>> >>> >>> We didn't find any blocker for the spec. >>> >>> I will wait for a week and if there are no more review comments, I will raise >>> a PR for spec addition next week. 
>>> >>> >>> >>> If anyone else is interested, please have a look at the proposal >>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit >>> >>> >>> >>> - Ajantha >>> >>> >>> >>> On Tue, Jul 16, 2024 at 1:27 PM Walaa Eldin Moustafa < >>> wa.moust...@gmail.com> wrote: >>> >>>> >>> >>>> Hi Ajantha, >>> >>>> >>> >>>> I have left some comments. It is an interesting direction, but >>> there might be some details that need to be fine tuned. >>> >>>> >>> >>>> The doc is here [1] for others who might be interested. Resharing >>> since I do not think it was directly linked in the thread. >>> >>>> >>> >>>> [1] >>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit >>> >>>> >>> >>>> Thanks, >>> >>>> Walaa. >>> >>>> >>> >>>> On Mon, Jul 15, 2024 at 11:09 PM Ajantha Bhat < >>> ajanthab...@gmail.com> wrote: >>> >>>>> >>> >>>>> Hi, just another reminder since we didn't get any review on the >>> proposal. >>> >>>>> Initially proposed on June 4. >>> >>>>> >>> >>>>> - Ajantha >>> >>>>> >>> >>>>> On Mon, Jun 24, 2024 at 4:21 PM Ajantha Bhat < >>> ajanthab...@gmail.com> wrote: >>> >>>>>> >>> >>>>>> Hi everyone, >>> >>>>>> >>> >>>>>> We've only received one review so far (from Benny). >>> >>>>>> >>> >>>>>> We would appreciate more eyes on this. >>> >>>>>> >>> >>>>>> - Ajantha >>> >>>>>> >>> >>>>>> On Tue, Jun 4, 2024 at 7:25 AM Ajantha Bhat < >>> ajanthab...@gmail.com> wrote: >>> >>>>>>> >>> >>>>>>> Hi All, >>> >>>>>>> Please find the proposal link >>> >>>>>>> https://github.com/apache/iceberg/issues/10432 >>> >>>>>>> >>> >>>>>>> Google doc link is attached in the proposal. >>> >>>>>>> And Thanks Stephen Lin for working on it. >>> >>>>>>> >>> >>>>>>> Hope it gives more clarity to take the decisions and how we want >>> to implement it. >>> >>>>>>> >>> >>>>>>> - Ajantha >>> >>>>>>> >>> >>>>>>> On Wed, May 29, 2024 at 4:01 AM Walaa Eldin Moustafa < >>> wa.moust...@gmail.com> wrote: >>> >>>>>>>> >>> >>>>>>>> Thanks Jack. 
I actually meant scalar/aggregate/table user >>> defined functions. Here are some examples of what I meant in (2): >>> >>>>>>>> >>> >>>>>>>> Hive GenericUDF: >>> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDF.java >>> >>>>>>>> Trino user defined functions: >>> https://trino.io/docs/current/develop/functions.html >>> >>>>>>>> Flink user defined functions: >>> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/table/functions/udfs/ >>> >>>>>>>> >>> >>>>>>>> Probably what you referred to is a variation of (1) where the >>> API is data flow/data pipeline API instead of SQL (e.g., Spark Scala). Yes, >>> that is also possible in the very long run :) >>> >>>>>>>> >>> >>>>>>>> Thanks, >>> >>>>>>>> Walaa. >>> >>>>>>>> >>> >>>>>>>> >>> >>>>>>>> >>> >>>>>>>> >>> >>>>>>>> On Tue, May 28, 2024 at 2:57 PM Jack Ye <yezhao...@gmail.com> >>> wrote: >>> >>>>>>>>> >>> >>>>>>>>> > (2) Custom code written in imperative function according to >>> a Java/Scala/Python API, etc. >>> >>>>>>>>> >>> >>>>>>>>> I think we could still explore some long term opportunities in >>> this case. Consider you register a Spark temp view as some sort of data >>> frame read, then it could still be resolved to a Spark plan that is >>> representable by an intermediate representation. But I agree this gets very >>> complicated very soon, and just having the case (1) covered would already >>> be a huge step forward. >>> >>>>>>>>> >>> >>>>>>>>> -Jack >>> >>>>>>>>> >>> >>>>>>>>> >>> >>>>>>>>> On Tue, May 28, 2024 at 1:40 PM Benny Chow <btc...@gmail.com> >>> wrote: >>> >>>>>>>>>> >>> >>>>>>>>>> It's interesting to note that a tabular SQL UDF can be used >>> to build a parameterized view. So, there's definitely a lot in common >>> between UDFs and views. 
>>> >>>>>>>>>> >>> >>>>>>>>>> Thanks >>> >>>>>>>>>> >>> >>>>>>>>>> On Tue, May 28, 2024 at 9:53 AM Walaa Eldin Moustafa < >>> wa.moust...@gmail.com> wrote: >>> >>>>>>>>>>> >>> >>>>>>>>>>> I think there is a disconnect about what is perceived as a >>> "UDF". There are 2 flavors: >>> >>>>>>>>>>> >>> >>>>>>>>>>> (1) Functions that are defined by the user whose definition >>> is a composition of other built-in functions/SQL expressions. >>> >>>>>>>>>>> (2) Custom code written in imperative function according to >>> a Java/Scala/Python API, etc. >>> >>>>>>>>>>> >>> >>>>>>>>>>> All the examples in Ajantha's references are pretty much >>> from (1) and I think those have more analogy to views due to their SQL >>> nature. Agree (2) is not practical to maintain by Iceberg, but I think >>> Ajantha's use cases are around (1), and may be worth evaluating. >>> >>>>>>>>>>> >>> >>>>>>>>>>> Thanks, >>> >>>>>>>>>>> Walaa. >>> >>>>>>>>>>> >>> >>>>>>>>>>> >>> >>>>>>>>>>> On Tue, May 28, 2024 at 9:45 AM Ajantha Bhat < >>> ajanthab...@gmail.com> wrote: >>> >>>>>>>>>>>>> >>> >>>>>>>>>>>>> I guess we'll know more when you post the proposal, but I >>> think this would be a very difficult area to tackle across engines, >>> languages, and memory models without having a huge performance penalty. >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> Assuming Iceberg initially supports SQL representations of >>> UDFs (similar to views as shared by the reference links above), the >>> complexity involved will be similar to managing views. >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> Thanks, Ryan, Robert, and Jack, for your input. >>> >>>>>>>>>>>> We will work on publishing the draft spec (inspired by the >>> view spec) this week to facilitate further discussions. 
>>> >>>>>>>>>>>> >>> >>>>>>>>>>>> - Ajantha >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> On Tue, May 28, 2024 at 7:33 PM Jack Ye < >>> yezhao...@gmail.com> wrote: >>> >>>>>>>>>>>>> >>> >>>>>>>>>>>>> > While it would be great to have a common set of >>> functions across engines, I don't see how that is practical when those >>> engines are implemented so differently. Plugging in code -- and especially >>> custom user-supplied code -- seems inherently specialized to me and should >>> be part of the engines' design. >>> >>>>>>>>>>>>> >>> >>>>>>>>>>>>> How is this different from the views? I feel we can say >>> exactly the same thing for Iceberg views, but yet we have Iceberg >>> multi-dialect views implemented. Maybe it sounds like we are trying to draw >>> a line between SQL vs other programming languages as "code"? But I think SQL >>> is just another type of code, and we are already talking about compiling >>> all these different code dialects to an intermediate representation (using >>> projects like Coral, Substrait), which will be stored as another type of >>> representation of Iceberg view. I think the same functionality can be used >>> for UDFs if developed. >>> >>>>>>>>>>>>> >>> >>>>>>>>>>>>> I actually think adding UDF support is a good idea, even >>> just a multi-dialect one like view, and that can allow engines to for >>> example parse a view SQL, and when a function referenced cannot be >>> resolved, try to look for a multi-dialect UDF definition. >>> >>>>>>>>>>>>> >>> >>>>>>>>>>>>> I guess we can discuss more when we have the actual >>> proposal published. >>> >>>>>>>>>>>>> >>> >>>>>>>>>>>>> Best, >>> >>>>>>>>>>>>> Jack Ye >>> >>>>>>>>>>>>> >>> >>>>>>>>>>>>> >>> >>>>>>>>>>>>> >>> >>>>>>>>>>>>> >>> >>>>>>>>>>>>> On Tue, May 28, 2024 at 1:32 AM Robert Stupp < >>> sn...@snazy.de> wrote: >>> >>>>>>>>>>>>>> >>> >>>>>>>>>>>>>> UDFs are as engine specific and portable and >>> "non-centralized" as views are. The same performance concerns apply to >>> views as well. 
>>> >>>>>>>>>>>>>> >>> >>>>>>>>>>>>>> Iceberg should define a common base upon which engines >>> can build, so the argument that UDFs aren't practical, because engines are >>> different, is probably only a temporary concern. >>> >>>>>>>>>>>>>> >>> >>>>>>>>>>>>>> In the long term, Iceberg should also try to tackle the >>> idea to make views portable, which is conceptually not that much different >>> from portable UDFs. >>> >>>>>>>>>>>>>> >>> >>>>>>>>>>>>>> >>> >>>>>>>>>>>>>> PS: I'm not a fan of adding a negative touch to the idea >>> of having UDFs in Iceberg, especially not in this early stage. >>> >>>>>>>>>>>>>> >>> >>>>>>>>>>>>>> >>> >>>>>>>>>>>>>> On 24.05.24 20:53, Ryan Blue wrote: >>> >>>>>>>>>>>>>> >>> >>>>>>>>>>>>>> Thanks, Ajantha. >>> >>>>>>>>>>>>>> >>> >>>>>>>>>>>>>> I'm skeptical about whether it's a good idea to add UDFs >>> tracked by Iceberg catalogs. I think that Iceberg primarily deals with >>> things that are centralized, like tables of data. While it would be great >>> to have a common set of functions across engines, I don't see how that is >>> practical when those engines are implemented so differently. Plugging in >>> code -- and especially custom user-supplied code -- seems inherently >>> specialized to me and should be part of the engines' design. >>> >>>>>>>>>>>>>> >>> >>>>>>>>>>>>>> I guess we'll know more when you post the proposal, but I >>> think this would be a very difficult area to tackle across engines, >>> languages, and memory models without having a huge performance penalty. >>> >>>>>>>>>>>>>> >>> >>>>>>>>>>>>>> Ryan >>> >>>>>>>>>>>>>> >>> >>>>>>>>>>>>>> On Fri, May 24, 2024 at 8:10 AM Ajantha Bhat < >>> ajanthab...@gmail.com> wrote: >>> >>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>> Hi Everyone, >>> >>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>> This is a discussion to gauge the community interest in >>> storing the Versioned SQL UDFs in Iceberg. 
>>> >>>>>>>>>>>>>>> We want to propose the spec addition for storing the >>> versioned UDFs in Iceberg (inspired by the view spec). >>> >>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>> These UDFs can operate similarly to views in that they >>> are associated with tables, but they can accept arguments and produce >>> return values, or even function as inline expressions. >>> >>>>>>>>>>>>>>> Many query engines, like Dremio, Trino, Snowflake, and >>> Databricks Spark, support SQL UDFs at the catalog level [1]. >>> >>>>>>>>>>>>>>> But storing them in Iceberg can enable >>> >>>>>>>>>>>>>>> - Versioning of these UDFs. >>> >>>>>>>>>>>>>>> - Interoperability between the engines. Potentially >>> engines can understand the UDFs written by other engines (with a >>> translation layer). >>> >>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>> We believe that integrating this feature into Iceberg >>> would be a valuable addition, and we're eager to collaborate with the >>> community to develop a UDF specification. >>> >>>>>>>>>>>>>>> Stephen has already begun drafting a specification to >>> propose to the community. >>> >>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>> Let us know your thoughts on this. 
>>> >>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>> [1] >>> >>>>>>>>>>>>>>> Dremio - >>> https://docs.dremio.com/current/reference/sql/commands/functions#creating-a-function >>> >>>>>>>>>>>>>>> Trino - >>> https://trino.io/docs/current/sql/create-function.html >>> >>>>>>>>>>>>>>> Snowflake - >>> https://docs.snowflake.com/en/developer-guide/udf/sql/udf-sql-scalar-functions >>> >>>>>>>>>>>>>>> Databricks - >>> https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-sql-function.html >>> >>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>> - Ajantha >>> >>>>>>>>>>>>>> >>> >>>>>>>>>>>>>> >>> >>>>>>>>>>>>>> >>> >>>>>>>>>>>>>> -- >>> >>>>>>>>>>>>>> Ryan Blue >>> >>>>>>>>>>>>>> Tabular >>> >>>>>>>>>>>>>> >>> >>>>>>>>>>>>>> -- >>> >>>>>>>>>>>>>> Robert Stupp >>> >>>>>>>>>>>>>> @snazy >>> >> >>> >> >>> >> >>> >> -- >>> >> Ryan Blue >>> >> Databricks >>> >>