Coming from PyIceberg, I have concerns as this proposal focuses on
SQL-based engines, while Python-based systems often work with data frames.
Adding imperative languages like Python would make this proposal more
inclusive.

Kind regards,
Fokko



Op do 8 aug 2024 om 10:27 schreef Piotr Findeisen <piotr.findei...@gmail.com
>:

> Hi,
>
> Walaa, thanks for asking!
> In the design doc linked before  in this thread [1] i read
> "Without a common standard, the UDFs are hard to share among different
> engines."
> ("Background and Motivation" section).
> I agree with this statement. I don't fully understand yet how the proposed
> design addresses shareability between the engines though.
> I would use some help to understand this better.
>
> Best
> Piotr
>
>
>
> [1] SQL User-Defined Function Spec
> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc
>
> On Wed, 7 Aug 2024 at 21:14, Walaa Eldin Moustafa <wa.moust...@gmail.com>
> wrote:
>
>> Piotr, what do you mean by making user-created functions shareable
>> between engines? Do you mean UDFs written in imperative code?
>>
>> On Wed, Aug 7, 2024 at 12:00 PM Piotr Findeisen
>> <piotr.findei...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > Thank you Ajantha for creating this thread. The Iceberg UDFs are an
>> interesting idea!
>> > Is there a plan to make the user-created functions sharable between the
>> engines?
>> > If so, how would a CREATE FUNCTION statement look like in e..g Spark or
>> Trino?
>> >
>> > Meanwhile, added a few comments in the doc.
>> >
>> > Best
>> > Piotr
>> >
>> >
>> > On Thu, 1 Aug 2024 at 20:50, Ryan Blue <b...@databricks.com.invalid>
>> wrote:
>> >>
>> >> I just looked through the proposal and added comments. I think it
>> would be helpful to also have a design doc that covers the choices from the
>> draft spec. For instance, the choice to enumerate all possible function
>> input struts rather than allowing generics and varargs.
>> >>
>> >> Here’s a quick summary of my feedback:
>> >>
>> >> I think that the choice to enumerate function signatures is limiting.
>> It would be nice to see a discussion of the trade-offs and a rationale for
>> the choice. I think it would also be very helpful to have a few
>> representative use cases for this included in the doc. That way the
>> proposal can demonstrate that it solves those use cases with reasonable
>> trade-offs.
>> >> There are a few instances where this is inconsistent with conventions
>> in other specs. For example, using string IDs rather than an integer.
>> >> This uses a very different model for spec versioning than the Iceberg
>> view and table specs. It requires readers to fail if there are any unknown
>> fields, which prevents the spec from adding things that are fully
>> backward-compatible. Other Iceberg specs only require a version change to
>> introduce forward-incompatible changes and I think that this should do the
>> same to avoid confusion.
>> >> It looks like the intent is to allow multiple function signatures per
>> verison, but it is unclear how to encode them because a version is
>> associated with a single function signature.
>> >> There is no review of SQL syntax for creating functions across
>> engines, so this doesn’t show that the metadata proposed is sufficient for
>> cross-engine use cases.
>> >> The example for a table-valued function shows a SELECT statement and
>> it isn’t clear how this is distinct from a view
>> >>
>> >>
>> >> On Thu, Aug 1, 2024 at 3:15 AM Ajantha Bhat <ajanthab...@gmail.com>
>> wrote:
>> >>>
>> >>> Thanks Walaa and Robert for the review on this.
>> >>>
>> >>> We didn't find any blocker for the spec.
>> >>> I will wait for a week and If no more review comments, I will raise a
>> PR for spec addition next week.
>> >>>
>> >>> If anyone else is interested, please have a look at the proposal
>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit
>> >>>
>> >>> - Ajantha
>> >>>
>> >>> On Tue, Jul 16, 2024 at 1:27 PM Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>> >>>>
>> >>>> Hi Ajantha,
>> >>>>
>> >>>> I have left some comments. It is an interesting direction, but there
>> might be some details that need to be fine tuned.
>> >>>>
>> >>>> The doc is here [1] for others who might be interested. Resharing
>> since I do not think it was directly linked in the thread.
>> >>>>
>> >>>> [1]
>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit
>> >>>>
>> >>>> Thanks,
>> >>>> Walaa.
>> >>>>
>> >>>> On Mon, Jul 15, 2024 at 11:09 PM Ajantha Bhat <ajanthab...@gmail.com>
>> wrote:
>> >>>>>
>> >>>>> Hi, just another reminder since we didn't get any review on the
>> proposal.
>> >>>>> Initially proposed on June 4.
>> >>>>>
>> >>>>> - Ajantha
>> >>>>>
>> >>>>> On Mon, Jun 24, 2024 at 4:21 PM Ajantha Bhat <ajanthab...@gmail.com>
>> wrote:
>> >>>>>>
>> >>>>>> Hi everyone,
>> >>>>>>
>> >>>>>> We've only received one review so far (from Benny).
>> >>>>>>
>> >>>>>> We would appreciate more eyes on this.
>> >>>>>>
>> >>>>>> - Ajantha
>> >>>>>>
>> >>>>>> On Tue, Jun 4, 2024 at 7:25 AM Ajantha Bhat <ajanthab...@gmail.com>
>> wrote:
>> >>>>>>>
>> >>>>>>> Hi All,
>> >>>>>>> Please find the proposal link
>> >>>>>>> https://github.com/apache/iceberg/issues/10432
>> >>>>>>>
>> >>>>>>> Google doc link is attached in the proposal.
>> >>>>>>> And Thanks Stephen Lin for working on it.
>> >>>>>>>
>> >>>>>>> Hope it gives more clarity to take the decisions and how we want
>> to implement it.
>> >>>>>>>
>> >>>>>>> - Ajantha
>> >>>>>>>
>> >>>>>>> On Wed, May 29, 2024 at 4:01 AM Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>> >>>>>>>>
>> >>>>>>>> Thanks Jack. I actually meant scalar/aggregate/table user
>> defined functions. Here are some examples of what I meant in (2):
>> >>>>>>>>
>> >>>>>>>> Hive GenericUDF:
>> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDF.java
>> >>>>>>>> Trino user defined functions:
>> https://trino.io/docs/current/develop/functions.html
>> >>>>>>>> Flink user defined functions:
>> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/table/functions/udfs/
>> >>>>>>>>
>> >>>>>>>> Probably what you referred to is a variation of (1) where the
>> API is data flow/data pipeline API instead of SQL (e.g., Spark Scala). Yes,
>> that is also possible in the very long run :)
>> >>>>>>>>
>> >>>>>>>> Thanks,
>> >>>>>>>> Walaa.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Tue, May 28, 2024 at 2:57 PM Jack Ye <yezhao...@gmail.com>
>> wrote:
>> >>>>>>>>>
>> >>>>>>>>> > (2) Custom code written in imperative function according to a
>> Java/Scala/Python API, etc.
>> >>>>>>>>>
>> >>>>>>>>> I think we could still explore some long term opportunities in
>> this case. Consider you register a Spark temp view as some sort of data
>> frame read, then it could still be resolved to a Spark plan that is
>> representable by an intermediate representation. But I agree this gets very
>> complicated very soon, and just having the case (1) covered would already
>> be a huge step forward.
>> >>>>>>>>>
>> >>>>>>>>> -Jack
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> On Tue, May 28, 2024 at 1:40 PM Benny Chow <btc...@gmail.com>
>> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>> It's interesting to note that a tabular SQL UDF can be used to
>> build a parameterized view.  So, there's definitely a lot in common between
>> UDFs and views.
>> >>>>>>>>>>
>> >>>>>>>>>> Thanks
>> >>>>>>>>>>
>> >>>>>>>>>> On Tue, May 28, 2024 at 9:53 AM Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>> I think there is a disconnect about what is perceived as a
>> "UDF". There are 2 flavors:
>> >>>>>>>>>>>
>> >>>>>>>>>>> (1) Functions that are defined by the user whose definition
>> is a composition of other built-in functions/SQL expressions.
>> >>>>>>>>>>> (2) Custom code written in imperative function according to a
>> Java/Scala/Python API, etc.
>> >>>>>>>>>>>
>> >>>>>>>>>>> All the examples in Ajantha's references are pretty much from
>> (1) and I think those have more analogy to views due to their SQL nature.
>> Agree (2) is not practical to maintain by Iceberg, but I think Ajantha's
>> use cases are around (1), and may be worth evaluating.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Thanks,
>> >>>>>>>>>>> Walaa.
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> On Tue, May 28, 2024 at 9:45 AM Ajantha Bhat <
>> ajanthab...@gmail.com> wrote:
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> I guess we'll know more when you post the proposal, but I
>> think this would be a very difficult area to tackle across engines,
>> languages, and memory models without having a huge performance penalty.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Assuming Iceberg initially supports SQL representations of
>> UDFs (similar to views as shared by the reference links above), the
>> complexity involved will be similar to managing views.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Thanks, Ryan, Robert, and Jack, for your input.
>> >>>>>>>>>>>> We will work on publishing the draft spec (inspired by the
>> view spec) this week to facilitate further discussions.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> - Ajantha
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> On Tue, May 28, 2024 at 7:33 PM Jack Ye <yezhao...@gmail.com>
>> wrote:
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> > While it would be great to have a common set of functions
>> across engines, I don't see how that is practical when those engines are
>> implemented so differently. Plugging in code -- and especially custom
>> user-supplied code -- seems inherently specialized to me and should be part
>> of the engines' design.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> How is this different from the views? I feel we can say
>> exactly the same thing for Iceberg views, but yet we have Iceberg
>> multi-dialect views implemented. Maybe it sounds like we are trying to draw
>> a line between SQL vs other programming language as "code"? but I think SQL
>> is just another type of code, and we are already talking about compiling
>> all these different code dialects to an intermediate representation (using
>> projects like Coral, Substrait), which will be stored as another type of
>> representation of Iceberg view. I think the same functionality can be used
>> for UDFs if developed.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> I actually hink adding UDF support is a good idea, even
>> just a multi-dialect one like view, and that can allow engines to for
>> example parse a view SQL, and when a function referenced cannot be
>> resolved, try to seek for a multi-dialect UDF definition.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> I guess we can discuss more when we have the actual
>> proposal published.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> Best,
>> >>>>>>>>>>>>> Jack Ye
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> On Tue, May 28, 2024 at 1:32 AM Robert Stupp <
>> sn...@snazy.de> wrote:
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> UDFs are as engine specific and portable and
>> "non-centralized" as views are. The same performance concerns apply to
>> views as well.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Iceberg should define a common base upon which engines can
>> build, so the argument that UDFs aren't practical, because engines are
>> different, is probably only a temporary concern.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> In the long term, Iceberg should also try to tackle the
>> idea to make views portable, which is conceptually not that much different
>> from portable UDFs.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> PS: I'm not a fan of adding a negative touch to the idea
>> of having UDFs in Iceberg, especially not in this early stage.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> On 24.05.24 20:53, Ryan Blue wrote:
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Thanks, Ajantha.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> I'm skeptical about whether it's a good idea to add UDFs
>> tracked by Iceberg catalogs. I think that Iceberg primarily deals with
>> things that are centralized, like tables of data. While it would be great
>> to have a common set of functions across engines, I don't see how that is
>> practical when those engines are implemented so differently. Plugging in
>> code -- and especially custom user-supplied code -- seems inherently
>> specialized to me and should be part of the engines' design.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> I guess we'll know more when you post the proposal, but I
>> think this would be a very difficult area to tackle across engines,
>> languages, and memory models without having a huge performance penalty.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Ryan
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> On Fri, May 24, 2024 at 8:10 AM Ajantha Bhat <
>> ajanthab...@gmail.com> wrote:
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Hi Everyone,
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> This is a discussion to gauge the community interest in
>> storing the Versioned SQL UDFs in Iceberg.
>> >>>>>>>>>>>>>>> We want to propose the spec addition for storing the
>> versioned UDFs in Iceberg (inspired by view spec).
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> These UDFs can operate similarly to views in that they
>> are associated with tables, but they can accept arguments and produce
>> return values, or even function as inline expressions.
>> >>>>>>>>>>>>>>> Many Query engines like Dremio, Trino, Snowflake,
>> Databricks Spark supports SQL UDFs at catalog level [1].
>> >>>>>>>>>>>>>>> But storing them in Iceberg can enable
>> >>>>>>>>>>>>>>> - Versioning of these UDFs.
>> >>>>>>>>>>>>>>> - Interoperability between the engines. Potentially
>> engines can understand the UDFs written by other engines (with the
>> translate layer).
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> We believe that integrating this feature into Iceberg
>> would be a valuable addition, and we're eager to collaborate with the
>> community to develop a UDF specification.
>> >>>>>>>>>>>>>>> Stephen has already begun drafting a specification to
>> propose to the community.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Let us know your thoughts on this.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> [1]
>> >>>>>>>>>>>>>>> Dremio -
>> https://docs.dremio.com/current/reference/sql/commands/functions#creating-a-function
>> >>>>>>>>>>>>>>> Trino -
>> https://trino.io/docs/current/sql/create-function.html
>> >>>>>>>>>>>>>>> Snowflake -
>> https://docs.snowflake.com/en/developer-guide/udf/sql/udf-sql-scalar-functions
>> >>>>>>>>>>>>>>> Databricks -
>> https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-sql-function.html
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> - Ajantha
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> --
>> >>>>>>>>>>>>>> Ryan Blue
>> >>>>>>>>>>>>>> Tabular
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> --
>> >>>>>>>>>>>>>> Robert Stupp
>> >>>>>>>>>>>>>> @snazy
>> >>
>> >>
>> >>
>> >> --
>> >> Ryan Blue
>> >> Databricks
>>
>

Reply via email to