Re: [Discussion] Versioned SQL UDFs (Catalog routines) in Iceberg

Piotr Findeisen Thu, 08 Aug 2024 01:26:35 -0700

Hi,

Walaa, thanks for asking!
In the design doc linked earlier in this thread [1], I read
"Without a common standard, the UDFs are hard to share among different
engines."
("Background and Motivation" section).
I agree with this statement. I don't fully understand yet how the proposed
design addresses shareability between the engines though.
I could use some help understanding this better.


Best
Piotr



[1] SQL User-Defined Function Spec
https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc

On Wed, 7 Aug 2024 at 21:14, Walaa Eldin Moustafa <[email protected]>
wrote:

> Piotr, what do you mean by making user-created functions shareable
> between engines? Do you mean UDFs written in imperative code?
>
> On Wed, Aug 7, 2024 at 12:00 PM Piotr Findeisen
> <[email protected]> wrote:
> >
> > Hi,
> >
> > Thank you Ajantha for creating this thread. The Iceberg UDFs are an
> interesting idea!
> > Is there a plan to make the user-created functions sharable between the
> engines?
> > If so, what would a CREATE FUNCTION statement look like in, e.g., Spark or
> Trino?
> >
> > Meanwhile, added a few comments in the doc.
> >
> > Best
> > Piotr
> >
> >
> > On Thu, 1 Aug 2024 at 20:50, Ryan Blue <[email protected]>
> wrote:
> >>
> >> I just looked through the proposal and added comments. I think it would
> be helpful to also have a design doc that covers the choices from the draft
> spec. For instance, the choice to enumerate all possible function input
> structs rather than allowing generics and varargs.
> >>
> >> Here’s a quick summary of my feedback:
> >>
> >> I think that the choice to enumerate function signatures is limiting.
> It would be nice to see a discussion of the trade-offs and a rationale for
> the choice. I think it would also be very helpful to have a few
> representative use cases for this included in the doc. That way the
> proposal can demonstrate that it solves those use cases with reasonable
> trade-offs.
> >> There are a few instances where this is inconsistent with conventions
> in other specs. For example, using string IDs rather than an integer.
> >> This uses a very different model for spec versioning than the Iceberg
> view and table specs. It requires readers to fail if there are any unknown
> fields, which prevents the spec from adding things that are fully
> backward-compatible. Other Iceberg specs only require a version change to
> introduce forward-incompatible changes and I think that this should do the
> same to avoid confusion.
> >> It looks like the intent is to allow multiple function signatures per
> version, but it is unclear how to encode them because a version is
> associated with a single function signature.
> >> There is no review of SQL syntax for creating functions across engines,
> so this doesn’t show that the metadata proposed is sufficient for
> cross-engine use cases.
> >> The example for a table-valued function shows a SELECT statement and it
> isn’t clear how this is distinct from a view.
> >>
> >>
> >> On Thu, Aug 1, 2024 at 3:15 AM Ajantha Bhat <[email protected]>
> wrote:
> >>>
> >>> Thanks Walaa and Robert for the review on this.
> >>>
> >>> We didn't find any blocker for the spec.
> >>> I will wait for a week and, if there are no more review comments, I will raise a
> PR for the spec addition next week.
> >>>
> >>> If anyone else is interested, please have a look at the proposal
> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit
> >>>
> >>> - Ajantha
> >>>
> >>> On Tue, Jul 16, 2024 at 1:27 PM Walaa Eldin Moustafa <
> [email protected]> wrote:
> >>>>
> >>>> Hi Ajantha,
> >>>>
> >>>> I have left some comments. It is an interesting direction, but there
> might be some details that need to be fine tuned.
> >>>>
> >>>> The doc is here [1] for others who might be interested. Resharing
> since I do not think it was directly linked in the thread.
> >>>>
> >>>> [1]
> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit
> >>>>
> >>>> Thanks,
> >>>> Walaa.
> >>>>
> >>>> On Mon, Jul 15, 2024 at 11:09 PM Ajantha Bhat <[email protected]>
> wrote:
> >>>>>
> >>>>> Hi, just another reminder since we didn't get any review on the
> proposal.
> >>>>> Initially proposed on June 4.
> >>>>>
> >>>>> - Ajantha
> >>>>>
> >>>>> On Mon, Jun 24, 2024 at 4:21 PM Ajantha Bhat <[email protected]>
> wrote:
> >>>>>>
> >>>>>> Hi everyone,
> >>>>>>
> >>>>>> We've only received one review so far (from Benny).
> >>>>>>
> >>>>>> We would appreciate more eyes on this.
> >>>>>>
> >>>>>> - Ajantha
> >>>>>>
> >>>>>> On Tue, Jun 4, 2024 at 7:25 AM Ajantha Bhat <[email protected]>
> wrote:
> >>>>>>>
> >>>>>>> Hi All,
> >>>>>>> Please find the proposal link
> >>>>>>> https://github.com/apache/iceberg/issues/10432
> >>>>>>>
> >>>>>>> Google doc link is attached in the proposal.
> >>>>>>> And Thanks Stephen Lin for working on it.
> >>>>>>>
> >>>>>>> Hope it gives more clarity for making decisions and on how we want
> to implement it.
> >>>>>>>
> >>>>>>> - Ajantha
> >>>>>>>
> >>>>>>> On Wed, May 29, 2024 at 4:01 AM Walaa Eldin Moustafa <
> [email protected]> wrote:
> >>>>>>>>
> >>>>>>>> Thanks Jack. I actually meant scalar/aggregate/table user defined
> functions. Here are some examples of what I meant in (2):
> >>>>>>>>
> >>>>>>>> Hive GenericUDF:
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDF.java
> >>>>>>>> Trino user defined functions:
> https://trino.io/docs/current/develop/functions.html
> >>>>>>>> Flink user defined functions:
> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/table/functions/udfs/
> >>>>>>>>
> >>>>>>>> Probably what you referred to is a variation of (1) where the API
> is data flow/data pipeline API instead of SQL (e.g., Spark Scala). Yes,
> that is also possible in the very long run :)
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Walaa.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Tue, May 28, 2024 at 2:57 PM Jack Ye <[email protected]>
> wrote:
> >>>>>>>>>
> >>>>>>>>> > (2) Custom code written in imperative function according to a
> Java/Scala/Python API, etc.
> >>>>>>>>>
> >>>>>>>>> I think we could still explore some long term opportunities in
> this case. Consider you register a Spark temp view as some sort of data
> frame read, then it could still be resolved to a Spark plan that is
> representable by an intermediate representation. But I agree this gets very
> complicated very soon, and just having the case (1) covered would already
> be a huge step forward.
> >>>>>>>>>
> >>>>>>>>> -Jack
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Tue, May 28, 2024 at 1:40 PM Benny Chow <[email protected]>
> wrote:
> >>>>>>>>>>
> >>>>>>>>>> It's interesting to note that a tabular SQL UDF can be used to
> build a parameterized view.  So, there's definitely a lot in common between
> UDFs and views.
> >>>>>>>>>>
> >>>>>>>>>> Thanks
> >>>>>>>>>>
> >>>>>>>>>> On Tue, May 28, 2024 at 9:53 AM Walaa Eldin Moustafa <
> [email protected]> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> I think there is a disconnect about what is perceived as a
> "UDF". There are 2 flavors:
> >>>>>>>>>>>
> >>>>>>>>>>> (1) Functions that are defined by the user whose definition is
> a composition of other built-in functions/SQL expressions.
> >>>>>>>>>>> (2) Custom code written in imperative function according to a
> Java/Scala/Python API, etc.
> >>>>>>>>>>>
> >>>>>>>>>>> All the examples in Ajantha's references are pretty much from
> (1) and I think those have more analogy to views due to their SQL nature.
> Agree (2) is not practical to maintain by Iceberg, but I think Ajantha's
> use cases are around (1), and may be worth evaluating.
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> Walaa.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, May 28, 2024 at 9:45 AM Ajantha Bhat <
> [email protected]> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I guess we'll know more when you post the proposal, but I
> think this would be a very difficult area to tackle across engines,
> languages, and memory models without having a huge performance penalty.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Assuming Iceberg initially supports SQL representations of
> UDFs (similar to views as shared by the reference links above), the
> complexity involved will be similar to managing views.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks, Ryan, Robert, and Jack, for your input.
> >>>>>>>>>>>> We will work on publishing the draft spec (inspired by the
> view spec) this week to facilitate further discussions.
> >>>>>>>>>>>>
> >>>>>>>>>>>> - Ajantha
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Tue, May 28, 2024 at 7:33 PM Jack Ye <[email protected]>
> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> > While it would be great to have a common set of functions
> across engines, I don't see how that is practical when those engines are
> implemented so differently. Plugging in code -- and especially custom
> user-supplied code -- seems inherently specialized to me and should be part
> of the engines' design.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> How is this different from the views? I feel we can say
> exactly the same thing for Iceberg views, but yet we have Iceberg
> multi-dialect views implemented. Maybe it sounds like we are trying to draw
> a line between SQL vs other programming language as "code"? but I think SQL
> is just another type of code, and we are already talking about compiling
> all these different code dialects to an intermediate representation (using
> projects like Coral, Substrait), which will be stored as another type of
> representation of Iceberg view. I think the same functionality can be used
> for UDFs if developed.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I actually think adding UDF support is a good idea, even just
> a multi-dialect one like views, and that can allow engines to, for example,
> parse a view's SQL and, when a referenced function cannot be resolved, try to
> look for a multi-dialect UDF definition.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I guess we can discuss more when we have the actual proposal
> published.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Best,
> >>>>>>>>>>>>> Jack Ye
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Tue, May 28, 2024 at 1:32 AM Robert Stupp <[email protected]>
> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> UDFs are as engine specific and portable and
> "non-centralized" as views are. The same performance concerns apply to
> views as well.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Iceberg should define a common base upon which engines can
> build, so the argument that UDFs aren't practical, because engines are
> different, is probably only a temporary concern.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> In the long term, Iceberg should also try to tackle the
> idea to make views portable, which is conceptually not that much different
> from portable UDFs.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> PS: I'm not a fan of adding a negative touch to the idea of
> having UDFs in Iceberg, especially not in this early stage.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 24.05.24 20:53, Ryan Blue wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks, Ajantha.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I'm skeptical about whether it's a good idea to add UDFs
> tracked by Iceberg catalogs. I think that Iceberg primarily deals with
> things that are centralized, like tables of data. While it would be great
> to have a common set of functions across engines, I don't see how that is
> practical when those engines are implemented so differently. Plugging in
> code -- and especially custom user-supplied code -- seems inherently
> specialized to me and should be part of the engines' design.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I guess we'll know more when you post the proposal, but I
> think this would be a very difficult area to tackle across engines,
> languages, and memory models without having a huge performance penalty.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Ryan
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Fri, May 24, 2024 at 8:10 AM Ajantha Bhat <
> [email protected]> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi Everyone,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> This is a discussion to gauge the community interest in
> storing the Versioned SQL UDFs in Iceberg.
> >>>>>>>>>>>>>>> We want to propose the spec addition for storing the
> versioned UDFs in Iceberg (inspired by view spec).
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> These UDFs can operate similarly to views in that they are
> associated with tables, but they can accept arguments and produce return
> values, or even function as inline expressions.
> >>>>>>>>>>>>>>> Many query engines, such as Dremio, Trino, Snowflake, and
> Databricks Spark, support SQL UDFs at the catalog level [1].
> >>>>>>>>>>>>>>> But storing them in Iceberg can enable
> >>>>>>>>>>>>>>> - Versioning of these UDFs.
> >>>>>>>>>>>>>>> - Interoperability between the engines. Potentially,
> engines can understand the UDFs written by other engines (with a
> translation layer).
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> We believe that integrating this feature into Iceberg
> would be a valuable addition, and we're eager to collaborate with the
> community to develop a UDF specification.
> >>>>>>>>>>>>>>> Stephen has already begun drafting a specification to
> propose to the community.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Let us know your thoughts on this.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> [1]
> >>>>>>>>>>>>>>> Dremio -
> https://docs.dremio.com/current/reference/sql/commands/functions#creating-a-function
> >>>>>>>>>>>>>>> Trino -
> https://trino.io/docs/current/sql/create-function.html
> >>>>>>>>>>>>>>> Snowflake -
> https://docs.snowflake.com/en/developer-guide/udf/sql/udf-sql-scalar-functions
> >>>>>>>>>>>>>>> Databricks -
> https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-sql-function.html
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> - Ajantha
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> --
> >>>>>>>>>>>>>> Ryan Blue
> >>>>>>>>>>>>>> Tabular
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> --
> >>>>>>>>>>>>>> Robert Stupp
> >>>>>>>>>>>>>> @snazy
> >>
> >>
> >>
> >> --
> >> Ryan Blue
> >> Databricks
>
