Re: [Discussion] Versioned SQL UDFs (Catalog routines) in Iceberg

Dmitri Bourlatchkov Thu, 08 Aug 2024 12:44:27 -0700

I do not think the spec is meant to allow only SQL representations,
although it is certainly favouring SQL in examples... It would be nice to
add a non-SQL example, indeed.


Cheers,
Dmitri.

On Thu, Aug 8, 2024 at 9:00 AM Fokko Driesprong <fo...@apache.org> wrote:

> Coming from PyIceberg, I have concerns as this proposal focuses on
> SQL-based engines, while Python-based systems often work with data frames.
> Adding imperative languages like Python would make this proposal more
> inclusive.
>
> Kind regards,
> Fokko
>
>
>
> On Thu, 8 Aug 2024 at 10:27, Piotr Findeisen <
> piotr.findei...@gmail.com> wrote:
>
>> Hi,
>>
>> Walaa, thanks for asking!
>> In the design doc linked earlier in this thread [1], I read
>> "Without a common standard, the UDFs are hard to share among different
>> engines."
>> ("Background and Motivation" section).
>> I agree with this statement. I don't fully understand yet how the
>> proposed design addresses shareability between the engines though.
>> I could use some help to understand this better.
>>
>> Best
>> Piotr
>>
>>
>>
>> [1] SQL User-Defined Function Spec
>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc
>>
>> On Wed, 7 Aug 2024 at 21:14, Walaa Eldin Moustafa <wa.moust...@gmail.com>
>> wrote:
>>
>>> Piotr, what do you mean by making user-created functions shareable
>>> between engines? Do you mean UDFs written in imperative code?
>>>
>>> On Wed, Aug 7, 2024 at 12:00 PM Piotr Findeisen
>>> <piotr.findei...@gmail.com> wrote:
>>> >
>>> > Hi,
>>> >
>>> > Thank you Ajantha for creating this thread. The Iceberg UDFs are an
>>> interesting idea!
>>> > Is there a plan to make the user-created functions sharable between
>>> the engines?
>>> > If so, what would a CREATE FUNCTION statement look like in, e.g., Spark
>>> or Trino?
>>> >
>>> > Meanwhile, added a few comments in the doc.
>>> >
>>> > Best
>>> > Piotr
>>> >
>>> >
>>> > On Thu, 1 Aug 2024 at 20:50, Ryan Blue <b...@databricks.com.invalid>
>>> wrote:
>>> >>
>>> >> I just looked through the proposal and added comments. I think it
>>> would be helpful to also have a design doc that covers the choices from the
>>> draft spec. For instance, the choice to enumerate all possible function
>>> input structs rather than allowing generics and varargs.
>>> >>
>>> >> Here’s a quick summary of my feedback:
>>> >>
>>> >> I think that the choice to enumerate function signatures is limiting.
>>> It would be nice to see a discussion of the trade-offs and a rationale for
>>> the choice. I think it would also be very helpful to have a few
>>> representative use cases for this included in the doc. That way the
>>> proposal can demonstrate that it solves those use cases with reasonable
>>> trade-offs.
>>> >> There are a few instances where this is inconsistent with conventions
>>> in other specs. For example, using string IDs rather than an integer.
>>> >> This uses a very different model for spec versioning than the Iceberg
>>> view and table specs. It requires readers to fail if there are any unknown
>>> fields, which prevents the spec from adding things that are fully
>>> backward-compatible. Other Iceberg specs only require a version change to
>>> introduce forward-incompatible changes and I think that this should do the
>>> same to avoid confusion.
>>> >> It looks like the intent is to allow multiple function signatures per
>>> version, but it is unclear how to encode them because a version is
>>> associated with a single function signature.
>>> >> There is no review of SQL syntax for creating functions across
>>> engines, so this doesn’t show that the metadata proposed is sufficient for
>>> cross-engine use cases.
>>> >> The example for a table-valued function shows a SELECT statement and
>>> it isn’t clear how this is distinct from a view.
>>> >>
>>> >>
>>> >> On Thu, Aug 1, 2024 at 3:15 AM Ajantha Bhat <ajanthab...@gmail.com>
>>> wrote:
>>> >>>
>>> >>> Thanks Walaa and Robert for the review on this.
>>> >>>
>>> >>> We didn't find any blocker for the spec.
>>> >>> I will wait for a week and, if there are no more review comments, I will raise
>>> a PR for the spec addition next week.
>>> >>>
>>> >>> If anyone else is interested, please have a look at the proposal
>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit
>>> >>>
>>> >>> - Ajantha
>>> >>>
>>> >>> On Tue, Jul 16, 2024 at 1:27 PM Walaa Eldin Moustafa <
>>> wa.moust...@gmail.com> wrote:
>>> >>>>
>>> >>>> Hi Ajantha,
>>> >>>>
>>> >>>> I have left some comments. It is an interesting direction, but
>>> there might be some details that need to be fine-tuned.
>>> >>>>
>>> >>>> The doc is here [1] for others who might be interested. Resharing
>>> since I do not think it was directly linked in the thread.
>>> >>>>
>>> >>>> [1]
>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit
>>> >>>>
>>> >>>> Thanks,
>>> >>>> Walaa.
>>> >>>>
>>> >>>> On Mon, Jul 15, 2024 at 11:09 PM Ajantha Bhat <
>>> ajanthab...@gmail.com> wrote:
>>> >>>>>
>>> >>>>> Hi, just another reminder since we didn't get any review on the
>>> proposal.
>>> >>>>> Initially proposed on June 4.
>>> >>>>>
>>> >>>>> - Ajantha
>>> >>>>>
>>> >>>>> On Mon, Jun 24, 2024 at 4:21 PM Ajantha Bhat <
>>> ajanthab...@gmail.com> wrote:
>>> >>>>>>
>>> >>>>>> Hi everyone,
>>> >>>>>>
>>> >>>>>> We've only received one review so far (from Benny).
>>> >>>>>>
>>> >>>>>> We would appreciate more eyes on this.
>>> >>>>>>
>>> >>>>>> - Ajantha
>>> >>>>>>
>>> >>>>>> On Tue, Jun 4, 2024 at 7:25 AM Ajantha Bhat <
>>> ajanthab...@gmail.com> wrote:
>>> >>>>>>>
>>> >>>>>>> Hi All,
>>> >>>>>>> Please find the proposal link
>>> >>>>>>> https://github.com/apache/iceberg/issues/10432
>>> >>>>>>>
>>> >>>>>>> Google doc link is attached in the proposal.
>>> >>>>>>> And thanks to Stephen Lin for working on it.
>>> >>>>>>>
>>> >>>>>>> I hope it gives more clarity for making decisions and for how we want
>>> to implement it.
>>> >>>>>>>
>>> >>>>>>> - Ajantha
>>> >>>>>>>
>>> >>>>>>> On Wed, May 29, 2024 at 4:01 AM Walaa Eldin Moustafa <
>>> wa.moust...@gmail.com> wrote:
>>> >>>>>>>>
>>> >>>>>>>> Thanks Jack. I actually meant scalar/aggregate/table user
>>> defined functions. Here are some examples of what I meant in (2):
>>> >>>>>>>>
>>> >>>>>>>> Hive GenericUDF:
>>> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDF.java
>>> >>>>>>>> Trino user defined functions:
>>> https://trino.io/docs/current/develop/functions.html
>>> >>>>>>>> Flink user defined functions:
>>> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/table/functions/udfs/
>>> >>>>>>>>
>>> >>>>>>>> Probably what you referred to is a variation of (1) where the
>>> API is data flow/data pipeline API instead of SQL (e.g., Spark Scala). Yes,
>>> that is also possible in the very long run :)
>>> >>>>>>>>
>>> >>>>>>>> Thanks,
>>> >>>>>>>> Walaa.
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>> On Tue, May 28, 2024 at 2:57 PM Jack Ye <yezhao...@gmail.com>
>>> wrote:
>>> >>>>>>>>>
>>> >>>>>>>>> > (2) Custom code written in imperative function according to
>>> a Java/Scala/Python API, etc.
>>> >>>>>>>>>
>>> >>>>>>>>> I think we could still explore some long term opportunities in
>>> this case. Consider you register a Spark temp view as some sort of data
>>> frame read, then it could still be resolved to a Spark plan that is
>>> representable by an intermediate representation. But I agree this gets very
>>> complicated very soon, and just having the case (1) covered would already
>>> be a huge step forward.
>>> >>>>>>>>>
>>> >>>>>>>>> -Jack
>>> >>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>> On Tue, May 28, 2024 at 1:40 PM Benny Chow <btc...@gmail.com>
>>> wrote:
>>> >>>>>>>>>>
>>> >>>>>>>>>> It's interesting to note that a tabular SQL UDF can be used
>>> to build a parameterized view.  So, there's definitely a lot in common
>>> between UDFs and views.
>>> >>>>>>>>>>
>>> >>>>>>>>>> Thanks
>>> >>>>>>>>>>
>>> >>>>>>>>>> On Tue, May 28, 2024 at 9:53 AM Walaa Eldin Moustafa <
>>> wa.moust...@gmail.com> wrote:
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> I think there is a disconnect about what is perceived as a
>>> "UDF". There are 2 flavors:
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> (1) Functions that are defined by the user whose definition
>>> is a composition of other built-in functions/SQL expressions.
>>> >>>>>>>>>>> (2) Custom code written in imperative function according to
>>> a Java/Scala/Python API, etc.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> All the examples in Ajantha's references are pretty much
>>> from (1) and I think those have more analogy to views due to their SQL
>>> nature. Agree (2) is not practical to maintain by Iceberg, but I think
>>> Ajantha's use cases are around (1), and may be worth evaluating.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Thanks,
>>> >>>>>>>>>>> Walaa.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> On Tue, May 28, 2024 at 9:45 AM Ajantha Bhat <
>>> ajanthab...@gmail.com> wrote:
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>> I guess we'll know more when you post the proposal, but I
>>> think this would be a very difficult area to tackle across engines,
>>> languages, and memory models without having a huge performance penalty.
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> Assuming Iceberg initially supports SQL representations of
>>> UDFs (similar to views as shared by the reference links above), the
>>> complexity involved will be similar to managing views.
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> Thanks, Ryan, Robert, and Jack, for your input.
>>> >>>>>>>>>>>> We will work on publishing the draft spec (inspired by the
>>> view spec) this week to facilitate further discussions.
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> - Ajantha
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> On Tue, May 28, 2024 at 7:33 PM Jack Ye <
>>> yezhao...@gmail.com> wrote:
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>> > While it would be great to have a common set of
>>> functions across engines, I don't see how that is practical when those
>>> engines are implemented so differently. Plugging in code -- and especially
>>> custom user-supplied code -- seems inherently specialized to me and should
>>> be part of the engines' design.
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>> How is this different from the views? I feel we can say
>>> exactly the same thing for Iceberg views, but yet we have Iceberg
>>> multi-dialect views implemented. Maybe it sounds like we are trying to draw
>>> a line between SQL and other programming languages as "code"? But I think SQL
>>> is just another type of code, and we are already talking about compiling
>>> all these different code dialects to an intermediate representation (using
>>> projects like Coral, Substrait), which will be stored as another type of
>>> representation of Iceberg view. I think the same functionality can be used
>>> for UDFs if developed.
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>> I actually think adding UDF support is a good idea, even
>>> just a multi-dialect one like views. That would allow engines to, for
>>> example, parse a view's SQL and, when a referenced function cannot be
>>> resolved, look for a multi-dialect UDF definition.
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>> I guess we can discuss more when we have the actual
>>> proposal published.
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>> Best,
>>> >>>>>>>>>>>>> Jack Ye
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>>
>>> >>>>>>>>>>>>> On Tue, May 28, 2024 at 1:32 AM Robert Stupp <
>>> sn...@snazy.de> wrote:
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>> UDFs are as engine specific and portable and
>>> "non-centralized" as views are. The same performance concerns apply to
>>> views as well.
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>> Iceberg should define a common base upon which engines
>>> can build, so the argument that UDFs aren't practical, because engines are
>>> different, is probably only a temporary concern.
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>> In the long term, Iceberg should also try to tackle the
>>> idea to make views portable, which is conceptually not that much different
>>> from portable UDFs.
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>> PS: I'm not a fan of adding a negative touch to the idea
>>> of having UDFs in Iceberg, especially not in this early stage.
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>> On 24.05.24 20:53, Ryan Blue wrote:
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>> Thanks, Ajantha.
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>> I'm skeptical about whether it's a good idea to add UDFs
>>> tracked by Iceberg catalogs. I think that Iceberg primarily deals with
>>> things that are centralized, like tables of data. While it would be great
>>> to have a common set of functions across engines, I don't see how that is
>>> practical when those engines are implemented so differently. Plugging in
>>> code -- and especially custom user-supplied code -- seems inherently
>>> specialized to me and should be part of the engines' design.
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>> I guess we'll know more when you post the proposal, but I
>>> think this would be a very difficult area to tackle across engines,
>>> languages, and memory models without having a huge performance penalty.
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>> Ryan
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>> On Fri, May 24, 2024 at 8:10 AM Ajantha Bhat <
>>> ajanthab...@gmail.com> wrote:
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> Hi Everyone,
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> This is a discussion to gauge the community interest in
>>> storing the Versioned SQL UDFs in Iceberg.
>>> >>>>>>>>>>>>>>> We want to propose the spec addition for storing the
>>> versioned UDFs in Iceberg (inspired by view spec).
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> These UDFs can operate similarly to views in that they
>>> are associated with tables, but they can accept arguments and produce
>>> return values, or even function as inline expressions.
>>> >>>>>>>>>>>>>>> Many query engines, such as Dremio, Trino, Snowflake, and
>>> Databricks Spark, support SQL UDFs at the catalog level [1].
>>> >>>>>>>>>>>>>>> But storing them in Iceberg can enable:
>>> >>>>>>>>>>>>>>> - Versioning of these UDFs.
>>> >>>>>>>>>>>>>>> - Interoperability between the engines. Potentially,
>>> engines can understand the UDFs written by other engines (with a
>>> translation layer).
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> We believe that integrating this feature into Iceberg
>>> would be a valuable addition, and we're eager to collaborate with the
>>> community to develop a UDF specification.
>>> >>>>>>>>>>>>>>> Stephen has already begun drafting a specification to
>>> propose to the community.
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> Let us know your thoughts on this.
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> [1]
>>> >>>>>>>>>>>>>>> Dremio -
>>> https://docs.dremio.com/current/reference/sql/commands/functions#creating-a-function
>>> >>>>>>>>>>>>>>> Trino -
>>> https://trino.io/docs/current/sql/create-function.html
>>> >>>>>>>>>>>>>>> Snowflake -
>>> https://docs.snowflake.com/en/developer-guide/udf/sql/udf-sql-scalar-functions
>>> >>>>>>>>>>>>>>> Databricks -
>>> https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-sql-function.html
>>> >>>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>> - Ajantha
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>> --
>>> >>>>>>>>>>>>>> Ryan Blue
>>> >>>>>>>>>>>>>> Tabular
>>> >>>>>>>>>>>>>>
>>> >>>>>>>>>>>>>> --
>>> >>>>>>>>>>>>>> Robert Stupp
>>> >>>>>>>>>>>>>> @snazy
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Ryan Blue
>>> >> Databricks
>>>
>>
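
To make the multi-dialect idea from this thread concrete (one UDF version
carrying several SQL representations, with each engine picking its own
dialect when a function name is not a built-in), here is a minimal sketch.
It is only an illustration under stated assumptions: the class names, field
names, and lookup order are hypothetical and are not taken from the draft
spec or from any engine's API.

```python
from dataclasses import dataclass


# Hypothetical metadata shapes; NOT the layout used by the draft spec.
@dataclass(frozen=True)
class Representation:
    dialect: str  # e.g. "spark" or "trino"
    body: str     # SQL text of the function body in that dialect


@dataclass(frozen=True)
class UdfVersion:
    version_id: int
    representations: tuple  # tuple of Representation


def pick_representation(version, engine_dialect):
    """Return the representation matching the engine's dialect, or None."""
    for rep in version.representations:
        if rep.dialect.lower() == engine_dialect.lower():
            return rep
    return None


def resolve_function(name, builtins, catalog_udfs, engine_dialect):
    """Resolve a name: engine built-ins first, then catalog UDFs (assumed order)."""
    if name in builtins:
        return builtins[name]
    version = catalog_udfs.get(name)
    if version is None:
        raise LookupError(f"unknown function: {name}")
    rep = pick_representation(version, engine_dialect)
    if rep is None:
        raise LookupError(f"no {engine_dialect} representation for {name}")
    return rep.body


# Usage: one UDF version stored with two dialect-specific bodies.
add_one = UdfVersion(
    version_id=1,
    representations=(
        Representation("spark", "x + 1"),
        Representation("trino", "x + BIGINT '1'"),
    ),
)
catalog = {"add_one": add_one}
print(resolve_function("add_one", {}, catalog, "trino"))
```

An engine without a matching dialect gets an error here; a translation
layer (as mentioned in the original proposal) would be the place to fall
back to instead, which this sketch does not attempt.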
