Re: [Discussion] Versioned SQL UDFs (Catalog routines) in Iceberg

Piotr Findeisen Wed, 07 Aug 2024 12:00:44 -0700

Hi,

Thank you Ajantha for creating this thread. The Iceberg UDFs are an
interesting idea!
Is there a plan to make the user-created functions sharable between the
engines?
If so, how would a CREATE FUNCTION statement look like in e..g Spark or
Trino?


Meanwhile, added a few comments in the doc.

Best
Piotr


On Thu, 1 Aug 2024 at 20:50, Ryan Blue <b...@databricks.com.invalid> wrote:

> I just looked through the proposal and added comments. I think it would be
> helpful to also have a design doc that covers the choices from the draft
> spec. For instance, the choice to enumerate all possible function input
> struts rather than allowing generics and varargs.
>
> Here’s a quick summary of my feedback:
>
>    - I think that the choice to enumerate function signatures is
>    limiting. It would be nice to see a discussion of the trade-offs and a
>    rationale for the choice. I think it would also be very helpful to have a
>    few representative use cases for this included in the doc. That way the
>    proposal can demonstrate that it solves those use cases with reasonable
>    trade-offs.
>    - There are a few instances where this is inconsistent with
>    conventions in other specs. For example, using string IDs rather than an
>    integer.
>    - This uses a very different model for spec versioning than the
>    Iceberg view and table specs. It requires readers to fail if there are any
>    unknown fields, which prevents the spec from adding things that are fully
>    backward-compatible. Other Iceberg specs only require a version change to
>    introduce forward-incompatible changes and I think that this should do the
>    same to avoid confusion.
>    - It looks like the intent is to allow multiple function signatures
>    per verison, but it is unclear how to encode them because a version is
>    associated with a single function signature.
>    - There is no review of SQL syntax for creating functions across
>    engines, so this doesn’t show that the metadata proposed is sufficient for
>    cross-engine use cases.
>    - The example for a table-valued function shows a SELECT statement and
>    it isn’t clear how this is distinct from a view
>
>
> On Thu, Aug 1, 2024 at 3:15 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>
>> Thanks Walaa and Robert for the review on this.
>>
>> We didn't find any blocker for the spec.
>> I will wait for a week and If no more review comments, I will raise a PR
>> for spec addition next week.
>>
>> If anyone else is interested, please have a look at the proposal
>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit
>>
>> - Ajantha
>>
>> On Tue, Jul 16, 2024 at 1:27 PM Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>>
>>> Hi Ajantha,
>>>
>>> I have left some comments. It is an interesting direction, but there
>>> might be some details that need to be fine tuned.
>>>
>>> The doc is here [1] for others who might be interested. Resharing since
>>> I do not think it was directly linked in the thread.
>>>
>>> [1]
>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit
>>>
>>> Thanks,
>>> Walaa.
>>>
>>> On Mon, Jul 15, 2024 at 11:09 PM Ajantha Bhat <ajanthab...@gmail.com>
>>> wrote:
>>>
>>>> Hi, just another reminder since we didn't get any review on the
>>>> proposal.
>>>> Initially proposed on June 4.
>>>>
>>>> - Ajantha
>>>>
>>>> On Mon, Jun 24, 2024 at 4:21 PM Ajantha Bhat <ajanthab...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> We've only received one review so far (from Benny).
>>>>>
>>>>> We would appreciate more eyes on this.
>>>>>
>>>>> - Ajantha
>>>>>
>>>>> On Tue, Jun 4, 2024 at 7:25 AM Ajantha Bhat <ajanthab...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi All,
>>>>>> Please find the proposal link
>>>>>> https://github.com/apache/iceberg/issues/10432
>>>>>>
>>>>>> Google doc link is attached in the proposal.
>>>>>> And Thanks Stephen Lin <https://github.com/sxlin> for working on it.
>>>>>>
>>>>>> Hope it gives more clarity to take the decisions and how we want to
>>>>>> implement it.
>>>>>>
>>>>>> - Ajantha
>>>>>>
>>>>>> On Wed, May 29, 2024 at 4:01 AM Walaa Eldin Moustafa <
>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks Jack. I actually meant scalar/aggregate/table user defined
>>>>>>> functions. Here are some examples of what I meant in (2):
>>>>>>>
>>>>>>> Hive GenericUDF:
>>>>>>> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDF.java
>>>>>>> Trino user defined functions:
>>>>>>> https://trino.io/docs/current/develop/functions.html
>>>>>>> Flink user defined functions:
>>>>>>> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/table/functions/udfs/
>>>>>>>
>>>>>>> Probably what you referred to is a variation of (1) where the API is
>>>>>>> data flow/data pipeline API instead of SQL (e.g., Spark Scala). Yes, 
>>>>>>> that
>>>>>>> is also possible in the very long run :)
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Walaa.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, May 28, 2024 at 2:57 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>>
>>>>>>>> > (2) Custom code written in imperative function according to a
>>>>>>>> Java/Scala/Python API, etc.
>>>>>>>>
>>>>>>>> I think we could still explore some long term opportunities in this
>>>>>>>> case. Consider you register a Spark temp view as some sort of data 
>>>>>>>> frame
>>>>>>>> read, then it could still be resolved to a Spark plan that is 
>>>>>>>> representable
>>>>>>>> by an intermediate representation. But I agree this gets very 
>>>>>>>> complicated
>>>>>>>> very soon, and just having the case (1) covered would already be a huge
>>>>>>>> step forward.
>>>>>>>>
>>>>>>>> -Jack
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, May 28, 2024 at 1:40 PM Benny Chow <btc...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> It's interesting to note that a tabular SQL UDF can be used to
>>>>>>>>> build a *parameterized *view.  So, there's definitely a lot in
>>>>>>>>> common between UDFs and views.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> On Tue, May 28, 2024 at 9:53 AM Walaa Eldin Moustafa <
>>>>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I think there is a disconnect about what is perceived as a "UDF".
>>>>>>>>>> There are 2 flavors:
>>>>>>>>>>
>>>>>>>>>> (1) Functions that are defined by the user whose definition is a
>>>>>>>>>> composition of other built-in functions/SQL expressions.
>>>>>>>>>> (2) Custom code written in imperative function according to a
>>>>>>>>>> Java/Scala/Python API, etc.
>>>>>>>>>>
>>>>>>>>>> All the examples in Ajantha's references are pretty much from (1)
>>>>>>>>>> and I think those have more analogy to views due to their SQL 
>>>>>>>>>> nature. Agree
>>>>>>>>>> (2) is not practical to maintain by Iceberg, but I think Ajantha's 
>>>>>>>>>> use
>>>>>>>>>> cases are around (1), and may be worth evaluating.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Walaa.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, May 28, 2024 at 9:45 AM Ajantha Bhat <
>>>>>>>>>> ajanthab...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I guess we'll know more when you post the proposal, but I think
>>>>>>>>>>>> this would be a very difficult area to tackle across engines, 
>>>>>>>>>>>> languages,
>>>>>>>>>>>> and memory models without having a huge performance penalty.
>>>>>>>>>>>
>>>>>>>>>>> Assuming Iceberg initially supports SQL representations of UDFs
>>>>>>>>>>> (similar to views as shared by the reference links above), the 
>>>>>>>>>>> complexity
>>>>>>>>>>> involved will be similar to managing views.
>>>>>>>>>>>
>>>>>>>>>>> Thanks, Ryan, Robert, and Jack, for your input.
>>>>>>>>>>> We will work on publishing the draft spec (inspired by the view
>>>>>>>>>>> spec) this week to facilitate further discussions.
>>>>>>>>>>>
>>>>>>>>>>> - Ajantha
>>>>>>>>>>>
>>>>>>>>>>> On Tue, May 28, 2024 at 7:33 PM Jack Ye <yezhao...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> > While it would be great to have a common set of functions
>>>>>>>>>>>> across engines, I don't see how that is practical when those 
>>>>>>>>>>>> engines are
>>>>>>>>>>>> implemented so differently. Plugging in code -- and especially 
>>>>>>>>>>>> custom
>>>>>>>>>>>> user-supplied code -- seems inherently specialized to me and 
>>>>>>>>>>>> should be part
>>>>>>>>>>>> of the engines' design.
>>>>>>>>>>>>
>>>>>>>>>>>> How is this different from the views? I feel we can say exactly
>>>>>>>>>>>> the same thing for Iceberg views, but yet we have Iceberg 
>>>>>>>>>>>> multi-dialect
>>>>>>>>>>>> views implemented. Maybe it sounds like we are trying to draw a 
>>>>>>>>>>>> line
>>>>>>>>>>>> between SQL vs other programming language as "code"? but I think 
>>>>>>>>>>>> SQL is
>>>>>>>>>>>> just another type of code, and we are already talking about 
>>>>>>>>>>>> compiling all
>>>>>>>>>>>> these different code dialects to an intermediate representation 
>>>>>>>>>>>> (using
>>>>>>>>>>>> projects like Coral, Substrait), which will be stored as another 
>>>>>>>>>>>> type of
>>>>>>>>>>>> representation of Iceberg view. I think the same functionality can 
>>>>>>>>>>>> be used
>>>>>>>>>>>> for UDFs if developed.
>>>>>>>>>>>>
>>>>>>>>>>>> I actually hink adding UDF support is a good idea, even just a
>>>>>>>>>>>> multi-dialect one like view, and that can allow engines to for 
>>>>>>>>>>>> example
>>>>>>>>>>>> parse a view SQL, and when a function referenced cannot be 
>>>>>>>>>>>> resolved, try to
>>>>>>>>>>>> seek for a multi-dialect UDF definition.
>>>>>>>>>>>>
>>>>>>>>>>>> I guess we can discuss more when we have the actual proposal
>>>>>>>>>>>> published.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Jack Ye
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, May 28, 2024 at 1:32 AM Robert Stupp <sn...@snazy.de>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> UDFs are as engine specific and portable and "non-centralized"
>>>>>>>>>>>>> as views are. The same performance concerns apply to views as 
>>>>>>>>>>>>> well.
>>>>>>>>>>>>> Iceberg should define a common base upon which engines can
>>>>>>>>>>>>> build, so the argument that UDFs aren't practical, because 
>>>>>>>>>>>>> engines are
>>>>>>>>>>>>> different, is probably only a temporary concern.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In the long term, Iceberg should also try to tackle the idea
>>>>>>>>>>>>> to make views portable, which is conceptually not that much 
>>>>>>>>>>>>> different from
>>>>>>>>>>>>> portable UDFs.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> PS: I'm not a fan of adding a negative touch to the idea of
>>>>>>>>>>>>> having UDFs in Iceberg, especially not in this early stage.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 24.05.24 20:53, Ryan Blue wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks, Ajantha.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm skeptical about whether it's a good idea to add UDFs
>>>>>>>>>>>>> tracked by Iceberg catalogs. I think that Iceberg primarily deals 
>>>>>>>>>>>>> with
>>>>>>>>>>>>> things that are centralized, like tables of data. While it would 
>>>>>>>>>>>>> be great
>>>>>>>>>>>>> to have a common set of functions across engines, I don't see how 
>>>>>>>>>>>>> that is
>>>>>>>>>>>>> practical when those engines are implemented so differently. 
>>>>>>>>>>>>> Plugging in
>>>>>>>>>>>>> code -- and especially custom user-supplied code -- seems 
>>>>>>>>>>>>> inherently
>>>>>>>>>>>>> specialized to me and should be part of the engines' design.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I guess we'll know more when you post the proposal, but I
>>>>>>>>>>>>> think this would be a very difficult area to tackle across 
>>>>>>>>>>>>> engines,
>>>>>>>>>>>>> languages, and memory models without having a huge performance 
>>>>>>>>>>>>> penalty.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, May 24, 2024 at 8:10 AM Ajantha Bhat <
>>>>>>>>>>>>> ajanthab...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Everyone,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This is a discussion to gauge the community interest in
>>>>>>>>>>>>>> storing the Versioned SQL UDFs in Iceberg.
>>>>>>>>>>>>>> We want to propose the spec addition for storing the
>>>>>>>>>>>>>> versioned UDFs in Iceberg (inspired by view spec).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> These UDFs can operate similarly to views in that they are
>>>>>>>>>>>>>> associated with tables, but they can accept arguments and 
>>>>>>>>>>>>>> produce return
>>>>>>>>>>>>>> values, or even function as inline expressions.
>>>>>>>>>>>>>> Many Query engines like Dremio, Trino, Snowflake, Databricks
>>>>>>>>>>>>>> Spark supports SQL UDFs at catalog level [1].
>>>>>>>>>>>>>> But storing them in Iceberg can enable
>>>>>>>>>>>>>> - Versioning of these UDFs.
>>>>>>>>>>>>>> - Interoperability between the engines. Potentially engines
>>>>>>>>>>>>>> can understand the UDFs written by other engines (with the 
>>>>>>>>>>>>>> translate layer).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We believe that integrating this feature into Iceberg would
>>>>>>>>>>>>>> be a valuable addition, and we're eager to collaborate with the 
>>>>>>>>>>>>>> community
>>>>>>>>>>>>>> to develop a UDF specification.
>>>>>>>>>>>>>> Stephen <stephen....@dremio.com> has already begun drafting
>>>>>>>>>>>>>> a specification to propose to the community.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Let us know your thoughts on this.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>> Dremio -
>>>>>>>>>>>>>> https://docs.dremio.com/current/reference/sql/commands/functions#creating-a-function
>>>>>>>>>>>>>> Trino -
>>>>>>>>>>>>>> https://trino.io/docs/current/sql/create-function.html
>>>>>>>>>>>>>> Snowflake -
>>>>>>>>>>>>>> https://docs.snowflake.com/en/developer-guide/udf/sql/udf-sql-scalar-functions
>>>>>>>>>>>>>> Databricks -
>>>>>>>>>>>>>> https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-sql-function.html
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - Ajantha
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>> Tabular
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Robert Stupp
>>>>>>>>>>>>> @snazy
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>
> --
> Ryan Blue
> Databricks
>

Re: [Discussion] Versioned SQL UDFs (Catalog routines) in Iceberg

Reply via email to