Re: [Discussion] Versioned SQL UDFs (Catalog routines) in Iceberg

Ryan Blue Thu, 01 Aug 2024 11:50:04 -0700

I just looked through the proposal and added comments. I think it would be
helpful to also have a design doc that covers the choices from the draft
spec. For instance, the choice to enumerate all possible function input
struts rather than allowing generics and varargs.


Here’s a quick summary of my feedback:

   - I think that the choice to enumerate function signatures is limiting.
   It would be nice to see a discussion of the trade-offs and a rationale for
   the choice. I think it would also be very helpful to have a few
   representative use cases for this included in the doc. That way the
   proposal can demonstrate that it solves those use cases with reasonable
   trade-offs.
   - There are a few instances where this is inconsistent with conventions
   in other specs. For example, using string IDs rather than an integer.
   - This uses a very different model for spec versioning than the Iceberg
   view and table specs. It requires readers to fail if there are any unknown
   fields, which prevents the spec from adding things that are fully
   backward-compatible. Other Iceberg specs only require a version change to
   introduce forward-incompatible changes and I think that this should do the
   same to avoid confusion.
   - It looks like the intent is to allow multiple function signatures per
   verison, but it is unclear how to encode them because a version is
   associated with a single function signature.
   - There is no review of SQL syntax for creating functions across
   engines, so this doesn’t show that the metadata proposed is sufficient for
   cross-engine use cases.
   - The example for a table-valued function shows a SELECT statement and
   it isn’t clear how this is distinct from a view


On Thu, Aug 1, 2024 at 3:15 AM Ajantha Bhat <[email protected]> wrote:

> Thanks Walaa and Robert for the review on this.
>
> We didn't find any blocker for the spec.
> I will wait for a week and If no more review comments, I will raise a PR
> for spec addition next week.
>
> If anyone else is interested, please have a look at the proposal
> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit
>
> - Ajantha
>
> On Tue, Jul 16, 2024 at 1:27 PM Walaa Eldin Moustafa <
> [email protected]> wrote:
>
>> Hi Ajantha,
>>
>> I have left some comments. It is an interesting direction, but there
>> might be some details that need to be fine tuned.
>>
>> The doc is here [1] for others who might be interested. Resharing since I
>> do not think it was directly linked in the thread.
>>
>> [1]
>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit
>>
>> Thanks,
>> Walaa.
>>
>> On Mon, Jul 15, 2024 at 11:09 PM Ajantha Bhat <[email protected]>
>> wrote:
>>
>>> Hi, just another reminder since we didn't get any review on the
>>> proposal.
>>> Initially proposed on June 4.
>>>
>>> - Ajantha
>>>
>>> On Mon, Jun 24, 2024 at 4:21 PM Ajantha Bhat <[email protected]>
>>> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> We've only received one review so far (from Benny).
>>>>
>>>> We would appreciate more eyes on this.
>>>>
>>>> - Ajantha
>>>>
>>>> On Tue, Jun 4, 2024 at 7:25 AM Ajantha Bhat <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi All,
>>>>> Please find the proposal link
>>>>> https://github.com/apache/iceberg/issues/10432
>>>>>
>>>>> Google doc link is attached in the proposal.
>>>>> And Thanks Stephen Lin <https://github.com/sxlin> for working on it.
>>>>>
>>>>> Hope it gives more clarity to take the decisions and how we want to
>>>>> implement it.
>>>>>
>>>>> - Ajantha
>>>>>
>>>>> On Wed, May 29, 2024 at 4:01 AM Walaa Eldin Moustafa <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Thanks Jack. I actually meant scalar/aggregate/table user defined
>>>>>> functions. Here are some examples of what I meant in (2):
>>>>>>
>>>>>> Hive GenericUDF:
>>>>>> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDF.java
>>>>>> Trino user defined functions:
>>>>>> https://trino.io/docs/current/develop/functions.html
>>>>>> Flink user defined functions:
>>>>>> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/table/functions/udfs/
>>>>>>
>>>>>> Probably what you referred to is a variation of (1) where the API is
>>>>>> data flow/data pipeline API instead of SQL (e.g., Spark Scala). Yes, that
>>>>>> is also possible in the very long run :)
>>>>>>
>>>>>> Thanks,
>>>>>> Walaa.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, May 28, 2024 at 2:57 PM Jack Ye <[email protected]> wrote:
>>>>>>
>>>>>>> > (2) Custom code written in imperative function according to a
>>>>>>> Java/Scala/Python API, etc.
>>>>>>>
>>>>>>> I think we could still explore some long term opportunities in this
>>>>>>> case. Consider you register a Spark temp view as some sort of data frame
>>>>>>> read, then it could still be resolved to a Spark plan that is 
>>>>>>> representable
>>>>>>> by an intermediate representation. But I agree this gets very 
>>>>>>> complicated
>>>>>>> very soon, and just having the case (1) covered would already be a huge
>>>>>>> step forward.
>>>>>>>
>>>>>>> -Jack
>>>>>>>
>>>>>>>
>>>>>>> On Tue, May 28, 2024 at 1:40 PM Benny Chow <[email protected]> wrote:
>>>>>>>
>>>>>>>> It's interesting to note that a tabular SQL UDF can be used to
>>>>>>>> build a *parameterized *view.  So, there's definitely a lot in
>>>>>>>> common between UDFs and views.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> On Tue, May 28, 2024 at 9:53 AM Walaa Eldin Moustafa <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> I think there is a disconnect about what is perceived as a "UDF".
>>>>>>>>> There are 2 flavors:
>>>>>>>>>
>>>>>>>>> (1) Functions that are defined by the user whose definition is a
>>>>>>>>> composition of other built-in functions/SQL expressions.
>>>>>>>>> (2) Custom code written in imperative function according to a
>>>>>>>>> Java/Scala/Python API, etc.
>>>>>>>>>
>>>>>>>>> All the examples in Ajantha's references are pretty much from (1)
>>>>>>>>> and I think those have more analogy to views due to their SQL nature. 
>>>>>>>>> Agree
>>>>>>>>> (2) is not practical to maintain by Iceberg, but I think Ajantha's use
>>>>>>>>> cases are around (1), and may be worth evaluating.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Walaa.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, May 28, 2024 at 9:45 AM Ajantha Bhat <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I guess we'll know more when you post the proposal, but I think
>>>>>>>>>>> this would be a very difficult area to tackle across engines, 
>>>>>>>>>>> languages,
>>>>>>>>>>> and memory models without having a huge performance penalty.
>>>>>>>>>>
>>>>>>>>>> Assuming Iceberg initially supports SQL representations of UDFs
>>>>>>>>>> (similar to views as shared by the reference links above), the 
>>>>>>>>>> complexity
>>>>>>>>>> involved will be similar to managing views.
>>>>>>>>>>
>>>>>>>>>> Thanks, Ryan, Robert, and Jack, for your input.
>>>>>>>>>> We will work on publishing the draft spec (inspired by the view
>>>>>>>>>> spec) this week to facilitate further discussions.
>>>>>>>>>>
>>>>>>>>>> - Ajantha
>>>>>>>>>>
>>>>>>>>>> On Tue, May 28, 2024 at 7:33 PM Jack Ye <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> > While it would be great to have a common set of functions
>>>>>>>>>>> across engines, I don't see how that is practical when those 
>>>>>>>>>>> engines are
>>>>>>>>>>> implemented so differently. Plugging in code -- and especially 
>>>>>>>>>>> custom
>>>>>>>>>>> user-supplied code -- seems inherently specialized to me and should 
>>>>>>>>>>> be part
>>>>>>>>>>> of the engines' design.
>>>>>>>>>>>
>>>>>>>>>>> How is this different from the views? I feel we can say exactly
>>>>>>>>>>> the same thing for Iceberg views, but yet we have Iceberg 
>>>>>>>>>>> multi-dialect
>>>>>>>>>>> views implemented. Maybe it sounds like we are trying to draw a line
>>>>>>>>>>> between SQL vs other programming language as "code"? but I think 
>>>>>>>>>>> SQL is
>>>>>>>>>>> just another type of code, and we are already talking about 
>>>>>>>>>>> compiling all
>>>>>>>>>>> these different code dialects to an intermediate representation 
>>>>>>>>>>> (using
>>>>>>>>>>> projects like Coral, Substrait), which will be stored as another 
>>>>>>>>>>> type of
>>>>>>>>>>> representation of Iceberg view. I think the same functionality can 
>>>>>>>>>>> be used
>>>>>>>>>>> for UDFs if developed.
>>>>>>>>>>>
>>>>>>>>>>> I actually hink adding UDF support is a good idea, even just a
>>>>>>>>>>> multi-dialect one like view, and that can allow engines to for 
>>>>>>>>>>> example
>>>>>>>>>>> parse a view SQL, and when a function referenced cannot be 
>>>>>>>>>>> resolved, try to
>>>>>>>>>>> seek for a multi-dialect UDF definition.
>>>>>>>>>>>
>>>>>>>>>>> I guess we can discuss more when we have the actual proposal
>>>>>>>>>>> published.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Jack Ye
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, May 28, 2024 at 1:32 AM Robert Stupp <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> UDFs are as engine specific and portable and "non-centralized"
>>>>>>>>>>>> as views are. The same performance concerns apply to views as well.
>>>>>>>>>>>> Iceberg should define a common base upon which engines can
>>>>>>>>>>>> build, so the argument that UDFs aren't practical, because engines 
>>>>>>>>>>>> are
>>>>>>>>>>>> different, is probably only a temporary concern.
>>>>>>>>>>>>
>>>>>>>>>>>> In the long term, Iceberg should also try to tackle the idea to
>>>>>>>>>>>> make views portable, which is conceptually not that much different 
>>>>>>>>>>>> from
>>>>>>>>>>>> portable UDFs.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> PS: I'm not a fan of adding a negative touch to the idea of
>>>>>>>>>>>> having UDFs in Iceberg, especially not in this early stage.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 24.05.24 20:53, Ryan Blue wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks, Ajantha.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm skeptical about whether it's a good idea to add UDFs
>>>>>>>>>>>> tracked by Iceberg catalogs. I think that Iceberg primarily deals 
>>>>>>>>>>>> with
>>>>>>>>>>>> things that are centralized, like tables of data. While it would 
>>>>>>>>>>>> be great
>>>>>>>>>>>> to have a common set of functions across engines, I don't see how 
>>>>>>>>>>>> that is
>>>>>>>>>>>> practical when those engines are implemented so differently. 
>>>>>>>>>>>> Plugging in
>>>>>>>>>>>> code -- and especially custom user-supplied code -- seems 
>>>>>>>>>>>> inherently
>>>>>>>>>>>> specialized to me and should be part of the engines' design.
>>>>>>>>>>>>
>>>>>>>>>>>> I guess we'll know more when you post the proposal, but I think
>>>>>>>>>>>> this would be a very difficult area to tackle across engines, 
>>>>>>>>>>>> languages,
>>>>>>>>>>>> and memory models without having a huge performance penalty.
>>>>>>>>>>>>
>>>>>>>>>>>> Ryan
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, May 24, 2024 at 8:10 AM Ajantha Bhat <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Everyone,
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is a discussion to gauge the community interest in
>>>>>>>>>>>>> storing the Versioned SQL UDFs in Iceberg.
>>>>>>>>>>>>> We want to propose the spec addition for storing the versioned
>>>>>>>>>>>>> UDFs in Iceberg (inspired by view spec).
>>>>>>>>>>>>>
>>>>>>>>>>>>> These UDFs can operate similarly to views in that they are
>>>>>>>>>>>>> associated with tables, but they can accept arguments and produce 
>>>>>>>>>>>>> return
>>>>>>>>>>>>> values, or even function as inline expressions.
>>>>>>>>>>>>> Many Query engines like Dremio, Trino, Snowflake, Databricks
>>>>>>>>>>>>> Spark supports SQL UDFs at catalog level [1].
>>>>>>>>>>>>> But storing them in Iceberg can enable
>>>>>>>>>>>>> - Versioning of these UDFs.
>>>>>>>>>>>>> - Interoperability between the engines. Potentially engines
>>>>>>>>>>>>> can understand the UDFs written by other engines (with the 
>>>>>>>>>>>>> translate layer).
>>>>>>>>>>>>>
>>>>>>>>>>>>> We believe that integrating this feature into Iceberg would be
>>>>>>>>>>>>> a valuable addition, and we're eager to collaborate with the 
>>>>>>>>>>>>> community to
>>>>>>>>>>>>> develop a UDF specification.
>>>>>>>>>>>>> Stephen <[email protected]> has already begun drafting a
>>>>>>>>>>>>> specification to propose to the community.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Let us know your thoughts on this.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1]
>>>>>>>>>>>>> Dremio -
>>>>>>>>>>>>> https://docs.dremio.com/current/reference/sql/commands/functions#creating-a-function
>>>>>>>>>>>>> Trino - https://trino.io/docs/current/sql/create-function.html
>>>>>>>>>>>>> Snowflake -
>>>>>>>>>>>>> https://docs.snowflake.com/en/developer-guide/udf/sql/udf-sql-scalar-functions
>>>>>>>>>>>>> Databricks -
>>>>>>>>>>>>> https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-sql-function.html
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Ajantha
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>> Tabular
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Robert Stupp
>>>>>>>>>>>> @snazy
>>>>>>>>>>>>
>>>>>>>>>>>>

-- 
Ryan Blue
Databricks

Re: [Discussion] Versioned SQL UDFs (Catalog routines) in Iceberg

Reply via email to