Hi Ajantha,

I see that the UDF Sync is scheduled in the "Iceberg Dev Events" calendar for tomorrow 7/28 at 9 AM PT. I missed the last one, but I'll be at this one.
Best,
Kevin Liu

On Mon, Jul 14, 2025 at 9:22 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:

> Hey everyone,
>
> No one joined the sync today. I came to know that Yufei is on holiday, and
> Ryan and others couldn't make it, similar to the last sync. It seems Yufei
> might have forgotten to transfer meeting ownership as well, as new members
> needed admin approval and couldn't join automatically this week. Also, I
> can understand it is summer holiday season for many.
>
> I've updated the function signature schema and other open points. I
> believe we're very close to the final version of the spec. A meeting is
> indeed necessary to finalize this, but we don't have to wait for it to
> finish the review process. We had many meetings on this in the past
> already. So, please review the document at your earliest convenience. If we
> agree on the spec by next week, I can raise a PR.
>
> - Ajantha
>
> On Thu, Jul 3, 2025 at 4:03 AM Yufei Gu <flyrain...@gmail.com> wrote:
>
>> I’d propose to move the field `properties` from a top level field to a
>> field inside “version” along with a representation, so that properties are
>> versioned. A property like “deterministic” could change along with
>> representation over time. For example, we need to change “deterministic”
>> from true to false in case of adding a non-deterministic SQL
>> expression/function(e.g., now()) inside an UDF. Otherwise, rollback won't
>> be safe.
>>
>> That said, it's still an open question whether we need any non-versioned
>> properties. We can introduce them later if a use case arises.
>>
>> Yufei
>>
>>
>> On Wed, Jul 2, 2025 at 3:06 PM Yufei Gu <flyrain...@gmail.com> wrote:
>>
>>> Thanks for the summary, Ajantha!
>>>
>>> I’d prefer to keep the signature list separate from the representation
>>> history. Here are reasons:
>>>
>>> 1. Each version still enforces a single signature. Although the
>>> signatures array is global to the UDF, each version references just one
>>> signature ID.
>>> Rollbacks to historical versions remain safe.
>>> 2. We’ve separated the less frequently changing component
>>> (signatures) from the more dynamic one (representations) to reduce
>>> metadata file size.
>>> 3. Since signatures use Iceberg data types, they should remain
>>> unaffected by multi-dialect representation differences.
>>>
>>> Yufei
>>>
>>>
>>> On Mon, Jun 30, 2025 at 11:28 AM Ajantha Bhat <ajanthab...@gmail.com>
>>> wrote:
>>>
>>>> Thanks to everyone who joined the sync.
>>>> Here is the meeting recording:
>>>> https://drive.google.com/file/d/1FcOSbHo9ZIVeZXdUlmoG42o-chB7Q15P/view?usp=sharing
>>>>
>>>> Summary:
>>>> We have discussed the action items from the last sync (*see Appendix C* in
>>>> the proposal doc)
>>>>
>>>> - Function overloading: Supported by a few of the engines and on the
>>>> roadmaps of many engines. Iceberg will support it. We will maintain the
>>>> `FunctionIdentifier` (extends `TableIdentifier` but also has a member
>>>> containing the function argument's type list). And all operations like
>>>> load, rename, list, create and drop are based on `FunctionIdentifier`.
>>>> - Secure UDF: If we store it as a property in a bag, we need to
>>>> standardize the property name. Iceberg encryption may be orthogonal to
>>>> this discussion.
>>>> - UDFs with multi-statement and procedural bodies are supported by
>>>> some engines. Iceberg will support them. The body is stored as-is when
>>>> the engine creates the function.
>>>>
>>>> New discussions around:
>>>>
>>>> - Standardizing the property names (deterministic, secure).
>>>> - About the rename function.
>>>> - Replace function. To check up to what level replace is supported
>>>> (considering function overloading).
>>>> - Signature should be associated with representation?
>>>>
>>>> I think we are close on the spec. Please review the proposal
>>>>
>>>> <https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit?usp=sharing>
>>>> .
>>>> >>>> Details for next Iceberg UDF sync: >>>> >>>> *Monday, July 14 · 9:00 – 10:00am*Time zone: America/Los_Angeles >>>> Google Meet joining info >>>> Video call link: https://meet.google.com/aui-czix-nbh >>>> >>>> - Ajantha >>>> >>>> On Mon, Jun 30, 2025 at 9:27 PM Ajantha Bhat <ajanthab...@gmail.com> >>>> wrote: >>>> >>>>> Can it be handled by Iceberg encryption? If the whole metadata is >>>>> encrypted, we don't have to worry about just hiding the UDF body? Let us >>>>> discuss more on the sync today. >>>>> >>>>> On Mon, Jun 30, 2025 at 9:22 PM Yufei Gu <flyrain...@gmail.com> wrote: >>>>> >>>>>> Yes, hiding the definition and disabling pushdown are required.We >>>>>> will need a named key(e.g., secure) somewhere, no matter if it is a top >>>>>> level property or a key as a part of the UDF properties. So that both UDF >>>>>> creator and consumer can recognize it. >>>>>> >>>>>> Yufei >>>>>> >>>>>> >>>>>> On Thu, Jun 26, 2025 at 4:27 PM Ryan Blue <rdb...@gmail.com> wrote: >>>>>> >>>>>>> Thanks for the extra detail. What do you think the spec would >>>>>>> require? Would it require hiding the UDF definition from users and >>>>>>> require >>>>>>> specific pushdown cases be disabled? The use cases seem valid, but I'm >>>>>>> trying to understand the requirements this places on engines and why it >>>>>>> needs to be part of the spec, rather than part of the properties of the >>>>>>> UDF. >>>>>>> >>>>>>> On Fri, Jun 20, 2025 at 3:56 PM Yufei Gu <flyrain...@gmail.com> >>>>>>> wrote: >>>>>>> >>>>>>>> Hi Ryan, >>>>>>>> >>>>>>>> Here are the main use cases for secure UDFs: >>>>>>>> >>>>>>>> 1. >>>>>>>> >>>>>>>> Hiding UDF Definitions: This includes concealing the UDF body >>>>>>>> and details like the list of imports, some of them aren’t >>>>>>>> applicable to SQL >>>>>>>> UDFs. >>>>>>>> 2. >>>>>>>> >>>>>>>> Sandboxed Execution: Ensuring the UDF runs in an isolated >>>>>>>> environment. Again, this typically doesn’t apply to SQL UDFs. >>>>>>>> 3. 
>>>>>>>> >>>>>>>> Preventing Data Leakage at Execution Time: For example, secure >>>>>>>> UDFs may disable certain optimizations—such as predicate >>>>>>>> pushdown—to avoid >>>>>>>> exposing sensitive data indirectly. [1] >>>>>>>> >>>>>>>> Given these scenarios, I agree with your point that the secure >>>>>>>> flag is primarily an instruction to the engine to behave differently. >>>>>>>> While >>>>>>>> it's largely an engine-side behavior, we still need to include this >>>>>>>> flag in >>>>>>>> the UDF definition to indicate whether a UDF is secure, especially >>>>>>>> considering the perf penalty introduced by scenario #3. We should >>>>>>>> clearly >>>>>>>> recommend that users avoid marking UDFs as secure unless it's truly >>>>>>>> necessary. >>>>>>>> >>>>>>>> [1] >>>>>>>> https://docs.snowflake.com/en/developer-guide/pushdown-optimization#example-of-indirect-data-exposure-through-pushdown >>>>>>>> Yufei >>>>>>>> >>>>>>>> >>>>>>>> On Wed, Jun 18, 2025 at 12:32 PM Ryan Blue <rdb...@gmail.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Yufei, could you make the argument for supporting a "secure" UDF? >>>>>>>>> What use case are you addressing and what specifically changes about >>>>>>>>> how >>>>>>>>> the UDF is handled? If the idea is to hide the UDF definition, do we >>>>>>>>> need >>>>>>>>> to include it? >>>>>>>>> >>>>>>>>> I think this would be a signal to a "trusted engine". When the >>>>>>>>> engine interacts with the catalog it sends authorization information >>>>>>>>> about >>>>>>>>> itself in addition to the user that it is acting on behalf of. That >>>>>>>>> way the >>>>>>>>> catalog knows that the secure UDF can be sent to the engine and won't >>>>>>>>> be >>>>>>>>> shown to the user. The majority of this logic is on the REST server >>>>>>>>> side, >>>>>>>>> and the only part that is communicated to the client is the request >>>>>>>>> not to >>>>>>>>> show the UDF to the user, right? 
In that case should this be a >>>>>>>>> property >>>>>>>>> rather than part of the definition? Even if we state that the client >>>>>>>>> "must" >>>>>>>>> suppress the UDF definition, it's really just a request. Only trusted >>>>>>>>> engines can be passed the UDF definition, so a spec requirement to >>>>>>>>> suppress >>>>>>>>> the definition isn't very meaningful. >>>>>>>>> >>>>>>>>> On Mon, Jun 16, 2025 at 5:42 PM Yufei Gu <flyrain...@gmail.com> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Thanks for the summary, Ajantha! >>>>>>>>>> >>>>>>>>>> Multi-statement UDFs are definitely useful, but whether those >>>>>>>>>> statements run within a single transaction should be treated as an >>>>>>>>>> engine-level concern. The Iceberg UDF spec can spell out the >>>>>>>>>> expectation, >>>>>>>>>> yet the actual guarantee still depends on the runtime. Even if a UDF >>>>>>>>>> declares itself transactional, the engine may or may not enforce it. >>>>>>>>>> >>>>>>>>>> One more thing: should we also introduce a “secure UDF” option >>>>>>>>>> supported by some engines[1], so the body and any sensitive details >>>>>>>>>> stay >>>>>>>>>> hidden from callers? >>>>>>>>>> >>>>>>>>>> [1] >>>>>>>>>> https://docs.snowflake.com/en/developer-guide/secure-udf-procedure >>>>>>>>>> >>>>>>>>>> Yufei >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Mon, Jun 16, 2025 at 12:02 PM Ajantha Bhat < >>>>>>>>>> ajanthab...@gmail.com> wrote: >>>>>>>>>> >>>>>>>>>>> Thanks to everyone who joined the sync. >>>>>>>>>>> Here is the meeting recording: >>>>>>>>>>> https://drive.google.com/file/d/10_Getaasv6tDMGzeZQUgcUVwCUAaFxiz/view?usp=sharing >>>>>>>>>>> Summary: >>>>>>>>>>> >>>>>>>>>>> - We have gone through the SQL UDF syntax supported by >>>>>>>>>>> different engines (Snowflake, databricks, Dremio, Trino, OSS >>>>>>>>>>> spark 4.0). >>>>>>>>>>> - Each engine uses its own block separator, like $$ or '' or >>>>>>>>>>> none. 
Action item was to check whether engines support >>>>>>>>>>> multi-statement >>>>>>>>>>> (transactional) UDF bodies. >>>>>>>>>>> - Discussed about function overloading. Need to check >>>>>>>>>>> whether these engines support function overloading for SQL UDFs. >>>>>>>>>>> Postgres >>>>>>>>>>> supports it! If yes, need to adopt the spec to handle it. >>>>>>>>>>> - Started online spec review and discussed the deterministic >>>>>>>>>>> flag and concluded that we keep the independent fields (like >>>>>>>>>>> deterministic) >>>>>>>>>>> in spec only if the majority of engines supports it. Else it >>>>>>>>>>> will be passed >>>>>>>>>>> in a property bag (engine specific). And it is the engine's >>>>>>>>>>> responsibility to honor those optional properties. >>>>>>>>>>> >>>>>>>>>>> Feel free to review the current proposal document here >>>>>>>>>>> <https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit?usp=sharing>. >>>>>>>>>>> >>>>>>>>>>> Final spec will be put to review and vote once it is ready. >>>>>>>>>>> >>>>>>>>>>> Details for next Iceberg UDF sync: >>>>>>>>>>> >>>>>>>>>>> *Monday, June 30 · 9:00 – 10:00am*Time zone: America/Los_Angeles >>>>>>>>>>> Google Meet joining info >>>>>>>>>>> Video call link: https://meet.google.com/aui-czix-nbh >>>>>>>>>>> >>>>>>>>>>> - Ajantha >>>>>>>>>>> >>>>>>>>>>> On Wed, Jun 4, 2025 at 9:00 PM Ajantha Bhat < >>>>>>>>>>> ajanthab...@gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> Thanks to everyone who joined the sync. >>>>>>>>>>>> Here is the meeting recording: >>>>>>>>>>>> https://drive.google.com/file/d/1WItItsNs3m3-no7_qWPHftGqVNOdpw5C/view?usp=sharing >>>>>>>>>>>> >>>>>>>>>>>> Summary: >>>>>>>>>>>> >>>>>>>>>>>> - >>>>>>>>>>>> >>>>>>>>>>>> We discussed including Python support; the majority agreed *not >>>>>>>>>>>> to* (see recording for details). >>>>>>>>>>>> - >>>>>>>>>>>> >>>>>>>>>>>> No strong opposition to versioning — it will be included to >>>>>>>>>>>> support change tracking and similar use cases. 
>>>>>>>>>>>> - >>>>>>>>>>>> >>>>>>>>>>>> Suggestions were made to document how each catalog resolves >>>>>>>>>>>> UDFs, similar to views and tables. >>>>>>>>>>>> - >>>>>>>>>>>> >>>>>>>>>>>> We agreed not to deviate from the existing table/view spec >>>>>>>>>>>> — e.g., location will remain *required* for cross-catalog >>>>>>>>>>>> compatibility. >>>>>>>>>>>> - >>>>>>>>>>>> >>>>>>>>>>>> We also discussed a bit about view interoperability as the >>>>>>>>>>>> same things are applicable here. >>>>>>>>>>>> >>>>>>>>>>>> Feel free to review the proposal document >>>>>>>>>>>> >>>>>>>>>>>> <https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit?pli=1&tab=t.0> >>>>>>>>>>>> here. >>>>>>>>>>>> With the current scope, it is similar to the view/table spec >>>>>>>>>>>> now. >>>>>>>>>>>> Final spec will be put to review and vote once it is ready. >>>>>>>>>>>> >>>>>>>>>>>> Details for next Iceberg UDF sync: >>>>>>>>>>>> >>>>>>>>>>>> *Monday, June 16 · 9:00 – 10:00am*Time zone: >>>>>>>>>>>> America/Los_Angeles >>>>>>>>>>>> Google Meet joining info >>>>>>>>>>>> Video call link: https://meet.google.com/aui-czix-nbh >>>>>>>>>>>> >>>>>>>>>>>> - Ajantha >>>>>>>>>>>> >>>>>>>>>>>> On Wed, May 21, 2025 at 3:33 AM Yufei Gu <flyrain...@gmail.com> >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hi folks, >>>>>>>>>>>>> >>>>>>>>>>>>> We’ve set up a dedicated bi-weekly community sync for the UDF >>>>>>>>>>>>> project. Everyone’s welcome to drop in and share ideas! 
Here is >>>>>>>>>>>>> the meeting >>>>>>>>>>>>> link: >>>>>>>>>>>>> >>>>>>>>>>>>> Iceberg UDF sync >>>>>>>>>>>>> Monday, June 2 · 9:00 – 10:00am >>>>>>>>>>>>> Time zone: America/Los_Angeles >>>>>>>>>>>>> Google Meet joining info >>>>>>>>>>>>> Video call link: https://meet.google.com/aui-czix-nbh >>>>>>>>>>>>> >>>>>>>>>>>>> Yufei >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Fri, May 16, 2025 at 10:45 AM Ajantha Bhat < >>>>>>>>>>>>> ajanthab...@gmail.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Update on the progress. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I had a meeting today with Yufei and Yun.zou to discuss the >>>>>>>>>>>>>> UDF proposal. We covered several key points, though some are >>>>>>>>>>>>>> still open for >>>>>>>>>>>>>> further discussion: >>>>>>>>>>>>>> >>>>>>>>>>>>>> a) *UDF Versioning*: Do we truly need versioning for UDFs at >>>>>>>>>>>>>> this stage? We explored the possibility of simplifying the >>>>>>>>>>>>>> specification by >>>>>>>>>>>>>> avoiding view replication, and potentially introducing >>>>>>>>>>>>>> versioning support >>>>>>>>>>>>>> later. UDTFs, being a superset of views in some ways, may not >>>>>>>>>>>>>> require >>>>>>>>>>>>>> versioning initially. >>>>>>>>>>>>>> >>>>>>>>>>>>>> b) *VarArgs Support*: While some query engines may not >>>>>>>>>>>>>> support vararg syntax in CREATE FUNCTION, Iceberg UDFs could >>>>>>>>>>>>>> represent such arguments as lists when supported by the engine. >>>>>>>>>>>>>> >>>>>>>>>>>>>> c) *Generics in UDFs*: Since Iceberg currently doesn’t >>>>>>>>>>>>>> support generic types (e.g., object), we can only map >>>>>>>>>>>>>> engine-specific types to Iceberg types. As a result, generic >>>>>>>>>>>>>> data types >>>>>>>>>>>>>> will not be supported in the initial version. >>>>>>>>>>>>>> >>>>>>>>>>>>>> d) *Python Support*: Incorporating Python as a language for >>>>>>>>>>>>>> SQL UDFs seems promising, especially given its potential to >>>>>>>>>>>>>> resolve >>>>>>>>>>>>>> interoperability challenges. 
Some engines, however, require >>>>>>>>>>>>>> platform >>>>>>>>>>>>>> version and package dependency details to execute Python >>>>>>>>>>>>>> code—this should >>>>>>>>>>>>>> be captured in the specification. >>>>>>>>>>>>>> >>>>>>>>>>>>>> *Next Steps* >>>>>>>>>>>>>> I will update the proposal document with two primary UDF use >>>>>>>>>>>>>> cases: >>>>>>>>>>>>>> >>>>>>>>>>>>>> - >>>>>>>>>>>>>> >>>>>>>>>>>>>> Policy exchange between engines >>>>>>>>>>>>>> - >>>>>>>>>>>>>> >>>>>>>>>>>>>> UDTF as a superset of view functionality >>>>>>>>>>>>>> >>>>>>>>>>>>>> The update will include corresponding syntax examples in both >>>>>>>>>>>>>> SQL and Python, and detail how each use case is represented in >>>>>>>>>>>>>> Iceberg >>>>>>>>>>>>>> metadata. >>>>>>>>>>>>>> >>>>>>>>>>>>>> We also plan to set up regular syncs (open to more interested >>>>>>>>>>>>>> participants) to continue refining and finalizing the UDF >>>>>>>>>>>>>> specification. >>>>>>>>>>>>>> - Ajantha >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Wed, Mar 12, 2025 at 9:16 PM Ajantha Bhat < >>>>>>>>>>>>>> ajanthab...@gmail.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi everyone, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I've updated the design document[1] based on the previous >>>>>>>>>>>>>>> comments. Additionally, I've included the SQL UDF syntax >>>>>>>>>>>>>>> supported by >>>>>>>>>>>>>>> various vendors, including Dremio, Snowflake, Databricks, and >>>>>>>>>>>>>>> Trino. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I'm happy to schedule a separate sync if a deeper discussion >>>>>>>>>>>>>>> is needed. Let's keep moving forward, especially with the >>>>>>>>>>>>>>> renewed interest >>>>>>>>>>>>>>> from the community. 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> [1] >>>>>>>>>>>>>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit?usp=sharing >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Thu, Feb 13, 2025 at 11:17 PM Ajantha Bhat < >>>>>>>>>>>>>>> ajanthab...@gmail.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hey everyone, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> During the last catalog community sync, there was >>>>>>>>>>>>>>>> significant interest in storing UDFs in Iceberg and adding >>>>>>>>>>>>>>>> endpoints for >>>>>>>>>>>>>>>> UDF handling in the REST catalog spec. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I recently discussed this with Yufei to better understand >>>>>>>>>>>>>>>> the new requirement of using UDFs for fine-grained access >>>>>>>>>>>>>>>> control policies. >>>>>>>>>>>>>>>> This expands the use cases beyond just versioned and >>>>>>>>>>>>>>>> interoperable UDFs. >>>>>>>>>>>>>>>> Additionally, I learnt that many vendors are interested in >>>>>>>>>>>>>>>> this feature. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Given the strong community interest and support, I’d like >>>>>>>>>>>>>>>> to take ownership of this effort and revive the work. I'll be >>>>>>>>>>>>>>>> revisiting >>>>>>>>>>>>>>>> the document I proposed long back and will share an updated >>>>>>>>>>>>>>>> proposal by >>>>>>>>>>>>>>>> next week. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Looking forward to storing UDFs in Iceberg! >>>>>>>>>>>>>>>> - Ajantha >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Thu, Aug 8, 2024 at 2:55 PM Dmitri Bourlatchkov >>>>>>>>>>>>>>>> <dmitri.bourlatch...@dremio.com.invalid> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> The UDF spec does not require representations to be SQL. >>>>>>>>>>>>>>>>> It merely does not specify (in this revision) how other >>>>>>>>>>>>>>>>> representations are >>>>>>>>>>>>>>>>> to be written. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> This seems like an easy extension (adding a new type in >>>>>>>>>>>>>>>>> the "Representations" section). 
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>> Dmitri.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, Aug 8, 2024 at 3:47 PM Ryan Blue
>>>>>>>>>>>>>>>>> <b...@databricks.com.invalid> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Right now, SQL is an explicit requirement of the spec. It
>>>>>>>>>>>>>>>>>> leaves a way for future versions to add different representations later,
>>>>>>>>>>>>>>>>>> but only SQL is supported. That was also the feedback to my initial
>>>>>>>>>>>>>>>>>> skepticism about how it would work to add functions.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thu, Aug 8, 2024 at 12:44 PM Dmitri Bourlatchkov
>>>>>>>>>>>>>>>>>> <dmitri.bourlatch...@dremio.com.invalid> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I do not think the spec is meant to allow only SQL
>>>>>>>>>>>>>>>>>>> representations, although it is certainly favouring SQL in examples... It
>>>>>>>>>>>>>>>>>>> would be nice to add a non-SQL example, indeed.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>> Dmitri.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Thu, Aug 8, 2024 at 9:00 AM Fokko Driesprong <fo...@apache.org> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Coming from PyIceberg, I have concerns as this proposal
>>>>>>>>>>>>>>>>>>>> focuses on SQL-based engines, while Python-based systems often work with
>>>>>>>>>>>>>>>>>>>> data frames. Adding imperative languages like Python would make this
>>>>>>>>>>>>>>>>>>>> proposal more inclusive.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>> Fokko
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Thu, 8 Aug 2024 at 10:27, Piotr Findeisen <piotr.findei...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Walaa, thanks for asking!
>>>>>>>>>>>>>>>>>>>>> In the design doc linked before in this thread [1] I read
>>>>>>>>>>>>>>>>>>>>> "Without a common standard, the UDFs are hard to share
>>>>>>>>>>>>>>>>>>>>> among different engines."
>>>>>>>>>>>>>>>>>>>>> ("Background and Motivation" section).
>>>>>>>>>>>>>>>>>>>>> I agree with this statement. I don't fully understand
>>>>>>>>>>>>>>>>>>>>> yet how the proposed design addresses shareability between the engines
>>>>>>>>>>>>>>>>>>>>> though.
>>>>>>>>>>>>>>>>>>>>> I could use some help to understand this better.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Best
>>>>>>>>>>>>>>>>>>>>> Piotr
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> [1] SQL User-Defined Function Spec
>>>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Wed, 7 Aug 2024 at 21:14, Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Piotr, what do you mean by making user-created functions shareable
>>>>>>>>>>>>>>>>>>>>>> between engines? Do you mean UDFs written in imperative code?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Wed, Aug 7, 2024 at 12:00 PM Piotr Findeisen
>>>>>>>>>>>>>>>>>>>>>> <piotr.findei...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>> > Hi,
>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>> > Thank you Ajantha for creating this thread. The Iceberg UDFs are an interesting idea!
>>>>>>>>>>>>>>>>>>>>>> > Is there a plan to make the user-created functions shareable between the engines?
>>>>>>>>>>>>>>>>>>>>>> > If so, how would a CREATE FUNCTION statement look in e.g. Spark or Trino?
>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>> > Meanwhile, added a few comments in the doc.
>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>> > Best
>>>>>>>>>>>>>>>>>>>>>> > Piotr
>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>> > On Thu, 1 Aug 2024 at 20:50, Ryan Blue <b...@databricks.com.invalid> wrote:
>>>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>>>> >> I just looked through the proposal and added comments. I think it would be helpful to also have a design doc that covers the choices from the draft spec. For instance, the choice to enumerate all possible function input structs rather than allowing generics and varargs.
>>>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>>>> >> Here’s a quick summary of my feedback:
>>>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>>>> >> I think that the choice to enumerate function signatures is limiting. It would be nice to see a discussion of the trade-offs and a rationale for the choice. I think it would also be very helpful to have a few representative use cases for this included in the doc. That way the proposal can demonstrate that it solves those use cases with reasonable trade-offs.
>>>>>>>>>>>>>>>>>>>>>> >> There are a few instances where this is inconsistent with conventions in other specs. For example, using string IDs rather than an integer.
>>>>>>>>>>>>>>>>>>>>>> >> This uses a very different model for spec versioning than the Iceberg view and table specs.
It >>>>>>>>>>>>>>>>>>>>>> requires readers to >>>>>>>>>>>>>>>>>>>>>> fail if there are any unknown fields, which prevents the >>>>>>>>>>>>>>>>>>>>>> spec from adding >>>>>>>>>>>>>>>>>>>>>> things that are fully backward-compatible. Other Iceberg >>>>>>>>>>>>>>>>>>>>>> specs only require >>>>>>>>>>>>>>>>>>>>>> a version change to introduce forward-incompatible >>>>>>>>>>>>>>>>>>>>>> changes and I think that >>>>>>>>>>>>>>>>>>>>>> this should do the same to avoid confusion. >>>>>>>>>>>>>>>>>>>>>> >> It looks like the intent is to allow multiple >>>>>>>>>>>>>>>>>>>>>> function signatures per verison, but it is unclear how >>>>>>>>>>>>>>>>>>>>>> to encode them >>>>>>>>>>>>>>>>>>>>>> because a version is associated with a single function >>>>>>>>>>>>>>>>>>>>>> signature. >>>>>>>>>>>>>>>>>>>>>> >> There is no review of SQL syntax for creating >>>>>>>>>>>>>>>>>>>>>> functions across engines, so this doesn’t show that the >>>>>>>>>>>>>>>>>>>>>> metadata proposed >>>>>>>>>>>>>>>>>>>>>> is sufficient for cross-engine use cases. >>>>>>>>>>>>>>>>>>>>>> >> The example for a table-valued function shows a >>>>>>>>>>>>>>>>>>>>>> SELECT statement and it isn’t clear how this is distinct >>>>>>>>>>>>>>>>>>>>>> from a view >>>>>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>>>>> >> On Thu, Aug 1, 2024 at 3:15 AM Ajantha Bhat < >>>>>>>>>>>>>>>>>>>>>> ajanthab...@gmail.com> wrote: >>>>>>>>>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>>>>>>>>> >>> Thanks Walaa and Robert for the review on this. >>>>>>>>>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>>>>>>>>> >>> We didn't find any blocker for the spec. >>>>>>>>>>>>>>>>>>>>>> >>> I will wait for a week and If no more review >>>>>>>>>>>>>>>>>>>>>> comments, I will raise a PR for spec addition next week. 
>>>>>>>>>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>>>>>>>>> >>> If anyone else is interested, please have a look >>>>>>>>>>>>>>>>>>>>>> at the proposal >>>>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit >>>>>>>>>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>>>>>>>>> >>> - Ajantha >>>>>>>>>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>>>>>>>>> >>> On Tue, Jul 16, 2024 at 1:27 PM Walaa Eldin >>>>>>>>>>>>>>>>>>>>>> Moustafa <wa.moust...@gmail.com> wrote: >>>>>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>>>>> >>>> Hi Ajantha, >>>>>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>>>>> >>>> I have left some comments. It is an interesting >>>>>>>>>>>>>>>>>>>>>> direction, but there might be some details that need to >>>>>>>>>>>>>>>>>>>>>> be fine tuned. >>>>>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>>>>> >>>> The doc is here [1] for others who might be >>>>>>>>>>>>>>>>>>>>>> interested. Resharing since I do not think it was >>>>>>>>>>>>>>>>>>>>>> directly linked in the >>>>>>>>>>>>>>>>>>>>>> thread. >>>>>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>>>>> >>>> [1] >>>>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1BDvOfhrH0ZQiQv9eLBqeAu8k8Vjfmeql9VzIiW1F0vc/edit >>>>>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>>>>> >>>> Thanks, >>>>>>>>>>>>>>>>>>>>>> >>>> Walaa. >>>>>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>>>>> >>>> On Mon, Jul 15, 2024 at 11:09 PM Ajantha Bhat < >>>>>>>>>>>>>>>>>>>>>> ajanthab...@gmail.com> wrote: >>>>>>>>>>>>>>>>>>>>>> >>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>> Hi, just another reminder since we didn't get >>>>>>>>>>>>>>>>>>>>>> any review on the proposal. >>>>>>>>>>>>>>>>>>>>>> >>>>> Initially proposed on June 4. 
>>>>>>>>>>>>>>>>>>>>>> >>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>> - Ajantha >>>>>>>>>>>>>>>>>>>>>> >>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>> On Mon, Jun 24, 2024 at 4:21 PM Ajantha Bhat < >>>>>>>>>>>>>>>>>>>>>> ajanthab...@gmail.com> wrote: >>>>>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>> Hi everyone, >>>>>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>> We've only received one review so far (from >>>>>>>>>>>>>>>>>>>>>> Benny). >>>>>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>> We would appreciate more eyes on this. >>>>>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>> - Ajantha >>>>>>>>>>>>>>>>>>>>>> >>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>> On Tue, Jun 4, 2024 at 7:25 AM Ajantha Bhat < >>>>>>>>>>>>>>>>>>>>>> ajanthab...@gmail.com> wrote: >>>>>>>>>>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>> Hi All, >>>>>>>>>>>>>>>>>>>>>> >>>>>>> Please find the proposal link >>>>>>>>>>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>>>>>>>>>> https://github.com/apache/iceberg/issues/10432 >>>>>>>>>>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>> Google doc link is attached in the proposal. >>>>>>>>>>>>>>>>>>>>>> >>>>>>> And Thanks Stephen Lin for working on it. >>>>>>>>>>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>> Hope it gives more clarity to take the >>>>>>>>>>>>>>>>>>>>>> decisions and how we want to implement it. >>>>>>>>>>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>> - Ajantha >>>>>>>>>>>>>>>>>>>>>> >>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>> On Wed, May 29, 2024 at 4:01 AM Walaa Eldin >>>>>>>>>>>>>>>>>>>>>> Moustafa <wa.moust...@gmail.com> wrote: >>>>>>>>>>>>>>>>>>>>>> >>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>> Thanks Jack. I actually meant >>>>>>>>>>>>>>>>>>>>>> scalar/aggregate/table user defined functions. 
Here are >>>>>>>>>>>>>>>>>>>>>> some examples of >>>>>>>>>>>>>>>>>>>>>> what I meant in (2): >>>>>>>>>>>>>>>>>>>>>> >>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>> Hive GenericUDF: >>>>>>>>>>>>>>>>>>>>>> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDF.java >>>>>>>>>>>>>>>>>>>>>> >>>>>>>> Trino user defined functions: >>>>>>>>>>>>>>>>>>>>>> https://trino.io/docs/current/develop/functions.html >>>>>>>>>>>>>>>>>>>>>> >>>>>>>> Flink user defined functions: >>>>>>>>>>>>>>>>>>>>>> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/table/functions/udfs/ >>>>>>>>>>>>>>>>>>>>>> >>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>> Probably what you referred to is a variation >>>>>>>>>>>>>>>>>>>>>> of (1) where the API is data flow/data pipeline API >>>>>>>>>>>>>>>>>>>>>> instead of SQL (e.g., >>>>>>>>>>>>>>>>>>>>>> Spark Scala). Yes, that is also possible in the very >>>>>>>>>>>>>>>>>>>>>> long run :) >>>>>>>>>>>>>>>>>>>>>> >>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>>>>> >>>>>>>> Walaa. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>> On Tue, May 28, 2024 at 2:57 PM Jack Ye < >>>>>>>>>>>>>>>>>>>>>> yezhao...@gmail.com> wrote: >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> > (2) Custom code written in imperative >>>>>>>>>>>>>>>>>>>>>> function according to a Java/Scala/Python API, etc. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> I think we could still explore some long >>>>>>>>>>>>>>>>>>>>>> term opportunities in this case. Consider you register a >>>>>>>>>>>>>>>>>>>>>> Spark temp view as >>>>>>>>>>>>>>>>>>>>>> some sort of data frame read, then it could still be >>>>>>>>>>>>>>>>>>>>>> resolved to a Spark >>>>>>>>>>>>>>>>>>>>>> plan that is representable by an intermediate >>>>>>>>>>>>>>>>>>>>>> representation. 
But I agree >>>>>>>>>>>>>>>>>>>>>> this gets very complicated very soon, and just having >>>>>>>>>>>>>>>>>>>>>> the case (1) covered >>>>>>>>>>>>>>>>>>>>>> would already be a huge step forward. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> -Jack >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>> On Tue, May 28, 2024 at 1:40 PM Benny Chow < >>>>>>>>>>>>>>>>>>>>>> btc...@gmail.com> wrote: >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>> It's interesting to note that a tabular >>>>>>>>>>>>>>>>>>>>>> SQL UDF can be used to build a parameterized view. So, >>>>>>>>>>>>>>>>>>>>>> there's definitely >>>>>>>>>>>>>>>>>>>>>> a lot in common between UDFs and views. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>> Thanks >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>> On Tue, May 28, 2024 at 9:53 AM Walaa >>>>>>>>>>>>>>>>>>>>>> Eldin Moustafa <wa.moust...@gmail.com> wrote: >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> I think there is a disconnect about what >>>>>>>>>>>>>>>>>>>>>> is perceived as a "UDF". There are 2 flavors: >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> (1) Functions that are defined by the >>>>>>>>>>>>>>>>>>>>>> user whose definition is a composition of other built-in >>>>>>>>>>>>>>>>>>>>>> functions/SQL >>>>>>>>>>>>>>>>>>>>>> expressions. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> (2) Custom code written in imperative >>>>>>>>>>>>>>>>>>>>>> function according to a Java/Scala/Python API, etc. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> All the examples in Ajantha's references >>>>>>>>>>>>>>>>>>>>>> are pretty much from (1) and I think those have more >>>>>>>>>>>>>>>>>>>>>> analogy to views due >>>>>>>>>>>>>>>>>>>>>> to their SQL nature. 
Agree (2) is not practical to >>>>>>>>>>>>>>>>>>>>>> maintain by Iceberg, but >>>>>>>>>>>>>>>>>>>>>> I think Ajantha's use cases are around (1), and may be >>>>>>>>>>>>>>>>>>>>>> worth evaluating. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> Walaa. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>> On Tue, May 28, 2024 at 9:45 AM Ajantha >>>>>>>>>>>>>>>>>>>>>> Bhat <ajanthab...@gmail.com> wrote: >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> I guess we'll know more when you post >>>>>>>>>>>>>>>>>>>>>> the proposal, but I think this would be a very difficult >>>>>>>>>>>>>>>>>>>>>> area to tackle >>>>>>>>>>>>>>>>>>>>>> across engines, languages, and memory models without >>>>>>>>>>>>>>>>>>>>>> having a huge >>>>>>>>>>>>>>>>>>>>>> performance penalty. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> Assuming Iceberg initially supports SQL >>>>>>>>>>>>>>>>>>>>>> representations of UDFs (similar to views as shared by >>>>>>>>>>>>>>>>>>>>>> the reference links >>>>>>>>>>>>>>>>>>>>>> above), the complexity involved will be similar to >>>>>>>>>>>>>>>>>>>>>> managing views. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> Thanks, Ryan, Robert, and Jack, for your >>>>>>>>>>>>>>>>>>>>>> input. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> We will work on publishing the draft >>>>>>>>>>>>>>>>>>>>>> spec (inspired by the view spec) this week to facilitate >>>>>>>>>>>>>>>>>>>>>> further >>>>>>>>>>>>>>>>>>>>>> discussions. 
>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> - Ajantha >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> On Tue, May 28, 2024 at 7:33 PM Jack Ye < >>>>>>>>>>>>>>>>>>>>>> yezhao...@gmail.com> wrote: >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> > While it would be great to have a >>>>>>>>>>>>>>>>>>>>>> common set of functions across engines, I don't see how >>>>>>>>>>>>>>>>>>>>>> that is practical >>>>>>>>>>>>>>>>>>>>>> when those engines are implemented so differently. >>>>>>>>>>>>>>>>>>>>>> Plugging in code -- and >>>>>>>>>>>>>>>>>>>>>> especially custom user-supplied code -- seems inherently >>>>>>>>>>>>>>>>>>>>>> specialized to me >>>>>>>>>>>>>>>>>>>>>> and should be part of the engines' design. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> How is this different from the views? I >>>>>>>>>>>>>>>>>>>>>> feel we can say exactly the same thing for Iceberg >>>>>>>>>>>>>>>>>>>>>> views, but yet we have >>>>>>>>>>>>>>>>>>>>>> Iceberg multi-dialect views implemented. Maybe it sounds >>>>>>>>>>>>>>>>>>>>>> like we are trying >>>>>>>>>>>>>>>>>>>>>> to draw a line between SQL vs other programming language >>>>>>>>>>>>>>>>>>>>>> as "code"? but I >>>>>>>>>>>>>>>>>>>>>> think SQL is just another type of code, and we are >>>>>>>>>>>>>>>>>>>>>> already talking about >>>>>>>>>>>>>>>>>>>>>> compiling all these different code dialects to an >>>>>>>>>>>>>>>>>>>>>> intermediate >>>>>>>>>>>>>>>>>>>>>> representation (using projects like Coral, Substrait), >>>>>>>>>>>>>>>>>>>>>> which will be stored >>>>>>>>>>>>>>>>>>>>>> as another type of representation of Iceberg view. I >>>>>>>>>>>>>>>>>>>>>> think the same >>>>>>>>>>>>>>>>>>>>>> functionality can be used for UDFs if developed. 
>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> I actually think adding UDF support is a >>>>>>>>>>>>>>>>>>>>>> good idea, even just a multi-dialect one like view, and >>>>>>>>>>>>>>>>>>>>>> that can allow >>>>>>>>>>>>>>>>>>>>>> engines to for example parse a view SQL, and when a >>>>>>>>>>>>>>>>>>>>>> function referenced >>>>>>>>>>>>>>>>>>>>>> cannot be resolved, try to seek for a multi-dialect UDF >>>>>>>>>>>>>>>>>>>>>> definition. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> I guess we can discuss more when we >>>>>>>>>>>>>>>>>>>>>> have the actual proposal published. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> Best, >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> Jack Ye >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> On Tue, May 28, 2024 at 1:32 AM Robert >>>>>>>>>>>>>>>>>>>>>> Stupp <sn...@snazy.de> wrote: >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> UDFs are as engine specific and >>>>>>>>>>>>>>>>>>>>>> portable and "non-centralized" as views are. The same >>>>>>>>>>>>>>>>>>>>>> performance concerns >>>>>>>>>>>>>>>>>>>>>> apply to views as well. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> Iceberg should define a common base >>>>>>>>>>>>>>>>>>>>>> upon which engines can build, so the argument that UDFs >>>>>>>>>>>>>>>>>>>>>> aren't practical, >>>>>>>>>>>>>>>>>>>>>> because engines are different, is probably only a >>>>>>>>>>>>>>>>>>>>>> temporary concern. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> In the long term, Iceberg should also >>>>>>>>>>>>>>>>>>>>>> try to tackle the idea to make views portable, which is >>>>>>>>>>>>>>>>>>>>>> conceptually not >>>>>>>>>>>>>>>>>>>>>> that much different from portable UDFs. 
>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> PS: I'm not a fan of adding a negative >>>>>>>>>>>>>>>>>>>>>> touch to the idea of having UDFs in Iceberg, especially >>>>>>>>>>>>>>>>>>>>>> not in this early >>>>>>>>>>>>>>>>>>>>>> stage. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> On 24.05.24 20:53, Ryan Blue wrote: >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks, Ajantha. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> I'm skeptical about whether it's a >>>>>>>>>>>>>>>>>>>>>> good idea to add UDFs tracked by Iceberg catalogs. I >>>>>>>>>>>>>>>>>>>>>> think that Iceberg >>>>>>>>>>>>>>>>>>>>>> primarily deals with things that are centralized, like >>>>>>>>>>>>>>>>>>>>>> tables of data. >>>>>>>>>>>>>>>>>>>>>> While it would be great to have a common set of >>>>>>>>>>>>>>>>>>>>>> functions across engines, I >>>>>>>>>>>>>>>>>>>>>> don't see how that is practical when those engines are >>>>>>>>>>>>>>>>>>>>>> implemented so >>>>>>>>>>>>>>>>>>>>>> differently. Plugging in code -- and especially custom >>>>>>>>>>>>>>>>>>>>>> user-supplied code >>>>>>>>>>>>>>>>>>>>>> -- seems inherently specialized to me and should be part >>>>>>>>>>>>>>>>>>>>>> of the engines' >>>>>>>>>>>>>>>>>>>>>> design. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> I guess we'll know more when you post >>>>>>>>>>>>>>>>>>>>>> the proposal, but I think this would be a very difficult >>>>>>>>>>>>>>>>>>>>>> area to tackle >>>>>>>>>>>>>>>>>>>>>> across engines, languages, and memory models without >>>>>>>>>>>>>>>>>>>>>> having a huge >>>>>>>>>>>>>>>>>>>>>> performance penalty. 
>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> Ryan >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> On Fri, May 24, 2024 at 8:10 AM >>>>>>>>>>>>>>>>>>>>>> Ajantha Bhat <ajanthab...@gmail.com> wrote: >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi Everyone, >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> This is a discussion to gauge the >>>>>>>>>>>>>>>>>>>>>> community interest in storing versioned SQL UDFs in >>>>>>>>>>>>>>>>>>>>>> Iceberg. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> We want to propose the spec addition >>>>>>>>>>>>>>>>>>>>>> for storing the versioned UDFs in Iceberg (inspired by >>>>>>>>>>>>>>>>>>>>>> view spec). >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> These UDFs can operate similarly to >>>>>>>>>>>>>>>>>>>>>> views in that they are associated with tables, but they >>>>>>>>>>>>>>>>>>>>>> can accept >>>>>>>>>>>>>>>>>>>>>> arguments and produce return values, or even function as >>>>>>>>>>>>>>>>>>>>>> inline expressions. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Many query engines like Dremio, >>>>>>>>>>>>>>>>>>>>>> Trino, Snowflake, and Databricks Spark support SQL UDFs at >>>>>>>>>>>>>>>>>>>>>> the catalog level [1]. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> But storing them in Iceberg can enable >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - Versioning of these UDFs. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - Interoperability between the >>>>>>>>>>>>>>>>>>>>>> engines. Potentially engines can understand the UDFs >>>>>>>>>>>>>>>>>>>>>> written by other >>>>>>>>>>>>>>>>>>>>>> engines (with the translate layer). 
>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> We believe that integrating this >>>>>>>>>>>>>>>>>>>>>> feature into Iceberg would be a valuable addition, and >>>>>>>>>>>>>>>>>>>>>> we're eager to >>>>>>>>>>>>>>>>>>>>>> collaborate with the community to develop a UDF >>>>>>>>>>>>>>>>>>>>>> specification. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Stephen has already begun drafting a >>>>>>>>>>>>>>>>>>>>>> specification to propose to the community. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Let us know your thoughts on this. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> [1] >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Dremio - >>>>>>>>>>>>>>>>>>>>>> https://docs.dremio.com/current/reference/sql/commands/functions#creating-a-function >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Trino - >>>>>>>>>>>>>>>>>>>>>> https://trino.io/docs/current/sql/create-function.html >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Snowflake - >>>>>>>>>>>>>>>>>>>>>> https://docs.snowflake.com/en/developer-guide/udf/sql/udf-sql-scalar-functions >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Databricks - >>>>>>>>>>>>>>>>>>>>>> https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-sql-function.html >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - Ajantha >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> Ryan Blue >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> Tabular >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> Robert Stupp >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>> @snazy >>>>>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>>>>> >> -- >>>>>>>>>>>>>>>>>>>>>> >> Ryan Blue >>>>>>>>>>>>>>>>>>>>>> >> Databricks >>>>>>>>>>>>>>>>>>>>>> 