Re: [DISCUSS] SPIP: FunctionCatalog

Dongjoon Hyun Wed, 17 Feb 2021 09:10:09 -0800

Thank you so much for sharing the progress, Wenchen! Also, thank you,
Hyukjin.


Bests,
Dongjoon.

On Wed, Feb 17, 2021 at 2:49 AM Wenchen Fan <cloud0...@gmail.com> wrote:

> I did a simple benchmark (adding two long values) to compare the
> performance between
> 1. native expression
> 2. the current UDF
> 3. new UDF with individual parameters
> 4. new UDF with a row parameter (with the row object cached)
> 5. invoke a static method (to explore the possibility of speeding up
> stateless UDF, not very related to the current topic)
>
> The benchmark code can be found here
> <https://gist.github.com/cloud-fan/f88baf770fa0c6f9ad312e8c92ff6c21>. The
> result is
>
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_161-b12 on Mac OS X 10.14.6
> Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
> UDF perf:             Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
> ---------------------------------------------------------------------------------------------------
> native add                    14206          14516         535        70.4          14.2       1.0X
> udf add                       24609          25271         898        40.6          24.6       0.6X
> new udf add                   18657          19096         726        53.6          18.7       0.8X
> new row udf add               21128          22343        1478        47.3          21.1       0.7X
> static udf add                16678          16887         278        60.0          16.7       0.9X
>
>
> The new UDF with individual parameters is faster than the current UDF,
> because the virtual function call is eliminated. It's also faster than the
> row parameter version because of no overhead to set/get row fields.
>
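To make the difference between options 3 and 4 concrete, here is a minimal Java sketch of the two calling conventions. `RowBuffer` is a hypothetical stand-in for a cached row object (it is not Spark's InternalRow), and the class and method names are illustrative only; they are not taken from the benchmark gist.

```java
public class UdfCallStyles {
    /** Hypothetical reusable row buffer; a simplified stand-in, not a Spark class. */
    public static final class RowBuffer {
        private final long[] values;

        public RowBuffer(int numFields) {
            this.values = new long[numFields];
        }

        public void setLong(int ordinal, long value) {
            values[ordinal] = value;
        }

        public long getLong(int ordinal) {
            return values[ordinal];
        }
    }

    /** Individual-parameters style: the input columns arrive as method arguments. */
    public static long addDirect(long x, long y) {
        return x + y;
    }

    /** Row-parameter style: the UDF binds its inputs by ordinal at runtime. */
    public static long addFromRow(RowBuffer row) {
        return row.getLong(0) + row.getLong(1);
    }

    public static void main(String[] args) {
        // Direct call: no intermediate object is written or read.
        long direct = addDirect(3L, 4L);

        // Row call: the caller fills the cached row, then the UDF reads it back.
        RowBuffer cached = new RowBuffer(2);
        cached.setLong(0, 3L);
        cached.setLong(1, 4L);
        long viaRow = addFromRow(cached);

        System.out.println(direct + " " + viaRow);
    }
}
```

The row version does two extra writes and two extra reads per call even when the row object is reused, which is the set/get overhead mentioned above.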
> I prefer the individual-parameters version, not only because of the
> performance gain (10% is not a big win), but also because:
> 1. It's coherent with the current Scala/Java UDF API
> 2. It's simpler for developers to write simple UDFs (parameters are the
> input columns directly).
> 3. It's possible to allow multiple java types for one catalyst type, e.g.
> allowing both String and UTF8String, which is more flexible.
>
> One major issue is not supporting varargs, but I'm not sure how
> important this feature is. As I mentioned before, users can work around it
> by accepting struct-type input and using the `struct` function to build the
> input column. The current Scala/Java UDF doesn't support varargs either,
> which is the same as in Presto/Transport.
>
> I'm fine with having an optional trait or flag to support varargs by
> accepting InternalRow as the input, if there are user requests.
>
> About debugging, I don't see a big issue here as the process of calling
> the new UDF is very similar to the current Scala/Java UDF. Please let me
> know if there are existing complaints about debugging the current
> Scala/Java UDF. I think the row-parameter version is even harder to debug,
> as the column binding happens in the user code (e.g. row.getLong(index))
> which is totally runtime, while the individual-parameters version has a
> query-compile-time check to make sure the function signature matches the
> input columns.
>
> I can help to come up with detailed rules about null handling, type
> matching, etc. for the individual-parameters UDF, if we all agree with this
> direction.
>
> Last but not least, calling methods via reflection (looking up the method
> handle only needs to be done once per task) is not that slow in modern
> JVMs. The non-codegen path is already around 10x slower, so I don't think
> a little overhead from Java reflection matters.
>
>
>
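A small Java sketch of the "look up once, call many times" pattern described above. It uses `java.lang.reflect.Method` as one possible mechanism; the `LongAdd` class and the method name `invoke` are assumptions made for illustration and are not part of any proposed API.

```java
import java.lang.reflect.Method;

public class ReflectiveInvoke {
    /** Hypothetical user UDF; the method name "invoke" is an assumption. */
    public static class LongAdd {
        public long invoke(long x, long y) {
            return x + y;
        }
    }

    /** Resolves the method once, then reuses the same Method object for every row. */
    public static long sumAll(Object udf, long[] xs, long[] ys) {
        try {
            // One-time lookup (for example, once per task).
            Method method = udf.getClass().getMethod("invoke", long.class, long.class);
            long total = 0L;
            for (int i = 0; i < xs.length; i++) {
                // Per-row cost: one reflective call, with boxing of arguments and result.
                total += (Long) method.invoke(udf, xs[i], ys[i]);
            }
            return total;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("UDF method lookup or call failed", e);
        }
    }

    public static void main(String[] args) {
        long[] xs = {1L, 2L};
        long[] ys = {10L, 20L};
        System.out.println(sumAll(new LongAdd(), xs, ys));
    }
}
```

Only the lookup is hoisted out of the loop here; the per-row reflective call still boxes its arguments, so this shows the shape of the non-codegen path rather than measuring it.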
> On Wed, Feb 17, 2021 at 3:07 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
>> Just to make sure we don’t move past, I think we haven’t decided yet:
>>
>>    - if we’ll replace the current proposal with Wenchen’s approach as the
>>    default
>>    - if we want to have Wenchen’s approach as an optional mix-in on the
>>    top of Ryan’s proposal (SupportsInvoke)
>>
>> From what I read, some people pointed to it as a replacement. Please
>> correct me if I misread this discussion thread.
>> As Dongjoon pointed out, it would be good to know a rough ETA so that we
>> keep making progress on this and people can compare more easily.
>>
>>
>> FWIW, there’s the saying I like in the zen of Python
>> <https://www.python.org/dev/peps/pep-0020/>:
>>
>> There should be one— and preferably only one —obvious way to do it.
>>
>> If multiple approaches give developers a way to do (almost) the
>> same thing, I would prefer to avoid that.
>>
>> In addition, I would prefer to focus on what Spark does by default first.
>>
>>
>> On Wed, Feb 17, 2021 at 2:33 PM, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>
>>> Hi, Wenchen.
>>>
>>> This thread seems to get enough attention. Also, I'm expecting more and
>>> more if we have this on the `master` branch because we are developing
>>> together.
>>>
>>>     > Spark SQL has many active contributors/committers and this thread
>>> doesn't get much attention yet.
>>>
>>> So, what's your ETA from now?
>>>
>>>     > I think the problem here is we were discussing some very detailed
>>> things without actual code.
>>>     > I'll implement my idea after the holiday and then we can have more
>>> effective discussions.
>>>     > We can also do benchmarks and get some real numbers.
>>>     > In the meantime, we can continue to discuss other parts of this
>>> proposal, and make a prototype if possible.
>>>
>>> I'm looking forward to seeing your PR. I hope we can conclude this
>>> thread and have at least one implementation in the `master` branch this
>>> month (February).
>>> If you need more time (one month or longer), why don't we have Ryan's
>>> suggestion in the `master` branch first and benchmark with your PR later
>>> during the Apache Spark 3.2 timeframe?
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Tue, Feb 16, 2021 at 9:26 AM Ryan Blue <rb...@netflix.com.invalid>
>>> wrote:
>>>
>>>> Andrew,
>>>>
>>>> The proposal already includes an API for aggregate functions and I
>>>> think we would want to implement those right away.
>>>>
>>>> Processing ColumnBatch is something we can easily extend the interfaces
>>>> to support, similar to Wenchen's suggestion. The important thing right now
>>>> is to agree on some basic functionality: how to look up functions and what
>>>> the simple API should be. Like the TableCatalog interfaces, we will layer
>>>> on more support through optional interfaces like `SupportsInvoke` or
>>>> `SupportsColumnBatch`.
>>>>
>>>> On Tue, Feb 16, 2021 at 9:00 AM Andrew Melo <andrew.m...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello Ryan,
>>>>>
>>>>> This proposal looks very interesting. Would future goals for this
>>>>> functionality include both support for aggregation functions, as well
>>>>> as support for processing ColumnBatch-es (instead of Row/InternalRow)?
>>>>>
>>>>> Thanks
>>>>> Andrew
>>>>>
>>>>> On Mon, Feb 15, 2021 at 12:44 PM Ryan Blue <rb...@netflix.com.invalid>
>>>>> wrote:
>>>>> >
>>>>> > Thanks for the positive feedback, everyone. It sounds like there is
>>>>> a clear path forward for calling functions. Even without a prototype, the
>>>>> `invoke` plans show that Wenchen's suggested optimization can be done, and
>>>>> incorporating it as an optional extension to this proposal solves many of
>>>>> the unknowns.
>>>>> >
>>>>> > With that area now understood, is there any discussion about other
>>>>> parts of the proposal, besides the function call interface?
>>>>> >
>>>>> > On Fri, Feb 12, 2021 at 10:40 PM Chao Sun <sunc...@apache.org>
>>>>> wrote:
>>>>> >>
>>>>> >> This is an important feature which can unblock several other
>>>>> projects including bucket join support for DataSource v2, complete support
>>>>> for enforcing DataSource v2 distribution requirements on the write path,
>>>>> etc. I like Ryan's proposals which look simple and elegant, with nice
>>>>> support on function overloading and variadic arguments. On the other hand,
>>>>> I think Wenchen made a very good point about performance. Overall, I'm
>>>>> excited to see active discussions on this topic and believe the community
>>>>> will come to a proposal with the best of both sides.
>>>>> >>
>>>>> >> Chao
>>>>> >>
>>>>> >> On Fri, Feb 12, 2021 at 7:58 PM Hyukjin Kwon <gurwls...@gmail.com>
>>>>> wrote:
>>>>> >>>
>>>>> >>> +1 for Liang-chi's.
>>>>> >>>
>>>>> >>> Thanks Ryan and Wenchen for leading this.
>>>>> >>>
>>>>> >>>
>>>>> >>> On Sat, Feb 13, 2021 at 12:18 PM, Liang-Chi Hsieh <vii...@gmail.com> wrote:
>>>>> >>>>
>>>>> >>>> Basically I think the proposal makes sense to me and I'd like to
>>>>> support the
>>>>> >>>> SPIP as it looks like we have strong need for the important
>>>>> feature.
>>>>> >>>>
>>>>> >>>> Thanks Ryan for working on this and I do also look forward to
>>>>> Wenchen's
>>>>> >>>> implementation. Thanks for the discussion too.
>>>>> >>>>
>>>>> >>>> Actually I think the SupportsInvoke proposed by Ryan looks like a good
>>>>> >>>> alternative to me. Besides Wenchen's alternative implementation, is
>>>>> >>>> there a chance we also have the SupportsInvoke for comparison?
>>>>> >>>>
>>>>> >>>>
>>>>> >>>> John Zhuge wrote
>>>>> >>>> > Excited to see our Spark community rallying behind this
>>>>> important feature!
>>>>> >>>> >
>>>>> >>>> > The proposal lays a solid foundation of minimal feature set
>>>>> with careful
>>>>> >>>> > considerations for future optimizations and extensions. Can't
>>>>> wait to see
>>>>> >>>> > it leading to more advanced functionalities like views with
>>>>> shared custom
>>>>> >>>> > functions, function pushdown, lambda, etc. It has already borne
>>>>> fruit from
>>>>> >>>> > the constructive collaborations in this thread. Looking forward
>>>>> to
>>>>> >>>> > Wenchen's prototype and further discussions including the
>>>>> SupportsInvoke
>>>>> >>>> > extension proposed by Ryan.
>>>>> >>>> >
>>>>> >>>> >
>>>>> >>>> > On Fri, Feb 12, 2021 at 4:35 PM Owen O'Malley <owen.omalley@> wrote:
>>>>> >>>> >
>>>>> >>>> >> I think this proposal is a very good thing giving Spark a
>>>>> standard way of
>>>>> >>>> >> getting to and calling UDFs.
>>>>> >>>> >>
>>>>> >>>> >> I like having the ScalarFunction as the API to call the UDFs.
>>>>> It is
>>>>> >>>> >> simple, yet covers all of the polymorphic type cases well. I
>>>>> think it
>>>>> >>>> >> would
>>>>> >>>> >> also simplify using the functions in other contexts like
>>>>> pushing down
>>>>> >>>> >> filters into the ORC & Parquet readers although there are a
>>>>> lot of
>>>>> >>>> >> details
>>>>> >>>> >> that would need to be considered there.
>>>>> >>>> >>
>>>>> >>>> >> .. Owen
>>>>> >>>> >>
>>>>> >>>> >>
>>>>> >>>> >> On Fri, Feb 12, 2021 at 11:07 PM Erik Krogen <ekrogen@.com> wrote:
>>>>> >>>> >>
>>>>> >>>> >>> I agree that there is a strong need for a FunctionCatalog
>>>>> within Spark
>>>>> >>>> >>> to
>>>>> >>>> >>> provide support for shareable UDFs, as well as make movement
>>>>> towards
>>>>> >>>> >>> more
>>>>> >>>> >>> advanced functionality like views which themselves depend on
>>>>> UDFs, so I
>>>>> >>>> >>> support this SPIP wholeheartedly.
>>>>> >>>> >>>
>>>>> >>>> >>> I find both of the proposed UDF APIs to be sufficiently
>>>>> user-friendly
>>>>> >>>> >>> and
>>>>> >>>> >>> extensible. I generally think Wenchen's proposal is easier
>>>>> for a user to
>>>>> >>>> >>> work with in the common case, but has greater potential for
>>>>> confusing
>>>>> >>>> >>> and
>>>>> >>>> >>> hard-to-debug behavior due to use of reflective method
>>>>> signature
>>>>> >>>> >>> searches.
>>>>> >>>> >>> The merits on both sides can hopefully be more properly
>>>>> examined with
>>>>> >>>> >>> code,
>>>>> >>>> >>> so I look forward to seeing an implementation of Wenchen's
>>>>> ideas to
>>>>> >>>> >>> provide
>>>>> >>>> >>> a more concrete comparison. I am optimistic that we will not
>>>>> let the
>>>>> >>>> >>> debate
>>>>> >>>> >>> over this point unreasonably stall the SPIP from making
>>>>> progress.
>>>>> >>>> >>>
>>>>> >>>> >>> Thank you to both Wenchen and Ryan for your detailed
>>>>> consideration and
>>>>> >>>> >>> evaluation of these ideas!
>>>>> >>>> >>> ------------------------------
>>>>> >>>> >>> *From:* Dongjoon Hyun <dongjoon.hyun@>
>>>>> >>>> >>> *Sent:* Wednesday, February 10, 2021 6:06 PM
>>>>> >>>> >>> *To:* Ryan Blue <blue@>
>>>>> >>>> >>> *Cc:* Holden Karau <holden@>; Hyukjin Kwon <gurwls223@>;
>>>>> >>>> >>> Spark Dev List <dev@.apache>; Wenchen Fan <cloud0fan@>
>>>>> >>>> >>> *Subject:* Re: [DISCUSS] SPIP: FunctionCatalog
>>>>> >>>> >>>
>>>>> >>>> >>> BTW, I forgot to add my opinion explicitly in this thread
>>>>> because I was
>>>>> >>>> >>> on the PR before this thread.
>>>>> >>>> >>>
>>>>> >>>> >>> 1. The `FunctionCatalog API` PR was made on May 9, 2019 and
>>>>> has been
>>>>> >>>> >>> there for almost two years.
>>>>> >>>> >>> 2. I already gave my +1 on that PR last Saturday because I
>>>>> agreed with
>>>>> >>>> >>> the latest updated design docs and AS-IS PR.
>>>>> >>>> >>>
>>>>> >>>> >>> And, the rest of the progress in this thread is also very
>>>>> satisfying to
>>>>> >>>> >>> me.
>>>>> >>>> >>> (e.g. Ryan's extension suggestion and Wenchen's alternative)
>>>>> >>>> >>>
>>>>> >>>> >>> To All:
>>>>> >>>> >>> Please take a look at the design doc and the PR, and give us
>>>>> some
>>>>> >>>> >>> opinions.
>>>>> >>>> >>> We really need your participation in order to make DSv2 more
>>>>> complete.
>>>>> >>>> >>> This will unblock other DSv2 features, too.
>>>>> >>>> >>>
>>>>> >>>> >>> Bests,
>>>>> >>>> >>> Dongjoon.
>>>>> >>>> >>>
>>>>> >>>> >>>
>>>>> >>>> >>>
>>>>> >>>> >>> On Wed, Feb 10, 2021 at 10:58 AM Dongjoon Hyun <dongjoon.hyun@> wrote:
>>>>> >>>> >>>
>>>>> >>>> >>> Hi, Ryan.
>>>>> >>>> >>>
>>>>> >>>> >>> We didn't move past anything (both yours and Wenchen's). What
>>>>> Wenchen
>>>>> >>>> >>> suggested is double-checking the alternatives with the
>>>>> implementation to
>>>>> >>>> >>> give more momentum to our discussion.
>>>>> >>>> >>>
>>>>> >>>> >>> Your new suggestion about an optional extension also sounds like
>>>>> >>>> >>> a reasonable new alternative to me.
>>>>> >>>> >>>
>>>>> >>>> >>> We are still discussing this topic together and I hope we can
>>>>> >>>> >>> reach a conclusion this time (for Apache Spark 3.2) without
>>>>> >>>> >>> getting stuck like last time.
>>>>> >>>> >>>
>>>>> >>>> >>> I really appreciate your leadership in this discussion, and the
>>>>> >>>> >>> direction it is moving in looks constructive to me. Let's give
>>>>> >>>> >>> some time to the alternatives.
>>>>> >>>> >>>
>>>>> >>>> >>> Bests,
>>>>> >>>> >>> Dongjoon.
>>>>> >>>> >>>
>>>>> >>>> >>> On Wed, Feb 10, 2021 at 10:14 AM Ryan Blue <blue@> wrote:
>>>>> >>>> >>>
>>>>> >>>> >>> I don’t think we should so quickly move past the drawbacks of
>>>>> this
>>>>> >>>> >>> approach. The problems are significant enough that using
>>>>> invoke is not
>>>>> >>>> >>> sufficient on its own. But, I think we can add it as an
>>>>> optional
>>>>> >>>> >>> extension
>>>>> >>>> >>> to shore up the weaknesses.
>>>>> >>>> >>>
>>>>> >>>> >>> Here’s a summary of the drawbacks:
>>>>> >>>> >>>
>>>>> >>>> >>>    - Magic function signatures are error-prone
>>>>> >>>> >>>    - Spark would need considerable code to help users find
>>>>> what went
>>>>> >>>> >>>    wrong
>>>>> >>>> >>>    - Spark would likely need to coerce arguments (e.g.,
>>>>> String,
>>>>> >>>> >>>    Option[Int]) for usability
>>>>> >>>> >>>    - It is unclear how Spark will find the Java Method to call
>>>>> >>>> >>>    - Use cases that require varargs fall back to casting;
>>>>> users will
>>>>> >>>> >>>    also get this wrong (cast to String instead of UTF8String)
>>>>> >>>> >>>    - The non-codegen path is significantly slower
>>>>> >>>> >>>
>>>>> >>>> >>> The benefit of invoke is to avoid moving data into a row,
>>>>> like this:
>>>>> >>>> >>>
>>>>> >>>> >>> -- using invoke
>>>>> >>>> >>> int result = udfFunction(x, y)
>>>>> >>>> >>>
>>>>> >>>> >>> -- using row
>>>>> >>>> >>> udfRow.update(0, x); -- actual: values[0] = x;
>>>>> >>>> >>> udfRow.update(1, y);
>>>>> >>>> >>> int result = udfFunction(udfRow);
>>>>> >>>> >>>
>>>>> >>>> >>> And, again, that won’t actually help much in cases that
>>>>> require varargs.
>>>>> >>>> >>>
>>>>> >>>> >>> I suggest we add a new marker trait for BoundMethod called
>>>>> >>>> >>> SupportsInvoke.
>>>>> >>>> >>> If that interface is implemented, then Spark will look for a
>>>>> method that
>>>>> >>>> >>> matches the expected signature based on the bound input type.
>>>>> If it
>>>>> >>>> >>> isn’t
>>>>> >>>> >>> found, Spark can print a warning and fall back to the
>>>>> InternalRow call:
>>>>> >>>> >>> “Cannot find udfFunction(int, int)”.
>>>>> >>>> >>>
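A rough Java sketch of that fallback, under stated assumptions: `RowFunction` and its `int[]` row are simplified placeholders for the proposal's row-based call, `SupportsInvoke` is modeled as an empty marker interface, and looking the method up by the name `udfFunction` is an assumption about how "matches the expected signature" could work. None of this is the proposed API.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;

public class InvokeFallback {
    /** Hypothetical row-based contract that every function implements. */
    public interface RowFunction {
        int produceResult(int[] row);
    }

    /** Hypothetical marker interface that opts in to the direct-call lookup. */
    public interface SupportsInvoke {}

    /** Opts in and provides a method with the expected signature. */
    public static class AddWithInvoke implements RowFunction, SupportsInvoke {
        public int udfFunction(int x, int y) {
            return x + y;
        }

        @Override
        public int produceResult(int[] row) {
            return row[0] + row[1];
        }
    }

    /** Opts in but provides no matching method, so the row call is used. */
    public static class AddRowOnly implements RowFunction, SupportsInvoke {
        @Override
        public int produceResult(int[] row) {
            return row[0] + row[1];
        }
    }

    /** Tries the direct method first; warns and falls back to the row call if it is missing. */
    public static int call(RowFunction fn, int x, int y) {
        if (fn instanceof SupportsInvoke) {
            try {
                // A real implementation would resolve this once at bind time, not per call.
                Method method = fn.getClass().getMethod("udfFunction", int.class, int.class);
                return (Integer) method.invoke(fn, x, y);
            } catch (NoSuchMethodException e) {
                System.err.println("Cannot find udfFunction(int, int); using the row call");
            } catch (IllegalAccessException | InvocationTargetException e) {
                throw new IllegalStateException(e);
            }
        }
        return fn.produceResult(new int[] {x, y});
    }

    public static void main(String[] args) {
        System.out.println(call(new AddWithInvoke(), 2, 3));
        System.out.println(call(new AddRowOnly(), 2, 3));
    }
}
```

Both calls produce the same result; the second one takes the row path after printing the warning, so a missing or mistyped method degrades performance instead of failing the query.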
>>>>> >>>> >>> This approach allows the invoke optimization, but solves many
>>>>> of the
>>>>> >>>> >>> problems:
>>>>> >>>> >>>
>>>>> >>>> >>>    - The method to invoke is found using the proposed load
>>>>> and bind
>>>>> >>>> >>>    approach
>>>>> >>>> >>>    - Magic function signatures are optional and do not cause
>>>>> runtime
>>>>> >>>> >>>    failures
>>>>> >>>> >>>    - Because this is an optional optimization, Spark can be
>>>>> more strict
>>>>> >>>> >>>    about types
>>>>> >>>> >>>    - Varargs cases can still use rows
>>>>> >>>> >>>    - Non-codegen can use an evaluation method rather than
>>>>> falling back
>>>>> >>>> >>>    to slow Java reflection
>>>>> >>>> >>>
>>>>> >>>> >>> This seems like a good extension to me; this provides a plan
>>>>> for
>>>>> >>>> >>> optimizing the UDF call to avoid building a row, while the
>>>>> existing
>>>>> >>>> >>> proposal covers the other cases well and addresses how to
>>>>> locate these
>>>>> >>>> >>> function calls.
>>>>> >>>> >>>
>>>>> >>>> >>> This also highlights that the approach used in DSv2 and this
>>>>> proposal is
>>>>> >>>> >>> working: start small and use extensions to layer on more
>>>>> complex
>>>>> >>>> >>> support.
>>>>> >>>> >>>
>>>>> >>>> >>> On Wed, Feb 10, 2021 at 9:04 AM Dongjoon Hyun <dongjoon.hyun@> wrote:
>>>>> >>>> >>>
>>>>> >>>> >>> Thank you all for making a giant move forward for Apache
>>>>> Spark 3.2.0.
>>>>> >>>> >>> I'm really looking forward to seeing Wenchen's implementation.
>>>>> >>>> >>> That would be greatly helpful to make a decision!
>>>>> >>>> >>>
>>>>> >>>> >>> > I'll implement my idea after the holiday and then we can
>>>>> have
>>>>> >>>> >>> more effective discussions. We can also do benchmarks and get
>>>>> some real
>>>>> >>>> >>> numbers.
>>>>> >>>> >>> > FYI: the Presto UDF API
>>>>> >>>> >>> <https://prestodb.io/docs/current/develop/functions.html> also
>>>>> >>>> >>> takes individual parameters instead of the row parameter. I think
>>>>> >>>> >>> this direction is at least worth a try so that we can see the
>>>>> >>>> >>> performance difference. It's also mentioned in the design doc as
>>>>> >>>> >>> an alternative (Trino).
>>>>> >>>> >>>
>>>>> >>>> >>> Bests,
>>>>> >>>> >>> Dongjoon.
>>>>> >>>> >>>
>>>>> >>>> >>>
>>>>> >>>> >>> On Tue, Feb 9, 2021 at 10:18 PM Wenchen Fan <cloud0fan@> wrote:
>>>>> >>>> >>>
>>>>> >>>> >>> FYI: the Presto UDF API
>>>>> >>>> >>> <https://prestodb.io/docs/current/develop/functions.html> also
>>>>> >>>> >>> takes individual parameters instead of the row parameter. I think
>>>>> >>>> >>> this direction is at least worth a try so that we can see the
>>>>> >>>> >>> performance difference. It's also mentioned in the design doc as
>>>>> >>>> >>> an alternative (Trino).
>>>>> >>>> >>>
>>>>> >>>> >>> On Wed, Feb 10, 2021 at 10:18 AM Wenchen Fan <cloud0fan@> wrote:
>>>>> >>>> >>>
>>>>> >>>> >>> Hi Holden,
>>>>> >>>> >>>
>>>>> >>>> >>> As Hyukjin said, following existing designs is not the
>>>>> principle of DS
>>>>> >>>> >>> v2
>>>>> >>>> >>> API design. We should make sure the DS v2 API makes sense.
>>>>> AFAIK we
>>>>> >>>> >>> didn't
>>>>> >>>> >>> fully follow the catalog API design from Hive and I believe
>>>>> Ryan also
>>>>> >>>> >>> agrees with it.
>>>>> >>>> >>>
>>>>> >>>> >>> I think the problem here is we were discussing some very
>>>>> detailed things
>>>>> >>>> >>> without actual code. I'll implement my idea after the holiday
>>>>> and then
>>>>> >>>> >>> we
>>>>> >>>> >>> can have more effective discussions. We can also do
>>>>> benchmarks and get
>>>>> >>>> >>> some
>>>>> >>>> >>> real numbers.
>>>>> >>>> >>>
>>>>> >>>> >>> In the meantime, we can continue to discuss other parts of
>>>>> this
>>>>> >>>> >>> proposal,
>>>>> >>>> >>> and make a prototype if possible. Spark SQL has many active
>>>>> >>>> >>> contributors/committers and this thread doesn't get much
>>>>> attention yet.
>>>>> >>>> >>>
>>>>> >>>> >>> On Wed, Feb 10, 2021 at 6:17 AM Hyukjin Kwon <gurwls223@> wrote:
>>>>> >>>> >>>
>>>>> >>>> >>> Just dropping a few lines. I remember that one of the goals in
>>>>> >>>> >>> DSv2 is to correct the mistakes we made in the current Spark code.
>>>>> >>>> >>> There would not be much point if we just follow and mimic what
>>>>> >>>> >>> Spark currently does. It might just end up as another copy of the
>>>>> >>>> >>> Spark APIs, e.g. the Expression (internal) APIs. I sincerely would
>>>>> >>>> >>> like to avoid this.
>>>>> >>>> >>> I do believe we have been stuck mainly because we are trying to
>>>>> >>>> >>> come up with a better design. We already have an ugly picture of
>>>>> >>>> >>> the current Spark APIs from which to draw a better, bigger picture.
>>>>> >>>> >>>
>>>>> >>>> >>>
>>>>> >>>> >>> On Wed, Feb 10, 2021 at 3:28 AM, Holden Karau <holden@> wrote:
>>>>> >>>> >>>
>>>>> >>>> >>> I think this proposal is a good set of trade-offs and has
>>>>> existed in the
>>>>> >>>> >>> community for a long period of time. I especially appreciate
>>>>> how the
>>>>> >>>> >>> design
>>>>> >>>> >>> is focused on a minimal useful component, with future
>>>>> optimizations
>>>>> >>>> >>> considered from a point of view of making sure it's flexible,
>>>>> but actual
>>>>> >>>> >>> concrete decisions left for the future once we see how this
>>>>> API is used.
>>>>> >>>> >>> I
>>>>> >>>> >>> think if we try and optimize everything right out of the
>>>>> gate, we'll
>>>>> >>>> >>> quickly get stuck (again) and not make any progress.
>>>>> >>>> >>>
>>>>> >>>> >>> On Mon, Feb 8, 2021 at 10:46 AM Ryan Blue <blue@> wrote:
>>>>> >>>> >>>
>>>>> >>>> >>> Hi everyone,
>>>>> >>>> >>>
>>>>> >>>> >>> I'd like to start a discussion for adding a FunctionCatalog
>>>>> interface to
>>>>> >>>> >>> catalog plugins. This will allow catalogs to expose functions
>>>>> to Spark,
>>>>> >>>> >>> similar to how the TableCatalog interface allows a catalog to
>>>>> expose
>>>>> >>>> >>> tables. The proposal doc is available here:
>>>>> >>>> >>>
>>>>> https://docs.google.com/document/d/1PLBieHIlxZjmoUB0ERF-VozCRJ0xw2j3qKvUNWpWA2U/edit
>>>>> >>>> >>>
>>>>> >>>> >>> Here's a high-level summary of some of the main design
>>>>> choices:
>>>>> >>>> >>> * Adds the ability to list and load functions, not to create
>>>>> or modify
>>>>> >>>> >>> them in an external catalog
>>>>> >>>> >>> * Supports scalar, aggregate, and partial aggregate functions
>>>>> >>>> >>> * Uses load and bind steps for better error messages and
>>>>> simpler
>>>>> >>>> >>> implementations
>>>>> >>>> >>> * Like the DSv2 table read and write APIs, it uses
>>>>> InternalRow to pass
>>>>> >>>> >>> data
>>>>> >>>> >>> * Can be extended using mix-in interfaces to add
>>>>> vectorization, codegen,
>>>>> >>>> >>> and other future features
>>>>> >>>> >>>
>>>>> >>>> >>> There is also a PR with the proposed API:
>>>>> >>>> >>> https://github.com/apache/spark/pull/24559/files
>>>>> >>>> >>>
>>>>> >>>> >>> Let's discuss the proposal here rather than on that PR, to
>>>>> get better
>>>>> >>>> >>> visibility. Also, please take the time to read the proposal
>>>>> first. That
>>>>> >>>> >>> really helps clear up misconceptions.
>>>>> >>>> >>>
>>>>> >>>> >>>
>>>>> >>>> >>>
>>>>> >>>> >>> --
>>>>> >>>> >>> Ryan Blue
>>>>> >>>> >>>
>>>>> >>>> >>>
>>>>> >>>> >>>
>>>>> >>>> >>> --
>>>>> >>>> >>> Twitter: https://twitter.com/holdenkarau
>>>>> >>>> >>> Books (Learning Spark, High Performance Spark, etc.):
>>>>> >>>> >>> https://amzn.to/2MaRAG9
>>>>> >>>> >>> YouTube Live Streams:
>>>>> https://www.youtube.com/user/holdenkarau
>>>>> >>>> >>>
>>>>> >>>> >>> --
>>>>> >>>> >>> Ryan Blue
>>>>> >>>> >>>
>>>>> >>>> >>>
>>>>> >>>> >
>>>>> >>>> > --
>>>>> >>>> > John Zhuge
>>>>> >>>>
>>>>> >>>>
>>>>> >>>>
>>>>> >>>>
>>>>> >>>>
>>>>> >>>> --
>>>>> >>>> Sent from:
>>>>> http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>>> >>>>
>>>>> >>>>
>>>>> ---------------------------------------------------------------------
>>>>> >>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>> >>>>
>>>>> >
>>>>> >
>>>>> > --
>>>>> > Ryan Blue
>>>>> > Software Engineer
>>>>> > Netflix
>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
