+1. As I already proposed, we can move forward with PRs:

> To move forward, how about we implement the function loading and binding first? Then we can have PRs for both the individual-parameters (I can take it) and row-parameter approaches, if we still can't reach a consensus at that time and need to see all the details.
Ryan, can we focus on the function loading and binding part and get it committed first? I can also fork your branch and put everything together, but that might be too big to review.

On Tue, Feb 23, 2021 at 4:35 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

> I've still been supporting Ryan's SPIP (the original PR and its extension proposal discussed here) because of its simplicity.
>
> From the context of this email thread, I also understand the different perspectives, like Hyukjin's concerns about having multiple ways and Wenchen's proposal and rationale.
>
> It looks like we need more discussion to reach an agreement, and the technical details are becoming more difficult to track because this is an email thread.
>
> Although Ryan initially suggested discussing this on the Apache email thread instead of the PR, can we have a PR to discuss?
>
> In particular, Wenchen, could you make your PR based on Ryan's PR?
>
> If we collect the scattered ideas into a single PR, that would be greatly helpful not only for further discussion, but also when we go to a vote on Ryan's PR or Wenchen's PR.
>
> Bests,
> Dongjoon.
>
> On Mon, Feb 22, 2021 at 1:16 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> Hi Walaa,
>>
>> Thanks for sharing this! The type signature stuff is already covered by the unbound UDF API, which specifies the input and output data types. The problem is how to check the method signature of the bound UDF. As you said, Java has type erasure and we can't check `List<String>`, for example.
>>
>> My initial proposal is to do nothing and simply pass the Spark ArrayData, MapData, or InternalRow to the UDF. This requires the UDF developers to ensure the types match, as they need to call something like `array.getLong(index)` with the correct type. It's as bad as the row-parameter version, but that seems fine as it only happens with nested types. And the type check is still done for the first level (the method signature must use ArrayData/MapData/InternalRow at least).
>>
>> We can allow more types in the future to make the type check better. It might be too detailed for this discussion thread, but here are a few thoughts:
>> 1. Java arrays don't have type erasure. We can use UTF8String[], for example, if the input type is array of string.
>> 2. For struct types, we can allow Java beans/Scala case classes if the field names and types match the type signature.
>> 3. For map types, a map is effectively struct<keys: array<key_type>, values: array<value_type>>, so we can also allow Java beans/Scala case classes here.
>>
>> The general idea is to use constructs that retain nested type information at compile time, i.e., arrays, Java beans, and case classes.
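>>
>> To illustrate point 1 (a rough sketch only: the method shapes are mine, not a settled API, and it assumes org.apache.spark.sql.catalyst.util.ArrayData and org.apache.spark.unsafe.types.UTF8String), a "first element of an array of strings" UDF would differ like this:
>>
>>   // status quo: Spark passes its internal ArrayData, and the developer
>>   // must read elements with the right type and index themselves
>>   public UTF8String produceResult(ArrayData input) {
>>     return input.getUTF8String(0);
>>   }
>>
>>   // with a Java array, the element type survives at compile time,
>>   // so the method signature itself can be checked
>>   public UTF8String produceResult(UTF8String[] input) {
>>     return input[0];
>>   }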
>>
>> Thanks,
>> Wenchen
>>
>> On Mon, Feb 22, 2021 at 3:47 PM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>>
>>> Wenchen, in Transport, users provide the input parameter signatures and the output parameter signature as part of the API. Compile-time checks are done by parsing the type signatures and matching them against the type tree received at compile time. This also helps with inferring the concrete output type.
>>>
>>> The specification in the UDF API looks like this:
>>>
>>>   @Override
>>>   public List<String> getInputParameterSignatures() {
>>>     return ImmutableList.of(
>>>         "ARRAY(K)",
>>>         "ARRAY(V)"
>>>     );
>>>   }
>>>
>>>   @Override
>>>   public String getOutputParameterSignature() {
>>>     return "MAP(K,V)";
>>>   }
>>>
>>> The benefits of this style of type signature specification, as opposed to inferring types from the Java type signatures of the Java method, are:
>>>
>>> - For nested types, Java type erasure eliminates the information about the nested types, so for something like my_function(List<String> a1, List<Integer> a2), the value of both a1.class and a2.class is just List. However, we are planning to work around this in a future version <https://github.com/linkedin/transport/tree/transport-api-v1/transportable-udfs-examples/transportable-udfs-example-udfs/src/main/java/com/linkedin/transport/examples> for Array and Map types. Struct types are discussed in the next point.
>>> - Without pre-code-generation, there is no single Java type signature from which we can capture the Struct info. However, Struct info can be expressed in type signatures of the above form, e.g., ROW(FirstName VARCHAR, LastName VARCHAR).
>>>
>>> When a Transport UDF represents a Spark UDF, the type signatures are matched against Spark native types, i.e., org.apache.spark.sql.types.{ArrayType, MapType, StructType}, and primitive types. The function that parses/compiles type signatures is found in AbstractTypeInference <https://github.com/linkedin/transport/blob/master/transportable-udfs-type-system/src/main/java/com/linkedin/transport/typesystem/AbstractTypeInference.java>. This class is the generic component common to all supported engines. Its Spark-specific extension is SparkTypeInference <https://github.com/linkedin/transport/blob/master/transportable-udfs-spark/src/main/scala/com/linkedin/transport/spark/typesystem/SparkTypeInference.scala>. In the above example, at compile time, if the first Array happens to have a String element type and the second Array happens to have an Integer element type, the UDF will communicate to the Spark analyzer that the output should be of type MapData<String, Integer> (i.e., based on what was seen in the input at compile time). The whole UDF becomes a Spark Expression <https://github.com/linkedin/transport/blob/master/transportable-udfs-spark/src/main/scala/com/linkedin/transport/spark/StdUdfWrapper.scala#L24> at the end of the day.
>>>
>>> Thanks,
>>> Walaa.
>>>
>>> On Sun, Feb 21, 2021 at 7:26 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>
>>>> I think I have made it clear that it's simpler for the UDF developers to deal with the input parameters directly, instead of getting them from a row, where they need to provide the index and type (e.g. row.getLong(0)). It's also coherent with the existing Spark Scala/Java UDF APIs, so Spark users will be more familiar with the individual-parameters API.
>>>>
>>>> And I have already explained that we can use reflection to make sure the defined methods have the right types at query-compilation time. That's better than leaving this problem to the UDF developers and asking them to read the inputs from the row correctly, with the right index and type.
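>>>>
>>>> As a rough illustration (my sketch, not proposed code; javaTypeOf is a made-up helper mapping Spark SQL types to Java classes), the check could be a reflective lookup when the function is bound:
>>>>
>>>>   Class<?>[] expected = new Class<?>[boundInputTypes.length];
>>>>   for (int i = 0; i < expected.length; i += 1) {
>>>>     expected[i] = javaTypeOf(boundInputTypes[i]);  // e.g. LongType -> long.class
>>>>   }
>>>>   try {
>>>>     udf.getClass().getMethod("produceResult", expected);
>>>>   } catch (NoSuchMethodException e) {
>>>>     // fail at analysis time with a clear message, not at runtime
>>>>     throw new IllegalArgumentException(
>>>>         "UDF must define produceResult" + java.util.Arrays.toString(expected), e);
>>>>   }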
>>>> If there are people from Presto/Transport here, it would be great if you could share how Presto/Transport does this check.
>>>>
>>>> I don't like 22 additional interfaces either, but if you look at the examples I gave, the current Spark Java UDF <https://github.com/apache/spark/tree/master/sql/core/src/main/java/org/apache/spark/sql/api/java> only has 9 interfaces, and Transport <https://github.com/linkedin/transport/tree/master/transportable-udfs-api/src/main/java/com/linkedin/transport/api/udf> has 8. I think that's good enough, and people can use VarargsScalarFunction if they need to take more parameters or varargs. It resolves your concern about reflection in the non-codegen execution path leading to bad performance, and it also serves a documentation purpose, as people can see the number of UDF inputs and their types at a glance.
>>>>
>>>> As I said, we need to investigate how to avoid boxing. Since you are asking now, I spent some time thinking about it. I think the DoubleAdd example is the way to go. For the non-codegen code path, we can just call the interface method. For the codegen code path, the generated Java code would look like this (omitting the null-check logic):
>>>>
>>>>   double input1 = ...;
>>>>   double input2 = ...;
>>>>   DoubleAdd udf = ...;
>>>>   double res = udf.call(input1, input2);
>>>>
>>>> which invokes the primitive version automatically. AFAIK this is also how Scala supports primitive type parameters (it generates an extra non-boxing version of the method). If the UDF doesn't have the primitive-version method, this code will just call the boxed version and still work.
>>>>
>>>> I don't like the SupportsInvoke approach, as it still promotes the row-parameter API. I think the individual-parameters API is better for UDF developers. Can other people share their opinions about the API?
>>>>
>>>> On Sat, Feb 20, 2021 at 5:32 AM Ryan Blue <rb...@netflix.com> wrote:
>>>>
>>>>> I don't see any benefit to the added complexity of 22 additional interfaces, instead of simply passing an InternalRow. Why not use a single interface with InternalRow? Maybe you could share your motivation?
>>>>>
>>>>> That would also result in strange duplication, where the ScalarFunction2 method is just the boxed version:
>>>>>
>>>>>   class DoubleAdd implements ScalarFunction2<Double, Double, Double> {
>>>>>     @Override
>>>>>     Double produceResult(Double left, Double right) {
>>>>>       return left + right;
>>>>>     }
>>>>>
>>>>>     double produceResult(double left, double right) {
>>>>>       return left + right;
>>>>>     }
>>>>>   }
>>>>>
>>>>> This would work okay, but would be awkward if you wanted to use the same implementation for any number of arguments, like a sum method that adds all of the arguments together and returns the result (see the sketch below). It also isn't great for varargs, since it is basically the same as the invoke case.
>>>>>
>>>>> The combination of an InternalRow method and the invoke method seems to me to be a good way to handle this. What is wrong with it? And how would you solve the problem when implementations define methods with the wrong types? The InternalRow approach helps implementations catch that problem (as demonstrated above) and also provides a fallback when there is a bug preventing the invoke optimization from working. That seems like a good approach to me.
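>>>>>
>>>>> Here is a minimal sketch of that sum case (my illustration, not part of the proposal; it assumes all inputs were bound as longs):
>>>>>
>>>>>   class SumAll implements ScalarFunction<Long> {
>>>>>     @Override
>>>>>     Long produceResult(InternalRow row) {
>>>>>       long sum = 0;
>>>>>       for (int i = 0; i < row.numFields(); i += 1) {
>>>>>         sum += row.getLong(i);  // one method body handles any arity
>>>>>       }
>>>>>       return sum;
>>>>>     }
>>>>>   }
>>>>>
>>>>> With per-arity interfaces, this would instead need a separate implementation for each ScalarFunctionN.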
>>>>> On Thu, Feb 18, 2021 at 11:31 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>
>>>>>> If people have such a big concern about reflection, we can follow the current Spark Java UDF <https://github.com/apache/spark/tree/master/sql/core/src/main/java/org/apache/spark/sql/api/java> and Transport <https://github.com/linkedin/transport/tree/master/transportable-udfs-api/src/main/java/com/linkedin/transport/api/udf>, and create ScalarFunction0[R], ScalarFunction1[T1, R], etc. to avoid reflection. But we may need to investigate how to avoid boxing with this API design.
>>>>>>
>>>>>> To put up a detailed proposal: let's have ScalarFunction0, ScalarFunction1, ..., ScalarFunction9 and VarargsScalarFunction. At execution time, if Spark sees ScalarFunction0-9, it passes the input columns to the UDF directly, one column per parameter. So a string-type input is UTF8String, and an array-type input is ArrayData. If Spark sees VarargsScalarFunction, it wraps the input columns in an InternalRow and passes that to the UDF.
>>>>>>
>>>>>> In general, if VarargsScalarFunction is implemented, the UDF should not implement ScalarFunction0-9. We could also define a priority order to allow both. I don't have a strong preference here.
>>>>>>
>>>>>> What do you think?
>>>>>>
>>>>>> On Fri, Feb 19, 2021 at 1:24 PM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>>>>>>
>>>>>>> I agree with Ryan on the questions around the expressivity of the invoke method. It is not clear to me how the invoke method can be used to declare UDFs with type-parameterized parameters. For example: a UDF to get the Nth element of an array (regardless of the array element type), or a UDF to merge two Arrays (of generic types) into a Map.
>>>>>>>
>>>>>>> Also, to address Wenchen's InternalRow question, can we create a number of Function classes, each corresponding to a number of input parameters (e.g., ScalarFunction1, ScalarFunction2, etc.)?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Walaa.
>>>>>>>
>>>>>>> On Thu, Feb 18, 2021 at 6:07 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>>>>
>>>>>>>> I agree with you that it is better in many cases to directly call a method. But it is not better in all cases, which is why I don't think it is the right general-purpose choice.
>>>>>>>>
>>>>>>>> First, if codegen isn't used for some reason, the reflection overhead is really significant. That gets much better when you have an interface to call. That's one reason I'd use this pattern:
>>>>>>>>
>>>>>>>>   class DoubleAdd implements ScalarFunction<Double>, SupportsInvoke {
>>>>>>>>     Double produceResult(InternalRow row) {
>>>>>>>>       return produceResult(row.getDouble(0), row.getDouble(1));
>>>>>>>>     }
>>>>>>>>
>>>>>>>>     double produceResult(double left, double right) {
>>>>>>>>       return left + right;
>>>>>>>>     }
>>>>>>>>   }
>>>>>>>>
>>>>>>>> There's little overhead to adding the InternalRow variation, but we could call it in eval to avoid the reflection overhead. To the point about UDF developers, I think this is a reasonable cost.
>>>>>>>>
>>>>>>>> Second, I think usability is better and helps avoid runtime issues.
>>>>>>>> Here's an example:
>>>>>>>>
>>>>>>>>   class StrLen implements ScalarFunction<Integer>, SupportsInvoke {
>>>>>>>>     Integer produceResult(InternalRow row) {
>>>>>>>>       return produceResult(row.getString(0));
>>>>>>>>     }
>>>>>>>>
>>>>>>>>     Integer produceResult(String str) {
>>>>>>>>       return str.length();
>>>>>>>>     }
>>>>>>>>   }
>>>>>>>>
>>>>>>>> See the bug? I forgot to use UTF8String instead of String. With the InternalRow method, I get a compiler warning because getString produces a UTF8String that can't be passed to produceResult(String). If I decided to implement length separately, then we could still run the InternalRow version and log a warning. The code would be slightly slower, but wouldn't fail.
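>>>>>>>>
>>>>>>>> For reference, a corrected version (my sketch; UTF8String.numChars() returns the length in characters rather than bytes) would be:
>>>>>>>>
>>>>>>>>   Integer produceResult(UTF8String str) {
>>>>>>>>     return str.numChars();
>>>>>>>>   }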
>>>>>>>> There are similar situations with varargs, where it's better to call methods that produce concrete types than to cast from Object to some expected type.
>>>>>>>>
>>>>>>>> I think that using invoke is a great extension to the proposal, but I don't think that it should be the only way to call functions. By all means, let's work on it in parallel and use it where possible. But I think we do need the fallback of using InternalRow, and it isn't a usability problem to include it.
>>>>>>>>
>>>>>>>> Oh, and one last thought: we already have users that call Dataset.map and use InternalRow. This would allow converting that code directly to a UDF.
>>>>>>>>
>>>>>>>> I think we're closer to agreeing here than it actually looks. Hopefully you'll agree that having the InternalRow method isn't a big usability problem.
>>>>>>>>
>>>>>>>> On Wed, Feb 17, 2021 at 11:51 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I don't see any objections to the rest of the proposal (loading functions from the catalog, the function binding stuff, etc.), and I assume everyone is OK with it. We can commit that part first.
>>>>>>>>>
>>>>>>>>> Currently, the discussion focuses on the `ScalarFunction` API, where I think it's better to take the input columns directly as the UDF parameters, instead of wrapping the input columns in an InternalRow and taking the InternalRow as the UDF parameter. It's not only for better performance, but also for ease of use. For example, it's easier for the UDF developer to write `input1 + input2` than `inputRow.getLong(0) + inputRow.getLong(1)`, as they don't need to specify the type and index themselves (getLong(0)), which is error-prone.
>>>>>>>>>
>>>>>>>>> It does push more work to the Spark side, but I think it's worth it if implementing UDFs gets easier. I don't think the work is very challenging, as we can leverage the infra we built for the expression encoder.
>>>>>>>>>
>>>>>>>>> I think it's also important to look at the UDF API from the user's perspective (UDF developers). How do you like the UDF API without considering how Spark can support it? Do you prefer the individual-parameters version or the row-parameter version?
>>>>>>>>>
>>>>>>>>> To move forward, how about we implement the function loading and binding first? Then we can have PRs for both the individual-parameters (I can take it) and row-parameter approaches, if we still can't reach a consensus at that time and need to see all the details.
>>>>>>>>>
>>>>>>>>> On Thu, Feb 18, 2021 at 4:48 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks, Hyukjin. I think that's a fair summary. And I agree with the idea that we should focus on what Spark will do by default.
>>>>>>>>>>
>>>>>>>>>> I think we should focus on the proposal, for two reasons: first, there is a straightforward path to incorporate Wenchen's suggestion via `SupportsInvoke`, and second, the proposal is more complete: it defines a solution for many concerns, like loading a function and finding out what types to use -- not just how to call code -- and supports more use cases, like varargs functions. I think we can continue to discuss the rest of the proposal and be confident that we can support an invoke code path where it makes sense.
>>>>>>>>>>
>>>>>>>>>> Does everyone agree? If not, I think we would need to solve a lot of the challenges that I initially brought up with the invoke idea. It seems like a good way to call a function, but it needs a real proposal behind it if we don't use it via `SupportsInvoke` in the current proposal.
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 16, 2021 at 11:07 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Just to make sure we don't move past it, I think we haven't decided yet:
>>>>>>>>>>>
>>>>>>>>>>> - whether we'll replace the current proposal with Wenchen's approach as the default
>>>>>>>>>>> - whether we want Wenchen's approach as an optional mix-in on top of Ryan's proposal (SupportsInvoke)
>>>>>>>>>>>
>>>>>>>>>>> From what I read, some people pointed to it as a replacement. Please correct me if I misread this discussion thread. As Dongjoon pointed out, it would be good to know a rough ETA, to make sure we're making progress here and so people can compare more easily.
>>>>>>>>>>>
>>>>>>>>>>> FWIW, there's a saying I like in the Zen of Python <https://www.python.org/dev/peps/pep-0020/>:
>>>>>>>>>>>
>>>>>>>>>>> There should be one-- and preferably only one --obvious way to do it.
>>>>>>>>>>>
>>>>>>>>>>> If multiple approaches give developers ways to do (almost) the same thing, I would prefer to avoid that.
>>>>>>>>>>>
>>>>>>>>>>> In addition, I would prefer to focus on what Spark does by default first.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Feb 17, 2021 at 2:33 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi, Wenchen.
>>>>>>>>>>>>
>>>>>>>>>>>> This thread seems to be getting enough attention. I'm also expecting more and more attention once we have this on the `master` branch, because we develop together.
>>>>>>>>>>>>
>>>>>>>>>>>> > Spark SQL has many active contributors/committers and this thread doesn't get much attention yet.
>>>>>>>>>>>>
>>>>>>>>>>>> So, what's your ETA from now?
>>>>>>>>>>>>
>>>>>>>>>>>> > I think the problem here is we were discussing some very detailed things without actual code.
>>>>>>>>>>>> > I'll implement my idea after the holiday and then we can have more effective discussions.
>>>>>>>>>>>> > We can also do benchmarks and get some real numbers.
>>>>>>>>>>>> > In the meantime, we can continue to discuss other parts of this proposal, and make a prototype if possible.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm looking forward to seeing your PR. I hope we can conclude this thread and have at least one implementation on the `master` branch this month (February). If you need more time (one month or longer), why don't we put Ryan's suggestion on the `master` branch first and benchmark it against your PR later, during the Apache Spark 3.2 timeframe?
>>>>>>>>>>>>
>>>>>>>>>>>> Bests,
>>>>>>>>>>>> Dongjoon.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Feb 16, 2021 at 9:26 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Andrew,
>>>>>>>>>>>>>
>>>>>>>>>>>>> The proposal already includes an API for aggregate functions, and I think we would want to implement those right away.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Processing ColumnarBatch is something we can easily extend the interfaces to support, similar to Wenchen's suggestion. The important thing right now is to agree on the basic functionality: how to look up functions and what the simple API should be. Like the TableCatalog interfaces, we will layer on more support through optional interfaces like `SupportsInvoke` or `SupportsColumnBatch`.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Feb 16, 2021 at 9:00 AM Andrew Melo <andrew.m...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello Ryan,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This proposal looks very interesting. Would future goals for this functionality include both support for aggregation functions and support for processing ColumnarBatches (instead of Row/InternalRow)?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Andrew
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Feb 15, 2021 at 12:44 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Thanks for the positive feedback, everyone. It sounds like there is a clear path forward for calling functions. Even without a prototype, the `invoke` plans show that Wenchen's suggested optimization can be done, and incorporating it as an optional extension to this proposal solves many of the unknowns.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > With that area now understood, is there any discussion about other parts of the proposal, besides the function call interface?
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > On Fri, Feb 12, 2021 at 10:40 PM Chao Sun <sunc...@apache.org> wrote:
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> This is an important feature that can unblock several other projects, including bucketed join support for DataSource v2, complete support for enforcing DataSource v2 distribution requirements on the write path, etc.
>>>>>>>>>>>>>> >> I like Ryan's proposals, which look simple and elegant, with nice support for function overloading and variadic arguments. On the other hand, I think Wenchen made a very good point about performance. Overall, I'm excited to see active discussion on this topic and believe the community will come to a proposal with the best of both sides.
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> Chao
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> On Fri, Feb 12, 2021 at 7:58 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>> >>> +1 for Liang-Chi's.
>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>> >>> Thanks Ryan and Wenchen for leading this.
>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>> >>> On Sat, Feb 13, 2021 at 12:18 PM Liang-Chi Hsieh <vii...@gmail.com> wrote:
>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>> >>>> Basically the proposal makes sense to me, and I'd like to support the SPIP, as it looks like we have a strong need for this important feature.
>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>> >>>> Thanks Ryan for working on this, and I do also look forward to Wenchen's implementation. Thanks for the discussion too.
>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>> >>>> Actually, the SupportsInvoke proposed by Ryan looks like a good alternative to me. Besides Wenchen's alternative implementation, is there a chance we could also have SupportsInvoke for comparison?
>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>> >>>> John Zhuge wrote
>>>>>>>>>>>>>> >>>> > Excited to see our Spark community rallying behind this important feature!
>>>>>>>>>>>>>> >>>> >
>>>>>>>>>>>>>> >>>> > The proposal lays a solid foundation of a minimal feature set, with careful consideration for future optimizations and extensions. Can't wait to see it lead to more advanced functionality like views with shared custom functions, function pushdown, lambdas, etc. It has already borne fruit from the constructive collaboration in this thread. Looking forward to Wenchen's prototype and further discussions, including the SupportsInvoke extension proposed by Ryan.
>>>>>>>>>>>>>> >>>> >
>>>>>>>>>>>>>> >>>> > On Fri, Feb 12, 2021 at 4:35 PM Owen O'Malley <owen.omalley@...> wrote:
>>>>>>>>>>>>>> >>>> >
>>>>>>>>>>>>>> >>>> >> I think this proposal is a very good thing, giving Spark a standard way of getting to and calling UDFs.
>>>>>>>>>>>>>> >>>> >>
>>>>>>>>>>>>>> >>>> >> I like having ScalarFunction as the API to call the UDFs. It is simple, yet covers all of the polymorphic type cases well.
>>>>>>>>>>>>>> >>>> >> I think it would also simplify using the functions in other contexts, like pushing down filters into the ORC & Parquet readers, although there are a lot of details that would need to be considered there.
>>>>>>>>>>>>>> >>>> >>
>>>>>>>>>>>>>> >>>> >> .. Owen
>>>>>>>>>>>>>> >>>> >>
>>>>>>>>>>>>>> >>>> >> On Fri, Feb 12, 2021 at 11:07 PM Erik Krogen <ekrogen@...> wrote:
>>>>>>>>>>>>>> >>>> >>
>>>>>>>>>>>>>> >>>> >>> I agree that there is a strong need for a FunctionCatalog within Spark to provide support for shareable UDFs, as well as movement towards more advanced functionality like views, which themselves depend on UDFs, so I support this SPIP wholeheartedly.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> I find both of the proposed UDF APIs sufficiently user-friendly and extensible. I generally think Wenchen's proposal is easier for a user to work with in the common case, but it has greater potential for confusing and hard-to-debug behavior due to its use of reflective method-signature searches. The merits of both sides can hopefully be examined more properly with code, so I look forward to seeing an implementation of Wenchen's ideas to provide a more concrete comparison. I am optimistic that we will not let the debate over this point unreasonably stall the SPIP from making progress.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> Thank you to both Wenchen and Ryan for your detailed consideration and evaluation of these ideas!
>>>>>>>>>>>>>> >>>> >>> ------------------------------
>>>>>>>>>>>>>> >>>> >>> *From:* Dongjoon Hyun <dongjoon.hyun@...>
>>>>>>>>>>>>>> >>>> >>> *Sent:* Wednesday, February 10, 2021 6:06 PM
>>>>>>>>>>>>>> >>>> >>> *To:* Ryan Blue <blue@...>
>>>>>>>>>>>>>> >>>> >>> *Cc:* Holden Karau <holden@...>; Hyukjin Kwon <gurwls223@...>; Spark Dev List <dev@...>; Wenchen Fan <cloud0fan@...>
>>>>>>>>>>>>>> >>>> >>> *Subject:* Re: [DISCUSS] SPIP: FunctionCatalog
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> BTW, I forgot to state my opinion explicitly in this thread, because I was on the PR before this thread.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> 1. The `FunctionCatalog API` PR was made on May 9, 2019 and has been there for almost two years.
>>>>>>>>>>>>>> >>>> >>> 2. I already gave my +1 on that PR last Saturday because I agreed with the latest updated design docs and the as-is PR.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> And the rest of the progress in this thread is also very satisfying to me (e.g. Ryan's extension suggestion and Wenchen's alternative).
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> To All:
>>>>>>>>>>>>>> >>>> >>> Please take a look at the design doc and the PR, and give us your opinions. We really need your participation in order to make DSv2 more complete. This will unblock other DSv2 features, too.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> Bests,
>>>>>>>>>>>>>> >>>> >>> Dongjoon.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> On Wed, Feb 10, 2021 at 10:58 AM Dongjoon Hyun <dongjoon.hyun@...> wrote:
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> Hi, Ryan.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> We didn't move past anything (either yours or Wenchen's). What Wenchen suggested is double-checking the alternatives with an implementation, to give more momentum to our discussion.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> Your new suggestion about an optional extension also sounds like a reasonable new alternative to me.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> We are still discussing this topic together, and I hope we can reach a conclusion this time (for Apache Spark 3.2) without getting stuck like last time.
>>>>>>>>>>>>>> >>>> >>> I really appreciate your leadership in this discussion, and the direction the discussion is moving in looks constructive to me. Let's give the alternatives some time.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> Bests,
>>>>>>>>>>>>>> >>>> >>> Dongjoon.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> On Wed, Feb 10, 2021 at 10:14 AM Ryan Blue <blue@...> wrote:
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> I don't think we should move past the drawbacks of this approach so quickly. The problems are significant enough that using invoke is not sufficient on its own. But I think we can add it as an optional extension to shore up the weaknesses.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> Here's a summary of the drawbacks:
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> - Magic function signatures are error-prone
>>>>>>>>>>>>>> >>>> >>> - Spark would need considerable code to help users find what went wrong
>>>>>>>>>>>>>> >>>> >>> - Spark would likely need to coerce arguments (e.g., String, Option[Int]) for usability
>>>>>>>>>>>>>> >>>> >>> - It is unclear how Spark will find the Java Method to call
>>>>>>>>>>>>>> >>>> >>> - Use cases that require varargs fall back to casting; users will also get this wrong (cast to String instead of UTF8String)
>>>>>>>>>>>>>> >>>> >>> - The non-codegen path is significantly slower
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> The benefit of invoke is to avoid moving data into a row, like this:
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>>   -- using invoke
>>>>>>>>>>>>>> >>>> >>>   int result = udfFunction(x, y)
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>>   -- using row
>>>>>>>>>>>>>> >>>> >>>   udfRow.update(0, x); -- actual: values[0] = x;
>>>>>>>>>>>>>> >>>> >>>   udfRow.update(1, y);
>>>>>>>>>>>>>> >>>> >>>   int result = udfFunction(udfRow);
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> And, again, that won't actually help much in cases that require varargs.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> I suggest we add a new marker trait for BoundMethod called SupportsInvoke. If that interface is implemented, then Spark will look for a method that matches the expected signature based on the bound input type. If it isn't found, Spark can print a warning and fall back to the InternalRow call: "Cannot find udfFunction(int, int)".
>>>>>>>>>>>>>> >>>> >>> This approach allows the invoke optimization, but solves many of the problems:
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> - The method to invoke is found using the proposed load and bind approach
>>>>>>>>>>>>>> >>>> >>> - Magic function signatures are optional and do not cause runtime failures
>>>>>>>>>>>>>> >>>> >>> - Because this is an optional optimization, Spark can be more strict about types
>>>>>>>>>>>>>> >>>> >>> - Varargs cases can still use rows
>>>>>>>>>>>>>> >>>> >>> - Non-codegen can use an evaluation method rather than falling back to slow Java reflection
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> This seems like a good extension to me; it provides a plan for optimizing the UDF call to avoid building a row, while the existing proposal covers the other cases well and addresses how to locate these function calls.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> This also highlights that the approach used in DSv2 and this proposal is working: start small and use extensions to layer on more complex support.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> On Wed, Feb 10, 2021 at 9:04 AM Dongjoon Hyun <dongjoon.hyun@...> wrote:
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> Thank you all for making a giant move forward for Apache Spark 3.2.0. I'm really looking forward to seeing Wenchen's implementation. That would be greatly helpful for making a decision!
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> > I'll implement my idea after the holiday and then we can have more effective discussions. We can also do benchmarks and get some real numbers.
>>>>>>>>>>>>>> >>>> >>> > FYI: the Presto UDF API <https://prestodb.io/docs/current/develop/functions.html> also takes individual parameters instead of the row parameter. I think this direction is at least worth a try so that we can see the performance difference. It's also mentioned in the design doc as an alternative (Trino).
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> Bests,
>>>>>>>>>>>>>> >>>> >>> Dongjoon.
>>>>>>>>>>>>>> >>>> >>> On Tue, Feb 9, 2021 at 10:18 PM Wenchen Fan <cloud0fan@...> wrote:
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> FYI: the Presto UDF API <https://prestodb.io/docs/current/develop/functions.html> also takes individual parameters instead of the row parameter. I think this direction is at least worth a try so that we can see the performance difference. It's also mentioned in the design doc as an alternative (Trino).
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> On Wed, Feb 10, 2021 at 10:18 AM Wenchen Fan <cloud0fan@...> wrote:
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> Hi Holden,
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> As Hyukjin said, following existing designs is not the principle of DS v2 API design. We should make sure the DS v2 API makes sense. AFAIK we didn't fully follow the catalog API design from Hive, and I believe Ryan also agrees with that.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> I think the problem here is that we were discussing some very detailed things without actual code. I'll implement my idea after the holiday and then we can have more effective discussions. We can also do benchmarks and get some real numbers.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> In the meantime, we can continue to discuss other parts of this proposal, and make a prototype if possible. Spark SQL has many active contributors/committers, and this thread hasn't gotten much attention yet.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> On Wed, Feb 10, 2021 at 6:17 AM Hyukjin Kwon <gurwls223@...> wrote:
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> Just dropping a few lines. I remember that one of the goals of DSv2 is to correct the mistakes we made in the current Spark code. There would not be much point if we just follow and mimic what Spark currently does. It might just end up with another copy of the Spark APIs, e.g. the (internal) Expression APIs.
>>>>>>>>>>>>>> >>>> >>> I sincerely would like to avoid this. I do believe we have been stuck mainly due to trying to come up with a better design. We already have an ugly picture of the current Spark APIs from which to draw a better, bigger picture.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> On Wed, Feb 10, 2021 at 3:28 AM Holden Karau <holden@...> wrote:
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> I think this proposal is a good set of trade-offs and has existed in the community for a long period of time. I especially appreciate how the design is focused on a minimal useful component, with future optimizations considered from the point of view of making sure it's flexible, but actual concrete decisions left for the future, once we see how this API is used. I think if we try to optimize everything right out of the gate, we'll quickly get stuck (again) and not make any progress.
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> On Mon, Feb 8, 2021 at 10:46 AM Ryan Blue <blue@...> wrote:
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> Hi everyone,
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> I'd like to start a discussion about adding a FunctionCatalog interface to catalog plugins. This will allow catalogs to expose functions to Spark, similar to how the TableCatalog interface allows a catalog to expose tables.
>>>>>>>>>>>>>> >>>> >>> The proposal doc is available here: https://docs.google.com/document/d/1PLBieHIlxZjmoUB0ERF-VozCRJ0xw2j3qKvUNWpWA2U/edit
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> Here's a high-level summary of some of the main design choices:
>>>>>>>>>>>>>> >>>> >>> * Adds the ability to list and load functions, not to create or modify them in an external catalog
>>>>>>>>>>>>>> >>>> >>> * Supports scalar, aggregate, and partial aggregate functions
>>>>>>>>>>>>>> >>>> >>> * Uses load and bind steps for better error messages and simpler implementations
>>>>>>>>>>>>>> >>>> >>> * Like the DSv2 table read and write APIs, it uses InternalRow to pass data
>>>>>>>>>>>>>> >>>> >>> * Can be extended using mix-in interfaces to add vectorization, codegen, and other future features
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> There is also a PR with the proposed API: https://github.com/apache/spark/pull/24559/files
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> Let's discuss the proposal here rather than on that PR, to get better visibility. Also, please take the time to read the proposal first. That really helps clear up misconceptions.
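>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>> To give a rough idea of the shape (my paraphrase of the proposal doc; see the PR for the exact interfaces):
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>>   public interface FunctionCatalog extends CatalogPlugin {
>>>>>>>>>>>>>> >>>> >>>     Identifier[] listFunctions(String[] namespace);
>>>>>>>>>>>>>> >>>> >>>     UnboundFunction loadFunction(Identifier ident);
>>>>>>>>>>>>>> >>>> >>>   }
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>>   public interface UnboundFunction {
>>>>>>>>>>>>>> >>>> >>>     // the bind step resolves input types and can fail with a clear error
>>>>>>>>>>>>>> >>>> >>>     BoundFunction bind(StructType inputType);
>>>>>>>>>>>>>> >>>> >>>   }
>>>>>>>>>>>>>> >>>> >>>
>>>>>>>>>>>>>> >>>> >>>   public interface ScalarFunction<R> extends BoundFunction {
>>>>>>>>>>>>>> >>>> >>>     // the arguments are passed as an InternalRow
>>>>>>>>>>>>>> >>>> >>>     R produceResult(InternalRow input);
>>>>>>>>>>>>>> >>>> >>>   }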
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix