Hi all,

Thank you for voting and for the suggestions.
As Wenchen mentioned, and as we are also discussing on JIRA, we still need to settle the size hint for the 0-parameter UDF. But I believe we have reached a consensus on the basic APIs apart from the size hint, so I'd like to submit a PR based on the current proposal and continue the discussion in its review.

https://github.com/apache/spark/pull/19147

I'll keep this vote open to wait for more opinions.

Thanks.


On Wed, Sep 6, 2017 at 9:48 AM, Wenchen Fan <cloud0...@gmail.com> wrote:

> +1 on the design and proposed API.
>
> One detail I'd like to discuss is the 0-parameter UDF, how we can specify
> the size hint. This can be done in the PR review though.
>
> On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung <felixcheun...@hotmail.com>
> wrote:
>
>> +1 on this and like the suggestion of type in string form.
>>
>> Would it be correct to assume there will be data type check, for example
>> the returned pandas data frame column data types match what are specified.
>> We have seen quite a bit of issues/confusions with that in R.
>>
>> Would it make sense to have a more generic decorator name so that it
>> could also be useable for other efficient vectorized format in the future?
>> Or do we anticipate the decorator to be format specific and will have more
>> in the future?
>>
>> ------------------------------
>> *From:* Reynold Xin <r...@databricks.com>
>> *Sent:* Friday, September 1, 2017 5:16:11 AM
>> *To:* Takuya UESHIN
>> *Cc:* spark-dev
>> *Subject:* Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python
>>
>> Ok, thanks.
>>
>> +1 on the SPIP for scope etc
>>
>>
>> On API details (will deal with in code reviews as well but leaving a note
>> here in case I forget)
>>
>> 1. I would suggest having the API also accept data type specification in
>> string form. It is usually simpler to say "long" than "LongType()".
>>
>> 2. Think about what error message to show when the rows numbers don't
>> match at runtime.
>>
>>
>> On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN <ues...@happy-camper.st>
>> wrote:
>>
>>> Yes, the aggregation is out of scope for now.
>>> I think we should continue discussing the aggregation at JIRA and we
>>> will be adding those later separately.
>>>
>>> Thanks.
>>>
>>>
>>> On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin <r...@databricks.com> wrote:
>>>
>>>> Is the idea aggregate is out of scope for the current effort and we
>>>> will be adding those later?
>>>>
>>>> On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN <ues...@happy-camper.st>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> We've been discussing to support vectorized UDFs in Python and we
>>>>> almost got a consensus about the APIs, so I'd like to summarize and
>>>>> call for a vote.
>>>>>
>>>>> Note that this vote should focus on APIs for vectorized UDFs, not APIs
>>>>> for vectorized UDAFs or Window operations.
>>>>>
>>>>> https://issues.apache.org/jira/browse/SPARK-21190
>>>>>
>>>>>
>>>>> *Proposed API*
>>>>>
>>>>> We introduce a @pandas_udf decorator (or annotation) to define
>>>>> vectorized UDFs which takes one or more pandas.Series or one integer
>>>>> value meaning the length of the input value for 0-parameter UDFs. The
>>>>> return value should be pandas.Series of the specified type and the
>>>>> length of the returned value should be the same as input value.
>>>>>
>>>>> We can define vectorized UDFs as:
>>>>>
>>>>>   @pandas_udf(DoubleType())
>>>>>   def plus(v1, v2):
>>>>>       return v1 + v2
>>>>>
>>>>> or we can define as:
>>>>>
>>>>>   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
>>>>>
>>>>> We can use it similar to row-by-row UDFs:
>>>>>
>>>>>   df.withColumn('sum', plus(df.v1, df.v2))
>>>>>
>>>>> As for 0-parameter UDFs, we can define and use as:
>>>>>
>>>>>   @pandas_udf(LongType())
>>>>>   def f0(size):
>>>>>       return pd.Series(1).repeat(size)
>>>>>
>>>>>   df.select(f0())
>>>>>
>>>>>
>>>>>
>>>>> The vote will be up for the next 72 hours.
>>>>> Please reply with your vote:
>>>>>
>>>>> +1: Yeah, let's go forward and implement the SPIP.
>>>>> +0: Don't really care.
>>>>> -1: I don't think this is a good idea because of the following technical
>>>>> reasons.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> --
>>>>> Takuya UESHIN
>>>>> Tokyo, Japan
>>>>>
>>>>> http://twitter.com/ueshin
>>>>>
>>>>
>>>
>>>
>>> --
>>> Takuya UESHIN
>>> Tokyo, Japan
>>>
>>> http://twitter.com/ueshin
>>>
>>
>

--
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin
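
For readers following the thread, here is a minimal sketch of the length contract described in the proposal (the returned value must have as many rows as the input, or as the size hint for a 0-parameter UDF) and of the kind of runtime error raised in point 2 of the API notes. It is only an illustration under stated assumptions: `call_vectorized` is a hypothetical helper, not a Spark API and not taken from the PR, and plain Python lists stand in for pandas.Series so that the sketch runs without Spark or pandas. Anything that supports `len()` would be checked the same way.

```python
def call_vectorized(func, *columns, size=None):
    """Hypothetical helper (not a Spark API): call func and check row counts.

    With one or more input columns, func receives the columns.
    With no input columns, func receives the integer size hint instead.
    """
    if columns:
        expected = len(columns[0])
        # All input columns of one batch must have the same number of rows.
        if any(len(col) != expected for col in columns):
            raise ValueError("input columns have different lengths")
        result = func(*columns)
    else:
        if size is None:
            raise ValueError("a 0-parameter UDF needs a size hint")
        expected = size
        result = func(size)

    # The contract from the proposal: output length equals input length.
    if len(result) != expected:
        raise ValueError(
            "vectorized UDF returned %d rows, expected %d"
            % (len(result), expected)
        )
    return result


# List-based stand-ins for the `plus` and `f0` examples in the proposal.
def plus(v1, v2):
    return [a + b for a, b in zip(v1, v2)]


def f0(size):
    return [1] * size


sums = call_vectorized(plus, [1.0, 2.0], [3.0, 4.0])
ones = call_vectorized(f0, size=3)
```

Whether the size hint reaches the function as a positional integer, as in this sketch, is exactly the detail still open for discussion in the PR review.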