+1 On Mon, Sep 11, 2017 at 5:47 PM, Sameer Agarwal <sam...@databricks.com> wrote:
> +1 (non-binding) > > On Thu, Sep 7, 2017 at 9:10 PM, Bryan Cutler <cutl...@gmail.com> wrote: > >> +1 (non-binding) for the goals and non-goals of this SPIP. I think it's >> fine to work out the minor details of the API during review. >> >> Bryan >> >> On Wed, Sep 6, 2017 at 5:17 AM, Takuya UESHIN <ues...@happy-camper.st> >> wrote: >> >>> Hi all, >>> >>> Thank you for voting and suggestions. >>> >>> As Wenchen mentioned and also we're discussing at JIRA, we need to >>> discuss the size hint for the 0-parameter UDF. >>> But I believe we got a consensus about the basic APIs except for the >>> size hint, I'd like to submit a pr based on the current proposal and >>> continue discussing in its review. >>> >>> https://github.com/apache/spark/pull/19147 >>> >>> I'd keep this vote open to wait for more opinions. >>> >>> Thanks. >>> >>> >>> On Wed, Sep 6, 2017 at 9:48 AM, Wenchen Fan <cloud0...@gmail.com> wrote: >>> >>>> +1 on the design and proposed API. >>>> >>>> One detail I'd like to discuss is the 0-parameter UDF, how we can >>>> specify the size hint. This can be done in the PR review though. >>>> >>>> On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung <felixcheun...@hotmail.com >>>> > wrote: >>>> >>>>> +1 on this and like the suggestion of type in string form. >>>>> >>>>> Would it be correct to assume there will be data type check, for >>>>> example the returned pandas data frame column data types match what are >>>>> specified. We have seen quite a bit of issues/confusions with that in R. >>>>> >>>>> Would it make sense to have a more generic decorator name so that it >>>>> could also be useable for other efficient vectorized format in the future? >>>>> Or do we anticipate the decorator to be format specific and will have more >>>>> in the future? >>>>> >>>>> ------------------------------ >>>>> *From:* Reynold Xin <r...@databricks.com> >>>>> *Sent:* Friday, September 1, 2017 5:16:11 AM >>>>> *To:* Takuya UESHIN >>>>> *Cc:* spark-dev >>>>> *Subject:* Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python >>>>> >>>>> Ok, thanks. >>>>> >>>>> +1 on the SPIP for scope etc >>>>> >>>>> >>>>> On API details (will deal with in code reviews as well but leaving a >>>>> note here in case I forget) >>>>> >>>>> 1. I would suggest having the API also accept data type specification >>>>> in string form. It is usually simpler to say "long" then "LongType()". >>>>> >>>>> 2. Think about what error message to show when the rows numbers don't >>>>> match at runtime. >>>>> >>>>> >>>>> On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN <ues...@happy-camper.st> >>>>> wrote: >>>>> >>>>>> Yes, the aggregation is out of scope for now. >>>>>> I think we should continue discussing the aggregation at JIRA and we >>>>>> will be adding those later separately. >>>>>> >>>>>> Thanks. >>>>>> >>>>>> >>>>>> On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin <r...@databricks.com> >>>>>> wrote: >>>>>> >>>>>>> Is the idea aggregate is out of scope for the current effort and we >>>>>>> will be adding those later? >>>>>>> >>>>>>> On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN <ues...@happy-camper.st> >>>>>>> wrote: >>>>>>> >>>>>>>> Hi all, >>>>>>>> >>>>>>>> We've been discussing to support vectorized UDFs in Python and we >>>>>>>> almost got a consensus about the APIs, so I'd like to summarize >>>>>>>> and call for a vote. >>>>>>>> >>>>>>>> Note that this vote should focus on APIs for vectorized UDFs, not >>>>>>>> APIs for vectorized UDAFs or Window operations. >>>>>>>> >>>>>>>> https://issues.apache.org/jira/browse/SPARK-21190 >>>>>>>> >>>>>>>> >>>>>>>> *Proposed API* >>>>>>>> >>>>>>>> We introduce a @pandas_udf decorator (or annotation) to define >>>>>>>> vectorized UDFs which takes one or more pandas.Series or one >>>>>>>> integer value meaning the length of the input value for 0-parameter >>>>>>>> UDFs. >>>>>>>> The return value should be pandas.Series of the specified type and >>>>>>>> the length of the returned value should be the same as input value. >>>>>>>> >>>>>>>> We can define vectorized UDFs as: >>>>>>>> >>>>>>>> @pandas_udf(DoubleType()) >>>>>>>> def plus(v1, v2): >>>>>>>> return v1 + v2 >>>>>>>> >>>>>>>> or we can define as: >>>>>>>> >>>>>>>> plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType()) >>>>>>>> >>>>>>>> We can use it similar to row-by-row UDFs: >>>>>>>> >>>>>>>> df.withColumn('sum', plus(df.v1, df.v2)) >>>>>>>> >>>>>>>> As for 0-parameter UDFs, we can define and use as: >>>>>>>> >>>>>>>> @pandas_udf(LongType()) >>>>>>>> def f0(size): >>>>>>>> return pd.Series(1).repeat(size) >>>>>>>> >>>>>>>> df.select(f0()) >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> The vote will be up for the next 72 hours. Please reply with your >>>>>>>> vote: >>>>>>>> >>>>>>>> +1: Yeah, let's go forward and implement the SPIP. >>>>>>>> +0: Don't really care. >>>>>>>> -1: I don't think this is a good idea because of the following >>>>>>>> technical >>>>>>>> reasons. >>>>>>>> >>>>>>>> Thanks! >>>>>>>> >>>>>>>> -- >>>>>>>> Takuya UESHIN >>>>>>>> Tokyo, Japan >>>>>>>> >>>>>>>> http://twitter.com/ueshin >>>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Takuya UESHIN >>>>>> Tokyo, Japan >>>>>> >>>>>> http://twitter.com/ueshin >>>>>> >>>>> >>>> >>> >>> >>> -- >>> Takuya UESHIN >>> Tokyo, Japan >>> >>> http://twitter.com/ueshin >>> >> >> > > > -- > Sameer Agarwal > Software Engineer | Databricks Inc. > http://cs.berkeley.edu/~sameerag >