+1 (non-binding)
2017-09-12 9:52 GMT+09:00 Yin Huai <yh...@databricks.com>:

> +1
>
> On Mon, Sep 11, 2017 at 5:47 PM, Sameer Agarwal <sam...@databricks.com>
> wrote:
>
>> +1 (non-binding)
>>
>> On Thu, Sep 7, 2017 at 9:10 PM, Bryan Cutler <cutl...@gmail.com> wrote:
>>
>>> +1 (non-binding) for the goals and non-goals of this SPIP. I think it's
>>> fine to work out the minor details of the API during review.
>>>
>>> Bryan
>>>
>>> On Wed, Sep 6, 2017 at 5:17 AM, Takuya UESHIN <ues...@happy-camper.st>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Thank you for your votes and suggestions.
>>>>
>>>> As Wenchen mentioned, and as we're also discussing on JIRA, we need to
>>>> discuss the size hint for the 0-parameter UDF.
>>>> But I believe we have a consensus about the basic APIs except for the
>>>> size hint, so I'd like to submit a PR based on the current proposal and
>>>> continue the discussion in its review.
>>>>
>>>> https://github.com/apache/spark/pull/19147
>>>>
>>>> I'll keep this vote open to wait for more opinions.
>>>>
>>>> Thanks.
>>>>
>>>>
>>>> On Wed, Sep 6, 2017 at 9:48 AM, Wenchen Fan <cloud0...@gmail.com>
>>>> wrote:
>>>>
>>>>> +1 on the design and proposed API.
>>>>>
>>>>> One detail I'd like to discuss is the 0-parameter UDF and how we can
>>>>> specify the size hint. This can be done in the PR review though.
>>>>>
>>>>> On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung <
>>>>> felixcheun...@hotmail.com> wrote:
>>>>>
>>>>>> +1 on this, and I like the suggestion of type in string form.
>>>>>>
>>>>>> Would it be correct to assume there will be a data type check, for
>>>>>> example that the returned pandas data frame column data types match
>>>>>> what is specified? We have seen quite a few issues and confusion
>>>>>> with that in R.
>>>>>>
>>>>>> Would it make sense to have a more generic decorator name so that it
>>>>>> could also be usable for other efficient vectorized formats in the
>>>>>> future?
>>>>>> Or do we anticipate the decorator to be format specific, with more
>>>>>> of them to come in the future?
>>>>>>
>>>>>> ------------------------------
>>>>>> *From:* Reynold Xin <r...@databricks.com>
>>>>>> *Sent:* Friday, September 1, 2017 5:16:11 AM
>>>>>> *To:* Takuya UESHIN
>>>>>> *Cc:* spark-dev
>>>>>> *Subject:* Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python
>>>>>>
>>>>>> Ok, thanks.
>>>>>>
>>>>>> +1 on the SPIP for scope etc.
>>>>>>
>>>>>>
>>>>>> On API details (will deal with these in code reviews as well, but
>>>>>> leaving a note here in case I forget):
>>>>>>
>>>>>> 1. I would suggest having the API also accept data type specification
>>>>>> in string form. It is usually simpler to say "long" than "LongType()".
>>>>>>
>>>>>> 2. Think about what error message to show when the row counts don't
>>>>>> match at runtime.
>>>>>>
>>>>>>
>>>>>> On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN <ues...@happy-camper.st>
>>>>>> wrote:
>>>>>>
>>>>>>> Yes, aggregation is out of scope for now.
>>>>>>> I think we should continue discussing aggregation on JIRA, and we
>>>>>>> will add it later, separately.
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin <r...@databricks.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Is the idea that aggregation is out of scope for the current effort
>>>>>>>> and we will be adding it later?
>>>>>>>>
>>>>>>>> On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN <
>>>>>>>> ues...@happy-camper.st> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> We've been discussing support for vectorized UDFs in Python and we
>>>>>>>>> have almost reached a consensus about the APIs, so I'd like to
>>>>>>>>> summarize and call for a vote.
>>>>>>>>>
>>>>>>>>> Note that this vote should focus on APIs for vectorized UDFs, not
>>>>>>>>> APIs for vectorized UDAFs or Window operations.
>>>>>>>>>
>>>>>>>>> https://issues.apache.org/jira/browse/SPARK-21190
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *Proposed API*
>>>>>>>>>
>>>>>>>>> We introduce a @pandas_udf decorator (or annotation) to define
>>>>>>>>> vectorized UDFs. A vectorized UDF takes one or more pandas.Series
>>>>>>>>> or, for 0-parameter UDFs, one integer value giving the length of
>>>>>>>>> the input.
>>>>>>>>> The return value should be a pandas.Series of the specified type,
>>>>>>>>> and its length should be the same as the length of the input.
>>>>>>>>>
>>>>>>>>> We can define vectorized UDFs as:
>>>>>>>>>
>>>>>>>>>   @pandas_udf(DoubleType())
>>>>>>>>>   def plus(v1, v2):
>>>>>>>>>       return v1 + v2
>>>>>>>>>
>>>>>>>>> or we can define them as:
>>>>>>>>>
>>>>>>>>>   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
>>>>>>>>>
>>>>>>>>> We can use them in the same way as row-by-row UDFs:
>>>>>>>>>
>>>>>>>>>   df.withColumn('sum', plus(df.v1, df.v2))
>>>>>>>>>
>>>>>>>>> As for 0-parameter UDFs, we can define and use them as:
>>>>>>>>>
>>>>>>>>>   @pandas_udf(LongType())
>>>>>>>>>   def f0(size):
>>>>>>>>>       return pd.Series(1).repeat(size)
>>>>>>>>>
>>>>>>>>>   df.select(f0())
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The vote will be up for the next 72 hours. Please reply with your
>>>>>>>>> vote:
>>>>>>>>>
>>>>>>>>> +1: Yeah, let's go forward and implement the SPIP.
>>>>>>>>> +0: Don't really care.
>>>>>>>>> -1: I don't think this is a good idea because of the following
>>>>>>>>> technical reasons.
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Takuya UESHIN
>>>>>>>>> Tokyo, Japan
>>>>>>>>>
>>>>>>>>> http://twitter.com/ueshin
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Takuya UESHIN
>>>>>>> Tokyo, Japan
>>>>>>>
>>>>>>> http://twitter.com/ueshin
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Takuya UESHIN
>>>> Tokyo, Japan
>>>>
>>>> http://twitter.com/ueshin
>>>>
>>>
>>>
>>
>>
>> --
>> Sameer Agarwal
>> Software Engineer | Databricks Inc.
>> http://cs.berkeley.edu/~sameerag
>>
>
>
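
For anyone reading the thread later, the contract in the proposal quoted above (equal-length input and output, a 0-parameter UDF that receives only a size, a return type given in string form, and a clear error when the row counts don't match) can be sketched in plain Python. This is only an illustration and not Spark's implementation: `vectorized_udf_sketch`, `_TYPE_NAMES`, and the `size=` keyword are hypothetical names made up for this example, and plain lists stand in for pandas.Series so the snippet runs without pandas or Spark.

```python
from functools import wraps

# Hypothetical mapping from string type names to Python types.
# These are NOT Spark DataType classes; they only illustrate the
# "type in string form" suggestion from the thread.
_TYPE_NAMES = {"long": int, "double": float, "string": str}


def vectorized_udf_sketch(return_type):
    """Hypothetical stand-in for the proposed decorator (not the real API)."""
    py_type = _TYPE_NAMES[return_type]

    def decorator(func):
        @wraps(func)
        def run_batch(*columns, size=None):
            if columns:
                # One or more input columns: all must have the same length.
                expected = len(columns[0])
                if any(len(c) != expected for c in columns):
                    raise ValueError("input columns differ in length")
                result = list(func(*columns))
            else:
                # 0-parameter UDF: the function receives only the batch size.
                if size is None:
                    raise ValueError("a 0-parameter UDF needs a size hint")
                expected = size
                result = list(func(size))

            # The returned column must have as many rows as the input.
            if len(result) != expected:
                raise ValueError(
                    f"UDF {func.__name__!r} returned {len(result)} rows, "
                    f"expected {expected}"
                )
            # Simple data type check on the returned values.
            for value in result:
                if not isinstance(value, py_type):
                    raise TypeError(
                        f"UDF {func.__name__!r} returned {value!r}, "
                        f"which is not of type {return_type!r}"
                    )
            return result

        return run_batch

    return decorator


@vectorized_udf_sketch("double")
def plus(v1, v2):
    # Element-wise sum of two equal-length columns.
    return [a + b for a, b in zip(v1, v2)]


@vectorized_udf_sketch("long")
def f0(size):
    # 0-parameter UDF: produce a constant column of the requested length.
    return [1] * size
```

With these definitions, `plus([1.0, 2.0], [3.0, 4.0])` yields a two-row column and `f0(size=3)` yields a three-row column, while a function that drops or adds rows fails with a `ValueError` naming the function and both row counts. How the size hint actually reaches a 0-parameter UDF is exactly the open question left for the PR review, so the `size=` keyword here should not be read as the agreed design.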