Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

Hyukjin Kwon Mon, 11 Sep 2017 17:55:15 -0700

+1 (non-binding)


2017-09-12 9:52 GMT+09:00 Yin Huai <yh...@databricks.com>:

> +1
>
> On Mon, Sep 11, 2017 at 5:47 PM, Sameer Agarwal <sam...@databricks.com>
> wrote:
>
>> +1 (non-binding)
>>
>> On Thu, Sep 7, 2017 at 9:10 PM, Bryan Cutler <cutl...@gmail.com> wrote:
>>
>>> +1 (non-binding) for the goals and non-goals of this SPIP.  I think it's
>>> fine to work out the minor details of the API during review.
>>>
>>> Bryan
>>>
>>> On Wed, Sep 6, 2017 at 5:17 AM, Takuya UESHIN <ues...@happy-camper.st>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Thank you for voting and suggestions.
>>>>
>>>> As Wenchen mentioned and also we're discussing at JIRA, we need to
>>>> discuss the size hint for the 0-parameter UDF.
>>>> But I believe we got a consensus about the basic APIs except for the
>>>> size hint, I'd like to submit a pr based on the current proposal and
>>>> continue discussing in its review.
>>>>
>>>>     https://github.com/apache/spark/pull/19147
>>>>
>>>> I'd keep this vote open to wait for more opinions.
>>>>
>>>> Thanks.
>>>>
>>>>
>>>> On Wed, Sep 6, 2017 at 9:48 AM, Wenchen Fan <cloud0...@gmail.com>
>>>> wrote:
>>>>
>>>>> +1 on the design and proposed API.
>>>>>
>>>>> One detail I'd like to discuss is the 0-parameter UDF, how we can
>>>>> specify the size hint. This can be done in the PR review though.
>>>>>
>>>>> On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung <
>>>>> felixcheun...@hotmail.com> wrote:
>>>>>
>>>>>> +1 on this and like the suggestion of type in string form.
>>>>>>
>>>>>> Would it be correct to assume there will be data type check, for
>>>>>> example the returned pandas data frame column data types match what are
>>>>>> specified. We have seen quite a bit of issues/confusions with that in R.
>>>>>>
>>>>>> Would it make sense to have a more generic decorator name so that it
>>>>>> could also be useable for other efficient vectorized format in the 
>>>>>> future?
>>>>>> Or do we anticipate the decorator to be format specific and will have 
>>>>>> more
>>>>>> in the future?
>>>>>>
>>>>>> ------------------------------
>>>>>> *From:* Reynold Xin <r...@databricks.com>
>>>>>> *Sent:* Friday, September 1, 2017 5:16:11 AM
>>>>>> *To:* Takuya UESHIN
>>>>>> *Cc:* spark-dev
>>>>>> *Subject:* Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python
>>>>>>
>>>>>> Ok, thanks.
>>>>>>
>>>>>> +1 on the SPIP for scope etc
>>>>>>
>>>>>>
>>>>>> On API details (will deal with in code reviews as well but leaving a
>>>>>> note here in case I forget)
>>>>>>
>>>>>> 1. I would suggest having the API also accept data type specification
>>>>>> in string form. It is usually simpler to say "long" then "LongType()".
>>>>>>
>>>>>> 2. Think about what error message to show when the rows numbers don't
>>>>>> match at runtime.
>>>>>>
>>>>>>
>>>>>> On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN <ues...@happy-camper.st>
>>>>>> wrote:
>>>>>>
>>>>>>> Yes, the aggregation is out of scope for now.
>>>>>>> I think we should continue discussing the aggregation at JIRA and we
>>>>>>> will be adding those later separately.
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin <r...@databricks.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Is the idea aggregate is out of scope for the current effort and we
>>>>>>>> will be adding those later?
>>>>>>>>
>>>>>>>> On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN <
>>>>>>>> ues...@happy-camper.st> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> We've been discussing to support vectorized UDFs in Python and we
>>>>>>>>> almost got a consensus about the APIs, so I'd like to summarize
>>>>>>>>> and call for a vote.
>>>>>>>>>
>>>>>>>>> Note that this vote should focus on APIs for vectorized UDFs, not
>>>>>>>>> APIs for vectorized UDAFs or Window operations.
>>>>>>>>>
>>>>>>>>> https://issues.apache.org/jira/browse/SPARK-21190
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *Proposed API*
>>>>>>>>>
>>>>>>>>> We introduce a @pandas_udf decorator (or annotation) to define
>>>>>>>>> vectorized UDFs which takes one or more pandas.Series or one
>>>>>>>>> integer value meaning the length of the input value for 0-parameter 
>>>>>>>>> UDFs.
>>>>>>>>> The return value should be pandas.Series of the specified type
>>>>>>>>> and the length of the returned value should be the same as input 
>>>>>>>>> value.
>>>>>>>>>
>>>>>>>>> We can define vectorized UDFs as:
>>>>>>>>>
>>>>>>>>>   @pandas_udf(DoubleType())
>>>>>>>>>   def plus(v1, v2):
>>>>>>>>>       return v1 + v2
>>>>>>>>>
>>>>>>>>> or we can define as:
>>>>>>>>>
>>>>>>>>>   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
>>>>>>>>>
>>>>>>>>> We can use it similar to row-by-row UDFs:
>>>>>>>>>
>>>>>>>>>   df.withColumn('sum', plus(df.v1, df.v2))
>>>>>>>>>
>>>>>>>>> As for 0-parameter UDFs, we can define and use as:
>>>>>>>>>
>>>>>>>>>   @pandas_udf(LongType())
>>>>>>>>>   def f0(size):
>>>>>>>>>       return pd.Series(1).repeat(size)
>>>>>>>>>
>>>>>>>>>   df.select(f0())
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The vote will be up for the next 72 hours. Please reply with your
>>>>>>>>> vote:
>>>>>>>>>
>>>>>>>>> +1: Yeah, let's go forward and implement the SPIP.
>>>>>>>>> +0: Don't really care.
>>>>>>>>> -1: I don't think this is a good idea because of the following 
>>>>>>>>> technical
>>>>>>>>> reasons.
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Takuya UESHIN
>>>>>>>>> Tokyo, Japan
>>>>>>>>>
>>>>>>>>> http://twitter.com/ueshin
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Takuya UESHIN
>>>>>>> Tokyo, Japan
>>>>>>>
>>>>>>> http://twitter.com/ueshin
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Takuya UESHIN
>>>> Tokyo, Japan
>>>>
>>>> http://twitter.com/ueshin
>>>>
>>>
>>>
>>
>>
>> --
>> Sameer Agarwal
>> Software Engineer | Databricks Inc.
>> http://cs.berkeley.edu/~sameerag
>>
>
>

Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

Reply via email to