Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

Yin Huai Mon, 11 Sep 2017 17:53:23 -0700

+1

On Mon, Sep 11, 2017 at 5:47 PM, Sameer Agarwal <[email protected]>
wrote:


> +1 (non-binding)
>
> On Thu, Sep 7, 2017 at 9:10 PM, Bryan Cutler <[email protected]> wrote:
>
>> +1 (non-binding) for the goals and non-goals of this SPIP.  I think it's
>> fine to work out the minor details of the API during review.
>>
>> Bryan
>>
>> On Wed, Sep 6, 2017 at 5:17 AM, Takuya UESHIN <[email protected]>
>> wrote:
>>
>>> Hi all,
>>>
>>> Thank you for your votes and suggestions.
>>>
>>> As Wenchen mentioned, and as we are also discussing on JIRA, we need to
>>> discuss the size hint for the 0-parameter UDF.
>>> But I believe we have reached consensus on the basic APIs apart from the
>>> size hint, so I'd like to submit a PR based on the current proposal and
>>> continue the discussion in its review.
>>>
>>>     https://github.com/apache/spark/pull/19147
>>>
>>> I'll keep this vote open to wait for more opinions.
>>>
>>> Thanks.
>>>
>>>
>>> On Wed, Sep 6, 2017 at 9:48 AM, Wenchen Fan <[email protected]> wrote:
>>>
>>>> +1 on the design and proposed API.
>>>>
>>>> One detail I'd like to discuss is how we can specify the size hint for
>>>> the 0-parameter UDF. This can be done in the PR review though.
>>>>
>>>> On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung <[email protected]>
>>>> wrote:
>>>>
>>>>> +1 on this, and I like the suggestion of accepting the type in string form.
>>>>>
>>>>> Would it be correct to assume there will be a data type check, for
>>>>> example that the returned pandas data frame column data types match what
>>>>> is specified? We have seen quite a few issues and confusion with that in R.
>>>>>
>>>>> Would it make sense to have a more generic decorator name so that it
>>>>> could also be usable for other efficient vectorized formats in the future?
>>>>> Or do we anticipate the decorator to be format-specific, with more
>>>>> added in the future?
>>>>>
>>>>> ------------------------------
>>>>> *From:* Reynold Xin <[email protected]>
>>>>> *Sent:* Friday, September 1, 2017 5:16:11 AM
>>>>> *To:* Takuya UESHIN
>>>>> *Cc:* spark-dev
>>>>> *Subject:* Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python
>>>>>
>>>>> Ok, thanks.
>>>>>
>>>>> +1 on the SPIP for scope etc
>>>>>
>>>>>
>>>>> On API details (I will deal with these in code reviews as well, but I am
>>>>> leaving a note here in case I forget):
>>>>>
>>>>> 1. I would suggest having the API also accept data type specification
>>>>> in string form. It is usually simpler to say "long" than "LongType()".
>>>>>
>>>>> 2. Think about what error message to show when the row counts don't
>>>>> match at runtime.
>>>>>
>>>>>
>>>>> On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Yes, aggregation is out of scope for now.
>>>>>> I think we should continue discussing aggregation on JIRA, and we
>>>>>> will add it later, separately.
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>>
>>>>>> On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Is the idea that aggregates are out of scope for the current effort and
>>>>>>> we will be adding those later?
>>>>>>>
>>>>>>> On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> We've been discussing support for vectorized UDFs in Python and have
>>>>>>>> almost reached consensus on the APIs, so I'd like to summarize
>>>>>>>> and call for a vote.
>>>>>>>>
>>>>>>>> Note that this vote should focus on APIs for vectorized UDFs, not
>>>>>>>> APIs for vectorized UDAFs or Window operations.
>>>>>>>>
>>>>>>>> https://issues.apache.org/jira/browse/SPARK-21190
>>>>>>>>
>>>>>>>>
>>>>>>>> *Proposed API*
>>>>>>>>
>>>>>>>> We introduce a @pandas_udf decorator (or annotation) to define
>>>>>>>> vectorized UDFs. A vectorized UDF takes one or more pandas.Series or,
>>>>>>>> for 0-parameter UDFs, one integer value giving the length of the input.
>>>>>>>> The return value should be a pandas.Series of the specified type, and
>>>>>>>> its length should be the same as that of the input.
>>>>>>>>
>>>>>>>> We can define vectorized UDFs as:
>>>>>>>>
>>>>>>>>   @pandas_udf(DoubleType())
>>>>>>>>   def plus(v1, v2):
>>>>>>>>       return v1 + v2
>>>>>>>>
>>>>>>>> or, equivalently, as:
>>>>>>>>
>>>>>>>>   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
>>>>>>>>
>>>>>>>> We can use it in the same way as row-by-row UDFs:
>>>>>>>>
>>>>>>>>   df.withColumn('sum', plus(df.v1, df.v2))
>>>>>>>>
>>>>>>>> As for 0-parameter UDFs, we can define and use them as:
>>>>>>>>
>>>>>>>>   import pandas as pd
>>>>>>>>
>>>>>>>>   @pandas_udf(LongType())
>>>>>>>>   def f0(size):
>>>>>>>>       return pd.Series(1).repeat(size)
>>>>>>>>
>>>>>>>>   df.select(f0())
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> The vote will be up for the next 72 hours. Please reply with your
>>>>>>>> vote:
>>>>>>>>
>>>>>>>> +1: Yeah, let's go forward and implement the SPIP.
>>>>>>>> +0: Don't really care.
>>>>>>>> -1: I don't think this is a good idea because of the following
>>>>>>>> technical reasons.
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> --
>>>>>>>> Takuya UESHIN
>>>>>>>> Tokyo, Japan
>>>>>>>>
>>>>>>>> http://twitter.com/ueshin
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Takuya UESHIN
>>>>>> Tokyo, Japan
>>>>>>
>>>>>> http://twitter.com/ueshin
>>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Takuya UESHIN
>>> Tokyo, Japan
>>>
>>> http://twitter.com/ueshin
>>>
>>
>>
>
>
> --
> Sameer Agarwal
> Software Engineer | Databricks Inc.
> http://cs.berkeley.edu/~sameerag
>
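
A minimal sketch of the length contract described in the proposal quoted
above (the returned value must have the same length as the input), and of
the kind of runtime error message one might show when row counts don't
match. It assumes plain Python lists as stand-ins for pandas.Series so that
it runs without Spark or pandas. The names length_checked and drop_negatives
are hypothetical, chosen for illustration only; they are not part of PySpark
or of the proposed API.

```python
import functools


def length_checked(func):
    """Wrap a batch function and verify its output length matches the input.

    Hypothetical helper for illustration; not part of PySpark.
    """
    @functools.wraps(func)
    def wrapper(*columns):
        if not columns:
            raise ValueError("expected at least one input column")
        # The batch size is taken from the first input column.
        expected = len(columns[0])
        result = func(*columns)
        if len(result) != expected:
            raise ValueError(
                "%s returned %d rows, but the input batch has %d rows"
                % (func.__name__, len(result), expected)
            )
        return result
    return wrapper


@length_checked
def plus(v1, v2):
    # Element-wise addition over a whole batch, one output row per input row.
    return [a + b for a, b in zip(v1, v2)]


@length_checked
def drop_negatives(v):
    # Deliberately violates the contract: filtering changes the row count.
    return [x for x in v if x >= 0]
```

Calling plus([1.0, 2.0], [3.0, 4.0]) returns a batch of the same length as
its inputs, while drop_negatives([1, -1, 2]) raises a ValueError that names
the function and reports both row counts.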
