Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

Noman Khan Wed, 13 Sep 2017 19:45:08 -0700

+1 (non-binding)

Regards
Noman
________________________________
From: Xiao Li <[email protected]>
Sent: Tuesday, September 12, 2017 2:44:26 AM
To: Matei Zaharia; Hyukjin Kwon
Cc: spark-dev
Subject: Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python


+1

Xiao
On Mon, 11 Sep 2017 at 6:44 PM Matei Zaharia <[email protected]> wrote:
+1 (binding)

> On Sep 11, 2017, at 5:54 PM, Hyukjin Kwon <[email protected]> wrote:
>
> +1 (non-binding)
>
>
> 2017-09-12 9:52 GMT+09:00 Yin Huai <[email protected]>:
> +1
>
> On Mon, Sep 11, 2017 at 5:47 PM, Sameer Agarwal <[email protected]> wrote:
> +1 (non-binding)
>
> On Thu, Sep 7, 2017 at 9:10 PM, Bryan Cutler <[email protected]> wrote:
> +1 (non-binding) for the goals and non-goals of this SPIP.  I think it's fine 
> to work out the minor details of the API during review.
>
> Bryan
>
> On Wed, Sep 6, 2017 at 5:17 AM, Takuya UESHIN <[email protected]> wrote:
> Hi all,
>
> Thank you for the votes and suggestions.
>
> As Wenchen mentioned, and as we're also discussing on JIRA, we still need to 
> settle the size hint for the 0-parameter UDF.
> But I believe we have reached consensus on the basic APIs apart from the 
> size hint, so I'd like to submit a PR based on the current proposal and 
> continue the discussion in its review.
>
>     https://github.com/apache/spark/pull/19147
>
> I'll keep this vote open to wait for more opinions.
>
> Thanks.
>
>
> On Wed, Sep 6, 2017 at 9:48 AM, Wenchen Fan <[email protected]> wrote:
> +1 on the design and proposed API.
>
> One detail I'd like to discuss is how we can specify the size hint for the 
> 0-parameter UDF. This can be done in the PR review though.
>
> On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung <[email protected]> wrote:
> +1 on this, and I like the suggestion of accepting the type in string form.
>
> Would it be correct to assume there will be a data type check, for example 
> that the column data types of the returned pandas data frame match what is 
> specified? We have seen quite a few issues and a lot of confusion with that 
> in R.
>
> Would it make sense to have a more generic decorator name so that it could 
> also be usable for other efficient vectorized formats in the future? Or do 
> we anticipate the decorator being format-specific, with more decorators 
> added in the future?
>
> From: Reynold Xin <[email protected]>
> Sent: Friday, September 1, 2017 5:16:11 AM
> To: Takuya UESHIN
> Cc: spark-dev
> Subject: Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python
>
> Ok, thanks.
>
> +1 on the SPIP for scope etc
>
>
> On API details (will deal with in code reviews as well but leaving a note 
> here in case I forget)
>
> 1. I would suggest having the API also accept data type specification in 
> string form. It is usually simpler to say "long" than "LongType()".
>
> 2. Think about what error message to show when the row counts don't match 
> at runtime.
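
On point 2, a minimal sketch of what such a runtime check could look like, written in plain Python rather than as Spark's actual code. The helper name `check_result_length` and the wording of the message are hypothetical and not part of the proposal:

```python
def check_result_length(result, expected_rows, udf_name="<udf>"):
    """Raise a descriptive error if a vectorized UDF returned the wrong row count.

    Hypothetical helper for illustration only; not part of the proposed API.
    """
    actual_rows = len(result)
    if actual_rows != expected_rows:
        raise ValueError(
            f"Vectorized UDF {udf_name!r} returned {actual_rows} rows, "
            f"but its input batch had {expected_rows} rows; "
            "the output length must equal the input length."
        )
    return result


# A result of the right length passes through unchanged.
ok = check_result_length([1.0, 2.0, 3.0], 3, "plus")

# A result of the wrong length produces a message naming both counts.
try:
    check_result_length([1.0, 2.0], 3, "plus")
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

Naming the UDF and both row counts in the message is one possible design; the actual message is left to the code review.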
>
>
> On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN <[email protected]> wrote:
> Yes, aggregation is out of scope for now.
> I think we should continue discussing aggregation on JIRA, and we will add 
> it later separately.
>
> Thanks.
>
>
> On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin <[email protected]> wrote:
> Is the idea that aggregation is out of scope for the current effort and 
> will be added later?
>
> On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN <[email protected]> wrote:
> Hi all,
>
> We've been discussing support for vectorized UDFs in Python and have almost 
> reached consensus on the APIs, so I'd like to summarize and call for a vote.
>
> Note that this vote should focus on APIs for vectorized UDFs, not APIs for 
> vectorized UDAFs or Window operations.
>
> https://issues.apache.org/jira/browse/SPARK-21190
>
>
> Proposed API
>
> We introduce a @pandas_udf decorator (or annotation) to define vectorized 
> UDFs. A vectorized UDF takes one or more pandas.Series or, for a 0-parameter 
> UDF, a single integer giving the length of the input. The return value 
> should be a pandas.Series of the specified type, and its length should be 
> the same as the length of the input.
>
> We can define vectorized UDFs as:
>
>   @pandas_udf(DoubleType())
>   def plus(v1, v2):
>       return v1 + v2
>
> or we can define it as:
>
>   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
>
> We can use it in the same way as row-by-row UDFs:
>
>   df.withColumn('sum', plus(df.v1, df.v2))
>
> As for 0-parameter UDFs, we can define and use them as:
>
>   @pandas_udf(LongType())
>   def f0(size):
>       return pd.Series(1).repeat(size)
>
>   df.select(f0())
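
The length contract described above can be tried without Spark by calling the same function bodies directly on pandas objects. This is only a sketch under stated assumptions: it does not use the proposed `@pandas_udf` decorator, and passing `size` to `f0` by hand stands in for whatever size hint Spark would supply, which is still under discussion.

```python
import pandas as pd


# Same bodies as the proposal's examples, as plain functions (no Spark).
def plus(v1: pd.Series, v2: pd.Series) -> pd.Series:
    # Element-wise addition over whole columns, not row by row.
    return v1 + v2


def f0(size: int) -> pd.Series:
    # 0-parameter case: the only argument is the number of rows to produce.
    return pd.Series(1).repeat(size)


v1 = pd.Series([1.0, 2.0, 3.0])
v2 = pd.Series([10.0, 20.0, 30.0])

total = plus(v1, v2)   # one output value per input row
ones = f0(len(v1))     # size passed by hand here (assumption)

# Both results have the same length as the input columns.
same_length = len(total) == len(v1) and len(ones) == len(v1)
```

In both cases the function sees a whole batch at once, and the returned pandas.Series has exactly as many rows as the input.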
>
>
>
> The vote will be up for the next 72 hours. Please reply with your vote:
>
> +1: Yeah, let's go forward and implement the SPIP.
> +0: Don't really care.
> -1: I don't think this is a good idea because of the following technical 
> reasons.
>
> Thanks!
>
> --
> Takuya UESHIN
> Tokyo, Japan
>
> http://twitter.com/ueshin
>
> --
> Sameer Agarwal
> Software Engineer | Databricks Inc.
> http://cs.berkeley.edu/~sameerag
>
>

