Re: [pyspark] dataframe map_partition

2019-03-10 Thread Hyukjin Kwon
Because dapply in R and the scalar Pandas UDF in Python are similar and cover each other's use cases. FWIW, it sounds somewhat like SPARK-26413 and SPARK-26412.

Re: [pyspark] dataframe map_partition

2019-03-08 Thread peng yu
Cool, thanks for letting me know. But why not support dapply (http://spark.apache.org/docs/2.0.0/api/R/dapply.html) as is supported in R, so we can just pass in a pandas DataFrame?

Re: [pyspark] dataframe map_partition

2019-03-08 Thread Li Jin
Hi, Pandas UDFs support struct type as input. However, note that each struct will be turned into a Python dict, because pandas itself does not have a native struct type.
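A minimal sketch of what Li Jin describes, assuming a Spark version where a struct column passed to a SCALAR pandas UDF arrives as a Series of Python dicts (the column names and the addition are invented for illustration):

```
import pandas as pd
from pyspark.sql.functions import pandas_udf, struct, PandasUDFType

@pandas_udf("double", PandasUDFType.SCALAR)
def add_fields(s):
    # each element of s is a dict like {"x": 1.0, "y": 2.0}
    return s.apply(lambda row: row["x"] + row["y"])

# df and its columns x, y are hypothetical
result = df.withColumn("x_plus_y", add_fields(struct("x", "y")))
```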

Re: [pyspark] dataframe map_partition

2019-03-08 Thread peng yu
Yeah, that seems to be what I wanted. Does the scalar Pandas UDF support a StructType as input too?

Re: [pyspark] dataframe map_partition

2019-03-08 Thread Bryan Cutler
Hi Peng, I just added support for scalar Pandas UDFs to return a StructType as a pandas DataFrame in https://issues.apache.org/jira/browse/SPARK-23836. Is that the functionality you are looking for? Bryan
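For readers following along, a rough sketch of what SPARK-23836 enables, assuming a Spark build that includes that change (field names and arithmetic are invented for the example):

```
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

# The UDF declares a struct return type and produces it as a pandas
# DataFrame with one column per struct field.
@pandas_udf("pred double, conf double", PandasUDFType.SCALAR)
def predict(v):
    # v is a pandas Series; the DataFrame columns map to the struct fields
    return pd.DataFrame({"pred": v * 2.0, "conf": v * 0.0 + 1.0})

result = df.select(predict(df["value"]).alias("out"))  # "value" is hypothetical
```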

Re: [pyspark] dataframe map_partition

2019-03-07 Thread peng yu
Right now, I'm using the columns-at-a-time mapping: https://github.com/yupbank/tf-spark-serving/blob/master/tss/utils.py#L129

Re: [pyspark] dataframe map_partition

2019-03-07 Thread Sean Owen
Maybe; it depends on what you're doing. It sounds like you are trying to do row-at-a-time mapping, even on a pandas DataFrame. Is what you're doing vectorized? If not, pandas may not help much. Just make the pandas Series into a DataFrame if you want, and turn a single column back into a Series?
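A sketch of that suggestion (the names and the stand-in computation are hypothetical): widen the incoming Series into a one-column pandas DataFrame for the vectorized step, then return one column as a Series:

```
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("double", PandasUDFType.SCALAR)
def score(v):
    pdf = v.to_frame("v")          # pandas Series -> one-column DataFrame
    pdf["out"] = pdf["v"] * 2.0    # stand-in for a DataFrame-based model
    return pdf["out"]              # one column back to a pandas Series
```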

Re: [pyspark] dataframe map_partition

2019-03-07 Thread peng yu
It is very similar to SCALAR, but with SCALAR the output can't be a struct/row and the input has to be a pd.Series, which doesn't hold a row. I'm doing TensorFlow batch inference in Spark (https://github.com/yupbank/tf-spark-serving/blob/master/tss/serving.py#L108), for which I have to do a groupBy first.
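The workaround being described, roughly sketched (the grouping key and the pass-through body are placeholders; the real code would run a TensorFlow model over each pandas DataFrame, and `df` is assumed to exist):

```
from pyspark.sql.functions import pandas_udf, PandasUDFType, spark_partition_id

# GROUPED_MAP hands the UDF a whole pandas DataFrame, but only via groupBy;
# grouping on the partition id is one way to get per-partition blocks.
@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def infer(pdf):
    # pdf is the full pandas DataFrame for one group
    return pdf  # placeholder for model.predict(pdf)

result = df.groupBy(spark_partition_id()).apply(infer)
```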

Re: [pyspark] dataframe map_partition

2019-03-07 Thread peng yu
And in this case, I'm actually benefiting from Arrow's columnar support, so I can pass the whole data block to TensorFlow and obtain the block of predictions all at once.

Re: [pyspark] dataframe map_partition

2019-03-07 Thread peng yu
Pandas/Arrow is for memory efficiency, and mapPartitions is only available on RDDs; for sure I can do everything with RDDs. But I thought that was the whole point of having pandas_udf, so my program runs faster and consumes less memory?

Re: [pyspark] dataframe map_partition

2019-03-07 Thread Sean Owen
Are you just applying a function to every row in the DataFrame? You don't need pandas at all. Just get the RDD of Row from it, map a UDF that makes another Row, and go back to DataFrame. Or make a UDF that operates on all columns and returns a new value. mapPartitions is also available if you want to operate on a whole partition at a time.
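A minimal sketch of that plain-RDD route (field names are hypothetical, and `spark`/`df` are assumed to exist):

```
from pyspark.sql import Row

def transform(row):
    # build a new Row from the old one; no pandas involved
    return Row(x=row["x"] * 2, y=row["y"])

new_df = spark.createDataFrame(df.rdd.map(transform))

# mapPartitions works the same way, one iterator of Rows per partition
new_df2 = spark.createDataFrame(
    df.rdd.mapPartitions(lambda rows: (transform(r) for r in rows)))
```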

Re: [pyspark] dataframe map_partition

2019-03-07 Thread Sean Owen
Are you looking for SCALAR? That lets you map one row to one row, but do it more efficiently in batch. What are you trying to do?
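For reference, a minimal SCALAR pandas UDF (the column name is invented): one output row per input row, but the function receives whole batches as pandas Series, so the work is vectorized:

```
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("double", PandasUDFType.SCALAR)
def times_two(v):
    return v * 2.0  # v is a pandas Series covering a batch of rows

result = df.withColumn("doubled", times_two(df["value"]))
```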

Re: [pyspark] dataframe map_partition

2019-03-07 Thread peng yu
I'm looking for a mapPartition(pandas_udf) for a pyspark DataFrame:

```
@pandas_udf(df.schema, PandasUDFType.MAP)
def do_nothing(pandas_df):
    return pandas_df

new_df = df.mapPartition(do_nothing)
```

pandas_udf only supports SCALAR or GROUPED_MAP. Why not support just MAP?

Re: [pyspark] dataframe map_partition

2019-03-07 Thread Sean Owen
Are you looking for @pandas_udf in Python? Or just mapPartitions? Those exist already.

[pyspark] dataframe map_partition

2019-03-07 Thread peng yu
There is a nice map_partition function in R, `dapply`, so the user can pass rows to a UDF. I'm wondering why we don't have that in Python? I'm trying to have a map_partition function with pandas_udf support. Thanks!