It is very similar to SCALAR, but with SCALAR the output can't be a struct/row, and the input has to be a pd.Series, which can't carry a whole row.
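For reference, this is the SCALAR shape I mean (a minimal sketch; the toy df and plus_one are placeholders, not my real job):

```
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df = spark.range(10).selectExpr("cast(id as double) as x")  # stand-in input

# SCALAR: each argument arrives as a pd.Series (one column at a time),
# and the return type must be a plain type -- no StructType, so there is
# no way to receive or emit a whole row.
@pandas_udf(DoubleType(), PandasUDFType.SCALAR)
def plus_one(x):
    return x + 1.0  # pd.Series in, pd.Series out

df.select(plus_one("x")).show()
```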
I'm doing tensorflow batch inference in spark,
https://github.com/yupbank/tf-spark-serving/blob/master/tss/serving.py#L108
and I have to do the groupBy in order to use the apply function. I'm
wondering why not just enable apply on a df?

On Thu, Mar 7, 2019 at 3:15 PM Sean Owen <sro...@gmail.com> wrote:
> Are you looking for SCALAR? That lets you map one row to one row, but
> do it more efficiently in batch. What are you trying to do?
>
> On Thu, Mar 7, 2019 at 2:03 PM peng yu <yupb...@gmail.com> wrote:
> >
> > I'm looking for a mapPartition(pandas_udf) for a pyspark.sql.DataFrame.
> >
> > ```
> > @pandas_udf(df.schema, PandasUDFType.MAP)
> > def do_nothing(pandas_df):
> >     return pandas_df
> >
> >
> > new_df = df.mapPartition(do_nothing)
> > ```
> > pandas_udf only supports SCALAR or GROUPED_MAP. Why not support just MAP?
> >
> > On Thu, Mar 7, 2019 at 2:57 PM Sean Owen <sro...@gmail.com> wrote:
> >>
> >> Are you looking for @pandas_udf in Python? Or just mapPartitions? Those
> >> exist already.
> >>
> >> On Thu, Mar 7, 2019, 1:43 PM peng yu <yupb...@gmail.com> wrote:
> >>>
> >>> There is a nice map_partition function in R, `dapply`, so that the
> >>> user can pass a row to a udf.
> >>>
> >>> I'm wondering why we don't have that in Python?
> >>>
> >>> I'm trying to have a map_partition function with pandas_udf support.
> >>>
> >>> thanks!
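P.S. For completeness, this is roughly the workaround I'm describing above
(a sketch only; the spark_partition_id() grouping key and the toy df stand
in for my real inputs):

```
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType, spark_partition_id

spark = SparkSession.builder.getOrCreate()
df = spark.range(10).selectExpr("cast(id as double) as x")  # stand-in input

# GROUPED_MAP is the only flavor that hands the udf a whole pd.DataFrame,
# but it requires a groupBy first; the partition id serves as a throwaway
# key so that each partition becomes one group.
@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def do_nothing(pandas_df):
    # whole rows arrive here as one pd.DataFrame, unlike SCALAR's pd.Series
    return pandas_df

new_df = df.groupBy(spark_partition_id()).apply(do_nothing)
```

As far as I can tell, the groupBy adds a shuffle that a direct map over the
DataFrame wouldn't need, which is exactly why I'd like a plain MAP type.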