It is very similar to SCALAR, but with SCALAR the output can't be a struct/row, and the input has to be a pd.Series, which can't carry a whole row.
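For reference, this is the SCALAR shape I mean (a minimal sketch; the toy df and plus_one are placeholders, not my real job):

```
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df = spark.range(10).selectExpr("cast(id as double) as x")  # stand-in input

# SCALAR: each argument arrives as a pd.Series (one column at a time),
# and the return type must be a plain type -- no StructType, so there is
# no way to receive or emit a whole row.
@pandas_udf(DoubleType(), PandasUDFType.SCALAR)
def plus_one(x):
    return x + 1.0  # pd.Series in, pd.Series out

df.select(plus_one("x")).show()
```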
I'm doing tensorflow batch inference in spark,
https://github.com/yupbank/tf-spark-serving/blob/master/tss/serving.py#L108
and I have to do the groupBy in order to use the apply function. I'm
wondering why not just enable apply on a df?

On Thu, Mar 7, 2019 at 3:15 PM Sean Owen <sro...@gmail.com> wrote:
> Are you looking for SCALAR? That lets you map one row to one row, but
> do it more efficiently in batch. What are you trying to do?
>
> On Thu, Mar 7, 2019 at 2:03 PM peng yu <yupb...@gmail.com> wrote:
> >
> > I'm looking for a mapPartition(pandas_udf) for a pyspark.sql.DataFrame.
> >
> > ```
> > @pandas_udf(df.schema, PandasUDFType.MAP)
> > def do_nothing(pandas_df):
> >     return pandas_df
> >
> >
> > new_df = df.mapPartition(do_nothing)
> > ```
> > pandas_udf only supports SCALAR or GROUPED_MAP. Why not support just MAP?
> >
> > On Thu, Mar 7, 2019 at 2:57 PM Sean Owen <sro...@gmail.com> wrote:
> >>
> >> Are you looking for @pandas_udf in Python? Or just mapPartitions? Those
> >> exist already.
> >>
> >> On Thu, Mar 7, 2019, 1:43 PM peng yu <yupb...@gmail.com> wrote:
> >>>
> >>> There is a nice map_partition function in R, `dapply`, so that the
> >>> user can pass a row to a udf.
> >>>
> >>> I'm wondering why we don't have that in Python?
> >>>
> >>> I'm trying to have a map_partition function with pandas_udf support.
> >>>
> >>> thanks!
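P.S. For completeness, this is roughly the workaround I'm describing above
(a sketch only; the spark_partition_id() grouping key and the toy df stand
in for my real inputs):

```
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType, spark_partition_id

spark = SparkSession.builder.getOrCreate()
df = spark.range(10).selectExpr("cast(id as double) as x")  # stand-in input

# GROUPED_MAP is the only flavor that hands the udf a whole pd.DataFrame,
# but it requires a groupBy first; the partition id serves as a throwaway
# key so that each partition becomes one group.
@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def do_nothing(pandas_df):
    # whole rows arrive here as one pd.DataFrame, unlike SCALAR's pd.Series
    return pandas_df

new_df = df.groupBy(spark_partition_id()).apply(do_nothing)
```

As far as I can tell, the groupBy adds a shuffle that a direct map over the
DataFrame wouldn't need, which is exactly why I'd like a plain MAP type.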