Because dapply in R and the Scalar Pandas UDF in Python are similar and cover each other. FWIW, this somewhat sounds like SPARK-26413 and SPARK-26412.
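For context, here is a minimal pandas-only sketch of the shape SPARK-23836 (mentioned further down the thread) enables: a SCALAR pandas_udf body that takes column batches as `pd.Series` and returns a `pd.DataFrame` whose columns map to the fields of the declared StructType. The function name `featurize` and the schema string are hypothetical, and the Spark wiring is shown only in comments since it needs a live SparkSession:

```python
import pandas as pd

# Hypothetical UDF body: two column batches in, a DataFrame out whose
# columns ("sum", "product") correspond to the StructType fields the
# UDF would declare to Spark.
def featurize(x: pd.Series, y: pd.Series) -> pd.DataFrame:
    return pd.DataFrame({"sum": x + y, "product": x * y})

# Sketch of the Spark side (commented out; requires a SparkSession):
#   from pyspark.sql.functions import pandas_udf, PandasUDFType
#   featurize_udf = pandas_udf("sum double, product double",
#                              PandasUDFType.SCALAR)(featurize)
#   df.select(featurize_udf(df["x"], df["y"]))

out = featurize(pd.Series([1.0, 2.0]), pd.Series([3.0, 4.0]))
print(out.to_dict("list"))
```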
On Sat, Mar 9, 2019 at 12:32 PM, peng yu <yupb...@gmail.com> wrote:

> Cool, thanks for letting me know. But why not support dapply
> http://spark.apache.org/docs/2.0.0/api/R/dapply.html as it is supported in
> R, so we can just pass in a pandas dataframe?
>
> On Fri, Mar 8, 2019 at 6:09 PM Li Jin <ice.xell...@gmail.com> wrote:
>
>> Hi,
>>
>> Pandas UDF supports input as struct type. However, note that it will be
>> turned into a python dict, because pandas itself does not have a native
>> struct type.
>>
>> On Fri, Mar 8, 2019 at 2:55 PM peng yu <yupb...@gmail.com> wrote:
>>
>>> Yeah, that seems to be most likely what i have wanted. Does the scalar
>>> Pandas UDF support a StructType as input too?
>>>
>>> On Fri, Mar 8, 2019 at 2:25 PM Bryan Cutler <cutl...@gmail.com> wrote:
>>>
>>>> Hi Peng,
>>>>
>>>> I just added support for scalar Pandas UDF to return a StructType as a
>>>> Pandas DataFrame in https://issues.apache.org/jira/browse/SPARK-23836.
>>>> Is that the functionality you are looking for?
>>>>
>>>> Bryan
>>>>
>>>> On Thu, Mar 7, 2019 at 1:13 PM peng yu <yupb...@gmail.com> wrote:
>>>>
>>>>> Right now, i'm using the columns-at-a-time mapping:
>>>>> https://github.com/yupbank/tf-spark-serving/blob/master/tss/utils.py#L129
>>>>>
>>>>> On Thu, Mar 7, 2019 at 4:00 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>
>>>>>> Maybe, it depends on what you're doing. It sounds like you are trying
>>>>>> to do row-at-a-time mapping, even on a pandas DataFrame. Is what
>>>>>> you're doing vectorized? If not, it may not help much.
>>>>>> Just make the pandas Series into a DataFrame if you want, and a
>>>>>> single column back into a Series?
>>>>>>
>>>>>> On Thu, Mar 7, 2019 at 2:45 PM peng yu <yupb...@gmail.com> wrote:
>>>>>>
>>>>>>> pandas/arrow is for memory efficiency, and mapPartitions is only
>>>>>>> available on RDDs; for sure i can do everything with an RDD.
>>>>>>>
>>>>>>> But i thought that's the whole point of having pandas_udf, so my
>>>>>>> program runs faster and consumes less memory?
>>>>>>> On Thu, Mar 7, 2019 at 3:40 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Are you just applying a function to every row in the DataFrame? You
>>>>>>>> don't need pandas at all. Just get the RDD of Row from it and map a
>>>>>>>> UDF that makes another Row, and go back to DataFrame. Or make a UDF
>>>>>>>> that operates on all columns and returns a new value. mapPartitions
>>>>>>>> is also available if you want to transform an iterator of Row into
>>>>>>>> another iterator of Row.
>>>>>>>>
>>>>>>>> On Thu, Mar 7, 2019 at 2:33 PM peng yu <yupb...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> It is very similar to SCALAR, but for SCALAR the output can't be a
>>>>>>>>> struct/row, and the input has to be a pd.Series, which doesn't
>>>>>>>>> support a row.
>>>>>>>>>
>>>>>>>>> I'm doing tensorflow batch inference in spark:
>>>>>>>>> https://github.com/yupbank/tf-spark-serving/blob/master/tss/serving.py#L108
>>>>>>>>>
>>>>>>>>> I have to do the groupBy in order to use the apply function; i'm
>>>>>>>>> wondering why not just enable apply on a df?
>>>>>>>>>
>>>>>>>>> On Thu, Mar 7, 2019 at 3:15 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Are you looking for SCALAR? That lets you map one row to one row,
>>>>>>>>>> but do it more efficiently in batch. What are you trying to do?
>>>>>>>>>>
>>>>>>>>>> On Thu, Mar 7, 2019 at 2:03 PM peng yu <yupb...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I'm looking for a mapPartition(pandas_udf) for a
>>>>>>>>>>> pyspark.DataFrame:
>>>>>>>>>>>
>>>>>>>>>>> ```
>>>>>>>>>>> @pandas_udf(df.schema, PandasUDFType.MAP)
>>>>>>>>>>> def do_nothing(pandas_df):
>>>>>>>>>>>     return pandas_df
>>>>>>>>>>>
>>>>>>>>>>> new_df = df.mapPartition(do_nothing)
>>>>>>>>>>> ```
>>>>>>>>>>>
>>>>>>>>>>> pandas_udf only supports SCALAR or GROUPED_MAP. Why not support
>>>>>>>>>>> just MAP?
>>>>>>>>>>> On Thu, Mar 7, 2019 at 2:57 PM Sean Owen <sro...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Are you looking for @pandas_udf in Python? Or just
>>>>>>>>>>>> mapPartitions? Those exist already.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Mar 7, 2019, 1:43 PM peng yu <yupb...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> There is a nice map_partition function in R, `dapply`, so that
>>>>>>>>>>>>> the user can pass a row to a udf.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm wondering why we don't have that in python?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm trying to have a map_partition function with pandas_udf
>>>>>>>>>>>>> supported.
>>>>>>>>>>>>>
>>>>>>>>>>>>> thanks!
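Sean's iterator-of-Row suggestion from the thread can be sketched without a running Spark cluster. This is a minimal sketch, not the thread's actual code: rows are modeled as plain dicts (they would be `pyspark.sql.Row` in Spark), and the column names `x`/`y` and the helper `score_partition` are hypothetical:

```python
import pandas as pd

# Sketch of the mapPartitions pattern: consume an iterator of rows,
# batch the whole partition into a pandas DataFrame so the computation
# is vectorized (the memory/speed point raised in the thread), then
# yield rows back out. Assumes non-empty partitions with x/y columns.
def score_partition(rows):
    batch = pd.DataFrame(list(rows))
    batch["score"] = batch["x"] * 2 + batch["y"]
    for rec in batch.to_dict("records"):
        yield rec

# Hypothetical Spark wiring (commented out; needs a SparkSession):
#   new_df = df.rdd.mapPartitions(
#       lambda it: score_partition(r.asDict() for r in it)).toDF()

out = list(score_partition(iter([{"x": 1, "y": 10}, {"x": 2, "y": 20}])))
print(out)
```

The trade-off Sean and peng discuss is visible here: the batching into a DataFrame only pays off when the per-batch computation is itself vectorized.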