Hi,

We have been using PySpark's groupby().apply() quite a bit, and it has been
very helpful in integrating Spark with our existing pandas-heavy libraries.

Recently, we have found more and more cases where groupby().apply() is not
sufficient: in some cases, we want to group two dataframes by the same
key and, for each key, apply a function that takes two pd.DataFrames and
returns a pd.DataFrame. This feels very much like the "cogroup"
operation in the RDD API.

It would be great to be able to do something like this (not actual API, just
to explain the use case):

@pandas_udf(return_schema, ...)
def my_udf(pdf1, pdf2):
    # pdf1 and pdf2 are the subsets of the original dataframes that are
    # associated with a particular key
    result = ...  # some code that uses pdf1 and pdf2
    return result

df3 = cogroup(df1, df2, key='some_key').apply(my_udf)

I have searched around this problem, and some people have suggested joining
the two tables first. However, the computation often doesn't fit the join
pattern, and it is hard to get it to work using joins.
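
For what it's worth, the closest workaround I can see today is dropping down
to the RDD cogroup API and rebuilding the pandas frames by hand. A rough
sketch, assuming df1 and df2 both have a 'some_key' column and my_udf here is
a plain Python function (no pandas_udf decorator) returning a pd.DataFrame
whose columns match return_schema:

import pandas as pd

def apply_cogrouped(key_and_groups):
    # key_and_groups is (key, (rows_from_df1, rows_from_df2))
    key, (rows1, rows2) = key_and_groups
    pdf1 = pd.DataFrame([r.asDict() for r in rows1])
    pdf2 = pd.DataFrame([r.asDict() for r in rows2])
    result = my_udf(pdf1, pdf2)
    # emit result rows as plain tuples in return_schema's column order
    return list(result.itertuples(index=False, name=None))

cogrouped = (df1.rdd.keyBy(lambda r: r['some_key'])
                    .cogroup(df2.rdd.keyBy(lambda r: r['some_key'])))
df3 = spark.createDataFrame(cogrouped.flatMap(apply_cogrouped),
                            schema=return_schema)

This works, but every group goes through plain Python serialization rather
than Arrow, and it loses the pandas_udf ergonomics, which is exactly why a
first-class cogroup().apply() would be so useful.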

I wonder what people's thoughts are on this?

Li
