If you mean you want to train N models in parallel, you can't do that with a
groupBy first. The logic you apply to the result of a groupBy runs on the
executors, and you can't use Spark within Spark. You can run N Spark jobs in
parallel from the driver, but each one would have to read the subset of data
that it's meant to model separately.
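
For example, a minimal sketch of that driver-side approach, assuming a
DataFrame df with a "group" column and a Spark ML pipeline already defined
(both names are illustrative):

    from concurrent.futures import ThreadPoolExecutor

    def fit_for_group(g):
        # Each job filters the full dataset down to one group and fits
        # its own copy of the pipeline on the driver's thread pool.
        return g, pipeline.fit(df.filter(df.group == g))

    groups = [r.group for r in df.select("group").distinct().collect()]
    with ThreadPoolExecutor(max_workers=4) as pool:
        models = dict(pool.map(fit_for_group, groups))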

A pandas UDF is a fine solution here: I assume the setup implies your groups
aren't that big, so there's probably no need for a Spark pipeline within
each group.
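
As a sketch of that pandas UDF route (here via Spark 3.x's applyInPandas;
the column names and the scikit-learn model are just for illustration, and
this assumes each group fits comfortably in one executor's memory):

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    def train(pdf: pd.DataFrame) -> pd.DataFrame:
        # Fit an ordinary local model on one group's rows and return a
        # one-row summary; you could also pickle the model to storage here.
        model = LinearRegression().fit(pdf[["x"]], pdf["y"])
        return pd.DataFrame({"group": [pdf["group"].iloc[0]],
                             "coef": [float(model.coef_[0])]})

    result = df.groupBy("group").applyInPandas(
        train, schema="group string, coef double")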


On Thu, Jan 21, 2021 at 9:20 AM Riccardo Ferrari <ferra...@gmail.com> wrote:

> Hi list,
>
> I am looking for an efficient solution to apply a training pipeline to
> each group of a DataFrame.groupBy.
>
> This is very easy with a pandas UDF (i.e. groupBy().apply()), but I am not
> able to find the equivalent for a Spark pipeline.
>
> The ultimate goal is to fit multiple models, one per group of data.
>
> Thanks,
>
>
