Yep, that's one approach. It may not really re-read the data N times; for
example, if the filtering aligns with the partitioning, you'd only be reading
a subset each time. You can also cache the input first to avoid doing the I/O
N times. But again, I wonder whether you're at a scale that really needs
distributed training.

On Thu, Jan 21, 2021 at 1:52 PM Riccardo Ferrari <ferra...@gmail.com> wrote:

> Thanks for the answers.
>
> I am trying to avoid reading the same data multiple times (once per model).
>
> One approach I can think of is 'filtering' on the column I want to split
> on and training each model separately. I was hoping to find a more elegant
> approach.
>
>
>
> On Thu, Jan 21, 2021 at 5:28 PM Sean Owen <sro...@gmail.com> wrote:
>
>> If you mean you want to train N models in parallel, you wouldn't be able
>> to do that with a groupBy first. With Spark you apply logic to the result
>> of a groupBy, but you can't use Spark within Spark. You can run N Spark
>> jobs in parallel from the driver, but each one would have to read the
>> subset of data it's meant to model separately.
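>>
>> Roughly something like this (untested sketch; it assumes a Pipeline
>> "pipeline" and the list of group values "groups" already exist, and the
>> path and "group_id" column are placeholders):
>>
>>   from concurrent.futures import ThreadPoolExecutor
>>
>>   def fit_one(group_value):
>>       # each call kicks off its own Spark job on the shared SparkSession,
>>       # reading and keeping only the rows for that group
>>       subset = spark.read.parquet("/path/to/input")
>>       subset = subset.filter("group_id = '{}'".format(group_value))
>>       return pipeline.fit(subset)
>>
>>   with ThreadPoolExecutor(max_workers=4) as pool:
>>       models = dict(zip(groups, pool.map(fit_one, groups)))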
>>
>> A pandas UDF is a fine solution here, because I assume that implies your
>> groups aren't that big, so there's probably no need for a Spark ML
>> pipeline for each group.
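>>
>> For example, with applyInPandas (the newer form of groupBy().apply();
>> untested sketch, and the columns f1/f2/label/group_id and the scikit-learn
>> model are just placeholders):
>>
>>   import pandas as pd
>>   from sklearn.linear_model import LogisticRegression
>>
>>   def train(pdf: pd.DataFrame) -> pd.DataFrame:
>>       # runs on an executor with one whole group as a plain pandas DataFrame
>>       model = LogisticRegression().fit(pdf[["f1", "f2"]], pdf["label"])
>>       return pd.DataFrame({"group_id": [pdf["group_id"].iloc[0]],
>>                            "coef": [model.coef_[0].tolist()]})
>>
>>   results = df.groupBy("group_id").applyInPandas(
>>       train, schema="group_id string, coef array<double>")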
>>
>>
>> On Thu, Jan 21, 2021 at 9:20 AM Riccardo Ferrari <ferra...@gmail.com>
>> wrote:
>>
>>> Hi list,
>>>
>>> I am looking for an efficient solution to apply a training pipeline to
>>> each group of a DataFrame.groupBy.
>>>
>>> This is very easy with a pandas UDF (i.e. groupBy().apply()), but I am
>>> not able to find the equivalent for a Spark pipeline.
>>>
>>> The ultimate goal is to fit multiple models, one per group of data.
>>>
>>> Thanks,
>>>
>>>
