Yep, that's one approach. It may not actually re-read the data N times; for
example, if the filtering aligns with the partitioning, you'd only be reading
a subset each time. You can also cache the input first to avoid paying the
I/O cost N times.
But again I wonder if you are at a scale that really needs distributed
training.
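
Something like this is what I have in mind; a rough sketch, where the input
path, the 'segment' column, and train_model() are placeholders for your
actual setup:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/data/input")  # hypothetical input path
df.cache()  # materialize once so the per-key filters below don't hit storage N times
df.count()  # action to force the cache to fill

keys = [r["segment"] for r in df.select("segment").distinct().collect()]

models = {}
for key in keys:
    subset = df.filter(df["segment"] == key)
    models[key] = train_model(subset.toPandas())  # train_model is a stand-in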
Thanks for the answers.
I am trying to avoid reading the same data multiple times (once per model).
One approach I can think of is filtering on the column I want to split on
and training a model on each subset. I was hoping to find a more elegant
approach.
On Thu, Jan 21, 2021 at 5:28 PM Sean Owen wrote:
I guess one drawback would be that the data cannot be processed and stored
in pandas DataFrames, since those hold everything in RAM. If you are going
to run multiple parallel jobs, then a single machine may not be viable?
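
For concreteness, I'm picturing the pandas route as something like Spark 3's
applyInPandas, where each group arrives at a worker as one in-memory pandas
DataFrame (the column names and model here are made up):

import pandas as pd
from sklearn.linear_model import LinearRegression
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/data/input")  # hypothetical; columns segment, x, y

def fit_per_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # Each call receives one whole group as a pandas DataFrame in executor RAM.
    model = LinearRegression().fit(pdf[["x"]], pdf["y"])
    return pd.DataFrame({"segment": [pdf["segment"].iloc[0]],
                         "coef": [float(model.coef_[0])]})

coefs = (df.groupBy("segment")
           .applyInPandas(fit_per_group, schema="segment string, coef double"))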
On Thu, 21 Jan 2021 at 16:29, Sean Owen wrote:
If you mean you want to train N models in parallel, you wouldn't be able to
do that with a groupBy first. You can apply logic to the result of a groupBy
with Spark, but you can't use Spark within Spark. You can run N Spark jobs
in parallel from the driver, but you'd have to have each one read the subset
of data that it needs.
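
Roughly like this; a sketch with a thread pool on the driver, where the
'segment' column, the path, and train_model() are again hypothetical:

from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
keys = ["a", "b", "c"]  # hypothetical distinct values of the split column

def train_one(key):
    # Each thread submits its own Spark job through the shared SparkSession,
    # reading only the slice of data it needs.
    subset = spark.read.parquet("/data/input").filter(f"segment = '{key}'")
    return key, train_model(subset.toPandas())  # train_model is a stand-in

with ThreadPoolExecutor(max_workers=4) as pool:
    models = dict(pool.map(train_one, keys))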