Re: Pyspark How to groupBy -> fit

2021-01-21 Thread Sean Owen
Yep, that's one approach. It may not really re-read the data N times; for example, if the filtering aligns with the partitioning, you'd only be reading a subset each time. You can also cache the input first to avoid doing the I/O N times. But again, I wonder whether you are at a scale that really needs distributed training.
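
A minimal sketch of the cache-then-filter pattern described here, assuming a SparkSession named spark, a hypothetical grouping column group_col, and precomputed features/label columns (the input path and estimator choice are illustrative); a threaded variant appears after Sean's first reply at the bottom of the thread:

    from pyspark.ml.regression import LinearRegression

    df = spark.read.parquet("/path/to/data")  # hypothetical input path
    df.cache()  # pay the read I/O once, not once per model

    groups = [r[0] for r in df.select("group_col").distinct().collect()]
    models = {}
    for g in groups:
        subset = df.filter(df["group_col"] == g)  # one subset per model
        models[g] = LinearRegression(featuresCol="features", labelCol="label").fit(subset)

The fits here run sequentially; whether each filter re-scans the cached data or touches only some partitions depends on how the data is partitioned, as noted above.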

Re: Pyspark How to groupBy -> fit

2021-01-21 Thread Riccardo Ferrari
Thanks for the answers. I am trying to avoid reading the same data multiple times (once per model). One approach I can think of is filtering on the column I want to split on and training a model per value. I was hoping to find a more elegant approach.
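
One pattern often suggested for this groupBy -> fit shape, though not raised in the thread itself and so offered only as an assumption, is groupBy().applyInPandas on Spark 3.0+, which ships each group to an executor and fits a plain-Python model such as scikit-learn there; the column names ("group_col", "x", "y") are hypothetical:

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    def fit_group(pdf: pd.DataFrame) -> pd.DataFrame:
        # One small model per group; the whole group must fit in executor RAM.
        model = LinearRegression().fit(pdf[["x"]], pdf["y"])
        return pd.DataFrame({"group_col": [pdf["group_col"].iloc[0]],
                             "coef": [float(model.coef_[0])]})

    results = (df.groupBy("group_col")
                 .applyInPandas(fit_group, schema="group_col string, coef double"))

This sidesteps the "Spark within Spark" limitation by training non-distributed models per group, at the cost Mich raises below: every group has to fit in a single executor's memory.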

Re: Pyspark How to groupBy -> fit

2021-01-21 Thread Mich Talebzadeh
I guess one drawback would be that the data cannot be processed and stored in Pandas DataFrames, as those hold all of their data in RAM. If you are going to run multiple jobs in parallel, then a single machine may not be viable?

Re: Pyspark How to groupBy -> fit

2021-01-21 Thread Sean Owen
If you mean you want to train N models in parallel, you wouldn't be able to do that with a groupBy first. You can apply logic to the result of a groupBy with Spark, but you can't use Spark within Spark. You can run N Spark jobs in parallel on the driver, but you'd have to have each one read the subset of data that it needs.
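
A minimal sketch of that suggestion, assuming a SparkSession named spark and the same hypothetical group_col/features/label columns as above; a driver-side thread pool submits the N fits as concurrent Spark jobs:

    from concurrent.futures import ThreadPoolExecutor
    from pyspark.ml.classification import LogisticRegression

    df = spark.read.parquet("/path/to/data").cache()  # hypothetical path
    groups = [r[0] for r in df.select("group_col").distinct().collect()]

    def fit_one(g):
        # Each call submits its own Spark job; submitting jobs from
        # multiple driver threads is supported.
        subset = df.filter(df["group_col"] == g)
        return g, LogisticRegression(featuresCol="features", labelCol="label").fit(subset)

    with ThreadPoolExecutor(max_workers=4) as pool:
        models = dict(pool.map(fit_one, groups))

Caching the input first (as in the reply at the top of the thread) keeps the N concurrent filters from re-reading the source N times.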