I'm using dataframes, the types are all doubles, and I'm only extracting what I
need.
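
Roughly the kind of thing I mean (the column names and the S3 path here are
made up for illustration; sc is the existing SparkContext):

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.col

    val sqlContext = new SQLContext(sc)

    // Load the raw data, then keep only the columns the model needs,
    // all cast to double.
    val raw = sqlContext.read.parquet("s3://my-bucket/training/")
    val training = raw.select(
      col("label").cast("double"),
      col("feature1").cast("double"),
      col("feature2").cast("double")
    )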
The caveat on these is that I am porting an existing system for a client,
and for their business it's likely to be cheaper to throw hardware (in AWS)
at the problem for a couple of hours than to re-engineer their algorithm.
Before hardware optimization, there is always software optimization.
Are you using Dataset/DataFrame? Are you using the right data types (e.g. int
where int is appropriate; try to avoid string and char, etc.)?
Do you extract only what you need? What are the algorithm parameters?
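
On the last question: assuming the spark.ml RandomForestRegressor API, these
are the kinds of knobs I mean (the values here are illustrative, not
recommendations):

    import org.apache.spark.ml.regression.RandomForestRegressor

    val rf = new RandomForestRegressor()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setNumTrees(100)        // cost grows roughly linearly with tree count
      .setMaxDepth(10)         // deeper trees are exponentially more expensive
      .setMaxBins(32)          // bins used to discretise continuous features
      .setSubsamplingRate(0.8) // fraction of rows sampled per tree

    val model = rf.fit(training)  // training: DataFrame with label/features

maxDepth and maxBins in particular drive memory use during training, so they
are worth checking before scaling hardware.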
> On 07 Jun 201
Hi,
I am training a RandomForest regression model on Spark 1.6.1 (EMR) and am
interested in how best to scale it, e.g. more CPUs per instance, more memory
per instance, more instances, etc. (roughly the settings sketched below).
I'm currently using 32 m3.xlarge instances for a training set with 2.5
million rows, 1300 c
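
For reference, the knobs I'm weighing map to Spark settings roughly like this
(illustrative values sized against an m3.xlarge's 4 vCPUs / 15 GB, not
something I've settled on):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("rf-training")
      .set("spark.executor.instances", "32")             // more instances
      .set("spark.executor.cores", "4")                  // more CPUs per executor
      .set("spark.executor.memory", "9g")                // more memory per executor
      .set("spark.yarn.executor.memoryOverhead", "1536") // off-heap headroom for YARN
    val sc = new SparkContext(conf)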