Re: Advice on Scaling RandomForest

2016-06-07 Thread Franc Carter
I'm using DataFrames, the types are all doubles, and I'm only extracting what I need. The caveat is that I am porting an existing system for a client, and for their business it's likely to be cheaper to throw hardware (in AWS) at the problem for a couple of hours than to re-engineer their algorith

Re: Advice on Scaling RandomForest

2016-06-07 Thread Jörn Franke
Before hardware optimization there is always software optimization. Are you using Dataset / DataFrame? Are you using the right data types (e.g. int where int is appropriate; try to avoid string, char, etc.)? Do you extract only what is needed? What are the algorithm parameters? > On 07 Jun 201

Advice on Scaling RandomForest

2016-06-07 Thread Franc Carter
Hi, I am training a RandomForest regression model on Spark 1.6.1 (EMR) and am interested in how best to scale it, e.g. more CPUs per instance, more memory per instance, more instances, etc. I'm currently using 32 m3.xlarge instances for a training set with 2.5 million rows, 1300 c
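
For anyone weighing CPUs against memory here, a rough back-of-envelope sizing may help frame the question. The sketch below is only illustrative and rests on assumptions not confirmed in the thread: that the truncated "1300 c" refers to columns, that every value is an 8-byte double stored densely, and that an m3.xlarge offers 4 vCPUs and 15 GiB of memory (the figures AWS published for that instance type). It ignores JVM, Spark, and executor overhead entirely.

```python
# Back-of-envelope sizing for a dense table of doubles (illustrative only).
BYTES_PER_DOUBLE = 8


def dense_size_gib(rows: int, cols: int, bytes_per_value: int = BYTES_PER_DOUBLE) -> float:
    """Raw size in GiB of rows x cols fixed-width values, ignoring all overhead."""
    return rows * cols * bytes_per_value / 2**30


rows = 2_500_000           # from the post
cols = 1300                # ASSUMPTION: the truncated "1300 c" means columns
instances = 32             # from the post
mem_per_instance_gib = 15  # ASSUMPTION: m3.xlarge memory per AWS published specs
vcpus_per_instance = 4     # ASSUMPTION: m3.xlarge vCPUs per AWS published specs

raw_gib = dense_size_gib(rows, cols)
cluster_mem_gib = instances * mem_per_instance_gib
cluster_vcpus = instances * vcpus_per_instance

print(f"raw data: {raw_gib:.1f} GiB")
print(f"cluster memory: {cluster_mem_gib} GiB, vCPUs: {cluster_vcpus}")
print(f"raw data / cluster memory: {raw_gib / cluster_mem_gib:.1%}")
```

Under those assumptions the raw data is a small fraction of aggregate cluster memory. That only bounds the input size; it says nothing about what tree training itself needs, which depends on the algorithm parameters asked about elsewhere in the thread.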