Hi,

Currently, the RandomForest implementation takes a single maxBins value to decide the
number of splits per feature. This can make training time very high when a
single categorical column has a sufficiently large number of unique values,
because that one column forces a high bin count on all the numeric
(continuous) columns, even though that many splits are not required for
them.
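
For context, here is a minimal sketch of the current behavior using the
RDD-based MLlib API (the dataset, the arity of 500 for feature 0, and the
other parameter values are made-up numbers for illustration):

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.rdd.RDD

// Hypothetical training data: feature 0 is categorical with 500
// distinct values; the remaining features are continuous.
val data: RDD[LabeledPoint] = ??? // loaded elsewhere

// maxBins must be at least the largest categorical arity, so the one
// high-cardinality column forces 500 bins onto every continuous
// feature as well, even though ~32 bins would be plenty for those.
val model = RandomForest.trainClassifier(
  data,
  numClasses = 2,
  categoricalFeaturesInfo = Map(0 -> 500), // feature 0: 500 categories
  numTrees = 100,
  featureSubsetStrategy = "auto",
  impurity = "gini",
  maxDepth = 5,
  maxBins = 500, // driven up by the single categorical column
  seed = 42)
```

With separate limits, the continuous features could keep a small bin
count while only the categorical column pays for its cardinality.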

Encoding the categorical column into features makes the data very wide,
which requires us to increase maxMemoryInMB and puts more pressure on the
GC as well.

Keeping separate maxBins values for categorical and continuous features
should be useful in this regard.

--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Different-maxBins-value-for-categorical-and-continuous-features-in-RandomForest-implementation-tp17099.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
