Why doesn't Spark ML's RandomForestClassifier support maxBins (M) smaller than the total number of distinct categorical values (K)? <https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.RandomForestClassifier.html#pyspark.ml.classification.RandomForestClassifier.maxBins>
My understanding of decision tree bins is that...

> Statistical data binning is basically a form of quantization where you map a set of numbers with continuous values into *smaller*, more manageable "bins."

(from https://clevertap.com/blog/numerical-vs-categorical-variables-decision-trees/)

...which makes it seem like you wouldn't ever really want to use M > K in any case, yet the Spark docs seem to imply the opposite:

> Must be >=2 and >= number of categories for any categorical feature

Plus, when I use the random forest implementation in H2O (https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/drf.html), I do have the option of using fewer bins than the total number of distinct categorical values.

Could anyone explain the reason for this restriction in Spark? Is there some particular data preprocessing / feature engineering users are expected to have done beforehand? Or am I misunderstanding something about decision trees (e.g., do categorical features never really *need* to be binned in the first place, with the setting only mattering for numerical values)?
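For concreteness, here is a minimal sketch of the kind of setup I'm describing (the toy data, column names, and the choice of maxBins=2 are just placeholders for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.getOrCreate()

# Toy data: one categorical column with K = 4 distinct values
df = spark.createDataFrame(
    [("a", 0.0), ("b", 1.0), ("c", 0.0), ("d", 1.0), ("a", 1.0)],
    ["cat", "label"],
)

# StringIndexer attaches nominal-attribute metadata to "cat_idx",
# which is how the tree learner knows the feature has 4 categories
indexer = StringIndexer(inputCol="cat", outputCol="cat_idx")
assembler = VectorAssembler(inputCols=["cat_idx"], outputCol="features")

# maxBins = 2 < K = 4, which I would expect to just group the
# categories into 2 bins (as H2O allows), but Spark rejects it
rf = RandomForestClassifier(labelCol="label", featuresCol="features", maxBins=2)

indexed = indexer.fit(df).transform(df)
assembled = assembler.transform(indexed)
rf.fit(assembled)  # fails at fit time, complaining (roughly) that maxBins
                   # must be at least the number of values in each
                   # categorical feature
```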