Why doesn't Spark ML's RandomForestClassifier support maxBins (M) smaller than the total number of distinct categorical values (K)? <https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.RandomForestClassifier.html#pyspark.ml.classification.RandomForestClassifier.maxBins>
My understanding of decision tree bins is that...

> Statistical data binning is basically a form of quantization where you map a set of numbers with continuous values into *smaller*, more manageable "bins."

(from https://clevertap.com/blog/numerical-vs-categorical-variables-decision-trees/)

...which makes it seem like you wouldn't ever really want to use M > K in any case, yet the Spark docs seem to imply the opposite:

> Must be >=2 and >= number of categories for any categorical feature

Plus, when I use the random forest implementation in H2O (https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/drf.html), I do have the option of using fewer bins than the total number of distinct categorical values.

Could anyone explain the reason for this restriction in Spark? Is there some particular data preprocessing / feature engineering users are expected to have done beforehand? Or am I misunderstanding something about decision trees (e.g., do categorical features never really *need* to be binned in the first place, with the setting only mattering for numerical values)?
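For concreteness, here is a minimal sketch of the kind of setup I'm describing (the toy data, column names, and the choice of maxBins=2 are just placeholders for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.getOrCreate()

# Toy data: one categorical column with K = 4 distinct values
df = spark.createDataFrame(
    [("a", 0.0), ("b", 1.0), ("c", 0.0), ("d", 1.0), ("a", 1.0)],
    ["cat", "label"],
)

# StringIndexer attaches nominal-attribute metadata to "cat_idx",
# which is how the tree learner knows the feature has 4 categories
indexer = StringIndexer(inputCol="cat", outputCol="cat_idx")
assembler = VectorAssembler(inputCols=["cat_idx"], outputCol="features")

# maxBins = 2 < K = 4, which I would expect to just group the
# categories into 2 bins (as H2O allows), but Spark rejects it
rf = RandomForestClassifier(labelCol="label", featuresCol="features", maxBins=2)

indexed = indexer.fit(df).transform(df)
assembled = assembler.transform(indexed)
rf.fit(assembled)  # fails at fit time, complaining (roughly) that maxBins
                   # must be at least the number of values in each
                   # categorical feature
```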