Could you please open a JIRA for it? The maxBins parameter is missing from
the Python API.

Would it be possible for you to use the current master? There, you should
be able to use trees with the Pipeline API and DataFrames.
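
For reference, here is a rough sketch of what that could look like with the
Pipeline API. This assumes the spark.ml GBTRegressor and VectorIndexer Python
wrappers currently on master; the DataFrame `df` and the column names below
are just placeholders:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.regression import GBTRegressor

# Flag features with <= 2000 distinct values as categorical so the trees
# know their arity (your largest categorical feature has 1895 categories).
indexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures",
                        maxCategories=2000)

# maxBins is exposed here and must be >= the largest category count.
gbt = GBTRegressor(featuresCol="indexedFeatures", labelCol="label",
                   maxDepth=6, maxIter=3, maxBins=1900)

model = Pipeline(stages=[indexer, gbt]).fit(df)
predictions = model.transform(df)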

Best,
Burak

On Wed, May 20, 2015 at 2:44 PM, Don Drake <dondr...@gmail.com> wrote:

> I'm running Spark v1.3.1 and when I run the following against my dataset:
>
> model = GradientBoostedTrees.trainRegressor(trainingData,
>     categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3)
>
> The job will fail with the following message:
> Traceback (most recent call last):
>   File "/Users/drake/fd/spark/mltest.py", line 73, in <module>
>     model = GradientBoostedTrees.trainRegressor(trainingData, categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3)
>   File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/tree.py", line 553, in trainRegressor
>     loss, numIterations, learningRate, maxDepth)
>   File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/tree.py", line 438, in _train
>     loss, numIterations, learningRate, maxDepth)
>   File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/common.py", line 120, in callMLlibFunc
>     return callJavaFunc(sc, api, *args)
>   File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/common.py", line 113, in callJavaFunc
>     return _java2py(sc, func(*args))
>   File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
>   File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
> 15/05/20 16:40:12 INFO BlockManager: Removing block rdd_32_95
> py4j.protocol.Py4JJavaError: An error occurred while calling o69.trainGradientBoostedTreesModel.
> : java.lang.IllegalArgumentException: requirement failed: DecisionTree requires maxBins (= 32) >= max categories in categorical features (= 1895)
> at scala.Predef$.require(Predef.scala:233)
> at org.apache.spark.mllib.tree.impl.DecisionTreeMetadata$.buildMetadata(DecisionTreeMetadata.scala:128)
> at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:138)
> at org.apache.spark.mllib.tree.DecisionTree.run(DecisionTree.scala:60)
> at org.apache.spark.mllib.tree.GradientBoostedTrees$.org$apache$spark$mllib$tree$GradientBoostedTrees$$boost(GradientBoostedTrees.scala:150)
> at org.apache.spark.mllib.tree.GradientBoostedTrees.run(GradientBoostedTrees.scala:63)
> at org.apache.spark.mllib.tree.GradientBoostedTrees$.train(GradientBoostedTrees.scala:96)
> at org.apache.spark.mllib.api.python.PythonMLLibAPI.trainGradientBoostedTreesModel(PythonMLLibAPI.scala:595)
>
> So it's complaining about maxBins. If I provide maxBins=1900 and re-run
> it:
>
> model = GradientBoostedTrees.trainRegressor(trainingData,
>     categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3, maxBins=1900)
>
> Traceback (most recent call last):
>   File "/Users/drake/fd/spark/mltest.py", line 73, in <module>
>     model = GradientBoostedTrees.trainRegressor(trainingData, categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3, maxBins=1900)
> TypeError: trainRegressor() got an unexpected keyword argument 'maxBins'
>
> It now says it knows nothing of maxBins.
>
> If I run the same command against DecisionTree or RandomForest (with
> maxBins=1900) it works just fine.
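>
> For comparison, the RandomForest call that does accept maxBins looks
> roughly like this (sketch only; numTrees=3 is just an arbitrary value
> here):
>
> from pyspark.mllib.tree import RandomForest
>
> rf_model = RandomForest.trainRegressor(trainingData,
>     categoricalFeaturesInfo=catFeatures, numTrees=3,
>     maxDepth=6, maxBins=1900)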
>
> Seems like a bug in GradientBoostedTrees.
>
> Suggestions?
>
> -Don
>
> --
> Donald Drake
> Drake Consulting
> http://www.drakeconsulting.com/
> 800-733-2143
>
