Hi, I am running Spark 1.4 in standalone mode on top of Hadoop 2.3 on a CentOS node. I am trying to run a grid search on an RF (random forest) classifier to classify a small dataset using the pyspark.ml.tuning module, specifically the ParamGridBuilder and CrossValidator classes. I get the following error when I try passing a DataFrame of features and labels to CrossValidator:
I tried the following code, using the dataset given in Spark's documentation for CrossValidator <https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.tuning.CrossValidator>. I also pass the DF through a StringIndexer transformation for the RF:

Note that the above dataset works with logistic regression. I have also tried a larger dataset with sparse vectors as features (which I was originally trying to fit) but received the same error with RF.

My guess is that there is an issue with how BinaryClassificationEvaluator(self, rawPredictionCol="rawPrediction", labelCol="label", metricName="areaUnderROC") interprets the 'rawPrediction' column: with LR, the rawPredictionCol is a list/vector, whereas with RF, the prediction column is a double. Is it an issue with the evaluator, or is there something else that I'm missing?

Thanks!

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-issue-with-running-CrossValidator-with-RandomForestClassifier-on-dataset-tp23791.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
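To make the guess about the evaluator more concrete, here is a minimal pure-Python sketch of the areaUnderROC metric. It is not Spark's implementation; the helper name `area_under_roc` and the toy scores and labels are made up for illustration. It only shows that the metric is a ranking measure, so it gives different answers for continuous scores (like one component of a rawPrediction vector) and for hard 0/1 predictions (like a double prediction column).

```python
def area_under_roc(scores, labels):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win (the Mann-Whitney U formulation).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data, not the dataset from the Spark documentation.
labels = [0, 0, 1, 1, 0, 1]
# Continuous scores, standing in for one component of a rawPrediction vector.
raw_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
# Hard class predictions obtained by thresholding the scores at 0.5.
hard_predictions = [1.0 if s >= 0.5 else 0.0 for s in raw_scores]

auc_from_scores = area_under_roc(raw_scores, labels)
auc_from_labels = area_under_roc(hard_predictions, labels)
print(auc_from_scores, auc_from_labels)
```

The two values differ because thresholding discards the ranking information within each class. That would be consistent with the evaluator wanting a score-like column rather than a double prediction, though it does not by itself confirm where the error comes from.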