Spark issue with running CrossValidator with RandomForestClassifier on dataset

shivamverma Mon, 13 Jul 2015 02:45:31 -0700

Hi

I am running Spark 1.4 in Standalone mode on top of Hadoop 2.3 on a CentOS
node. I am trying to run grid search on an RF classifier to classify a small
dataset using the pyspark.ml.tuning module, specifically the
ParamGridBuilder and CrossValidator classes. I get the following error when
I try passing a DataFrame of Features-Labels to CrossValidator:




I tried the following code, using the dataset given in Spark's documentation
for CrossValidator
<https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.tuning.CrossValidator>.
I also pass the DF through a StringIndexer transformation for the RF:



Note that the above dataset works with logistic regression. I have also tried
a larger dataset with sparse vectors as features (the one I was originally
trying to fit) but received the same error with RF.

My guess is that there is an issue with how
BinaryClassificationEvaluator(self, rawPredictionCol="rawPrediction",
labelCol="label", metricName="areaUnderROC") interprets the 'rawPrediction'
column: with LR, the rawPredictionCol is a list/vector, whereas with RF,
the prediction column is a double.

Is it an issue with the evaluator, or is there something else that I'm
missing?
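
One way to check this guess would be to look at the column types of the
transformed DataFrame before handing it to the evaluator. The snippet below
is only a rough sketch, not the code from this post: the helper name
describe_raw_prediction and the two hand-written schemas are hypothetical,
and it assumes that vector columns are reported with the type string
"vector" in the (name, type) pairs returned by DataFrame.dtypes.

```python
def describe_raw_prediction(dtypes, raw_col="rawPrediction"):
    """Classify the raw-prediction column in a list of (name, type) pairs.

    `dtypes` is expected to look like the value of DataFrame.dtypes,
    e.g. [("label", "double"), ("rawPrediction", "vector")].
    """
    types = dict(dtypes)
    if raw_col not in types:
        return "missing"
    # Assumption: vector columns are reported with the type string "vector".
    if types[raw_col] == "vector":
        return "vector"
    return "scalar:" + types[raw_col]


# Hypothetical schemas, written by hand for illustration (not real model output).
lr_like = [("features", "vector"), ("label", "double"),
           ("rawPrediction", "vector"), ("prediction", "double")]
rf_like = [("features", "vector"), ("indexed", "double"),
           ("prediction", "double")]

print(describe_raw_prediction(lr_like))
print(describe_raw_prediction(rf_like))
```

With a fitted model, the same helper could be called as
describe_raw_prediction(model.transform(df).dtypes). Anything other than
"vector" would suggest that the evaluator is not receiving the kind of
column it is configured to read.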

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-issue-with-running-CrossValidator-with-RandomForestClassifier-on-dataset-tp23791.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

