I have an issue with an SVM model trained for binary classification using Spark 2.0.0. I have followed the same logic using scikit-learn and MLlib, with the exact same dataset.

For scikit-learn I have the following code:

    svc_model = SVC()
    svc_model.fit(X_train, y_train)

    print "supposed to be 1"
    print svc_model.predict([15.0, 15.0, 0.0, 15.0, 15.0, 4.0, 12.0, 8.0, 0.0, 7.0])
    print svc_model.predict([15.0, 15.0, 15.0, 7.0, 7.0, 15.0, 15.0, 0.0, 12.0, 15.0])
    print svc_model.predict([15.0, 15.0, 7.0, 0.0, 7.0, 0.0, 15.0, 15.0, 15.0, 15.0])
    print svc_model.predict([7.0, 0.0, 15.0, 15.0, 15.0, 15.0, 7.0, 7.0, 15.0, 15.0])

    print "supposed to be 0"
    print svc_model.predict([18.0, 15.0, 7.0, 7.0, 15.0, 0.0, 15.0, 15.0, 15.0, 15.0])
    print svc_model.predict([11.0, 13.0, 7.0, 10.0, 7.0, 13.0, 7.0, 19.0, 7.0, 7.0])
    print svc_model.predict([15.0, 15.0, 18.0, 7.0, 15.0, 15.0, 15.0, 18.0, 7.0, 15.0])
    print svc_model.predict([15.0, 15.0, 8.0, 0.0, 0.0, 8.0, 15.0, 15.0, 15.0, 7.0])

and it returns:

    supposed to be 1
    [0]
    [1]
    [1]
    [1]
    supposed to be 0
    [0]
    [0]
    [0]
    [0]

For Spark I am doing:

    model_svm = SVMWithSGD.train(trainingData, iterations=100)
    model_svm.clearThreshold()

    print "supposed to be 1"
    print model_svm.predict(Vectors.dense(15.0, 15.0, 0.0, 15.0, 15.0, 4.0, 12.0, 8.0, 0.0, 7.0))
    print model_svm.predict(Vectors.dense(15.0, 15.0, 15.0, 7.0, 7.0, 15.0, 15.0, 0.0, 12.0, 15.0))
    print model_svm.predict(Vectors.dense(15.0, 15.0, 7.0, 0.0, 7.0, 0.0, 15.0, 15.0, 15.0, 15.0))
    print model_svm.predict(Vectors.dense(7.0, 0.0, 15.0, 15.0, 15.0, 15.0, 7.0, 7.0, 15.0, 15.0))

    print "supposed to be 0"
    print model_svm.predict(Vectors.dense(18.0, 15.0, 7.0, 7.0, 15.0, 0.0, 15.0, 15.0, 15.0, 15.0))
    print model_svm.predict(Vectors.dense(11.0, 13.0, 7.0, 10.0, 7.0, 13.0, 7.0, 19.0, 7.0, 7.0))
    print model_svm.predict(Vectors.dense(15.0, 15.0, 18.0, 7.0, 15.0, 15.0, 15.0, 18.0, 7.0, 15.0))
    print model_svm.predict(Vectors.dense(15.0, 15.0, 8.0, 0.0, 0.0, 8.0, 15.0, 15.0, 15.0, 7.0))

which returns:

    supposed to be 1
    12.8250120159
    16.0786937313
    14.2139435305
    16.5115589658
    supposed to be 0
    17.1311777004
    14.075461697
    20.8883372052
    12.9132580999

When I set a threshold instead, I get either all zeros or all ones. Does anyone know how to approach this problem? As I said, I have checked multiple times that the dataset and the feature-extraction logic are exactly the same in both cases.
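For reference, here is a minimal sketch of what clearThreshold() changes on the MLlib side. After clearThreshold(), SVMModel.predict() returns the raw margin (w . x + b) instead of a 0/1 label, which is why the Spark numbers above are unbounded scores; setThreshold() switches it back to class predictions. This assumes trainingData is an RDD of LabeledPoint as in the snippet above; it illustrates the API behaviour, not a fix for the discrepancy.

    from pyspark.mllib.classification import SVMWithSGD
    from pyspark.mllib.linalg import Vectors
    # assumes trainingData is an RDD[LabeledPoint] built elsewhere, e.g.:
    # trainingData = sc.parallelize([LabeledPoint(1.0, [15.0, ...]), ...])

    model_svm = SVMWithSGD.train(trainingData, iterations=100)

    # With the threshold cleared, predict() returns the raw margin (a float).
    model_svm.clearThreshold()
    print model_svm.predict(Vectors.dense(15.0, 15.0, 0.0, 15.0, 15.0,
                                          4.0, 12.0, 8.0, 0.0, 7.0))
    # e.g. 12.825... -- an uncalibrated score, not a probability

    # Restoring a threshold makes predict() return 0/1 labels again
    # (0.0 is SVMWithSGD's default decision boundary on the margin).
    model_svm.setThreshold(0.0)
    print model_svm.predict(Vectors.dense(15.0, 15.0, 0.0, 15.0, 15.0,
                                          4.0, 12.0, 8.0, 0.0, 7.0))

Note that every margin printed above is positive, so with the default threshold of 0.0 all eight predictions come out as 1, consistent with the "all zeros or all ones" behaviour described.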
svc_model = SVC() svc_model.fit(X_train, y_train) print "supposed to be 1" print svc_model.predict([15 ,15,0,15,15,4,12,8,0,7]) print svc_model.predict([15.0,15.0,15.0,7.0,7.0,15.0,15.0,0.0,12.0,15.0]) print svc_model.predict([15.0,15.0,7.0,0.0,7.0,0.0,15.0,15.0,15.0,15.0]) print svc_model.predict([7.0,0.0,15.0,15.0,15.0,15.0,7.0,7.0,15.0,15.0]) print "supposed to be 0" print svc_model.predict([18.0, 15.0, 7.0, 7.0, 15.0, 0.0, 15.0, 15.0, 15.0, 15.0]) print svc_model.predict([ 11.0,13.0,7.0,10.0,7.0,13.0,7.0,19.0,7.0,7.0]) print svc_model.predict([ 15.0, 15.0, 18.0, 7.0, 15.0, 15.0, 15.0, 18.0, 7.0, 15.0]) print svc_model.predict([ 15.0, 15.0, 8.0, 0.0, 0.0, 8.0, 15.0, 15.0, 15.0, 7.0]) and it returns: supposed to be 1 [0] [1] [1] [1] supposed to be 0 [0] [0] [0] [0] For spark am doing: model_svm = SVMWithSGD.train(trainingData, iterations=100) model_svm.clearThreshold() print "supposed to be 1" print model_svm.predict(Vectors.dense(15.0,15.0,0.0,15.0,15.0,4.0,12.0,8.0,0.0,7.0)) print model_svm.predict(Vectors.dense(15.0,15.0,15.0,7.0,7.0,15.0,15.0,0.0,12.0,15.0)) print model_svm.predict(Vectors.dense(15.0,15.0,7.0,0.0,7.0,0.0,15.0,15.0,15.0,15.0)) print model_svm.predict(Vectors.dense(7.0,0.0,15.0,15.0,15.0,15.0,7.0,7.0,15.0,15.0)) print "supposed to be 0" print model_svm.predict(Vectors.dense(18.0, 15.0, 7.0, 7.0, 15.0, 0.0, 15.0, 15.0, 15.0, 15.0)) print model_svm.predict(Vectors.dense(11.0,13.0,7.0,10.0,7.0,13.0,7.0,19.0,7.0,7.0)) print model_svm.predict(Vectors.dense(15.0, 15.0, 18.0, 7.0, 15.0, 15.0, 15.0, 18.0, 7.0, 15.0)) print model_svm.predict(Vectors.dense(15.0, 15.0, 8.0, 0.0, 0.0, 8.0, 15.0, 15.0, 15.0, 7.0)) which returns: supposed to be 1 12.8250120159 16.0786937313 14.2139435305 16.5115589658 supposed to be 0 17.1311777004 14.075461697 20.8883372052 12.9132580999 when I am setting the threshold I am either getting all zeros or all ones. Does anyone know how to approach this problem? As I said I have checked multiple times that my dataset and feature extraction logic are exactly the same in both cases. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/scikit-learn-and-mllib-difference-in-predictions-python-tp28240.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe e-mail: user-unsubscr...@spark.apache.org