I have an issue with an SVM model trained for binary classification using
Spark 2.0.0.
I have followed the same logic using scikit-learn and MLlib, using the exact
same dataset.
For scikit-learn I have the following code:
from sklearn.svm import SVC

# Train with scikit-learn's default SVC settings
svc_model = SVC()
svc_model.fit(X_train, y_train)
print "supposed to be 1"
print svc_model.predict([15, 15, 0, 15, 15, 4, 12, 8, 0, 7])
print svc_model.predict([15.0, 15.0, 15.0, 7.0, 7.0, 15.0, 15.0, 0.0, 12.0, 15.0])
print svc_model.predict([15.0, 15.0, 7.0, 0.0, 7.0, 0.0, 15.0, 15.0, 15.0, 15.0])
print svc_model.predict([7.0, 0.0, 15.0, 15.0, 15.0, 15.0, 7.0, 7.0, 15.0, 15.0])
print "supposed to be 0"
print svc_model.predict([18.0, 15.0, 7.0, 7.0, 15.0, 0.0, 15.0, 15.0, 15.0, 15.0])
print svc_model.predict([11.0, 13.0, 7.0, 10.0, 7.0, 13.0, 7.0, 19.0, 7.0, 7.0])
print svc_model.predict([15.0, 15.0, 18.0, 7.0, 15.0, 15.0, 15.0, 18.0, 7.0, 15.0])
print svc_model.predict([15.0, 15.0, 8.0, 0.0, 0.0, 8.0, 15.0, 15.0, 15.0, 7.0])
and it returns:
supposed to be 1
[0]
[1]
[1]
[1]
supposed to be 0
[0]
[0]
[0]
[0]
For Spark I am doing:
from pyspark.mllib.classification import SVMWithSGD
from pyspark.mllib.linalg import Vectors

model_svm = SVMWithSGD.train(trainingData, iterations=100)
# Clear the threshold so predict() returns raw scores instead of 0/1 labels
model_svm.clearThreshold()
print "supposed to be 1"
print model_svm.predict(Vectors.dense(15.0, 15.0, 0.0, 15.0, 15.0, 4.0, 12.0, 8.0, 0.0, 7.0))
print model_svm.predict(Vectors.dense(15.0, 15.0, 15.0, 7.0, 7.0, 15.0, 15.0, 0.0, 12.0, 15.0))
print model_svm.predict(Vectors.dense(15.0, 15.0, 7.0, 0.0, 7.0, 0.0, 15.0, 15.0, 15.0, 15.0))
print model_svm.predict(Vectors.dense(7.0, 0.0, 15.0, 15.0, 15.0, 15.0, 7.0, 7.0, 15.0, 15.0))
print "supposed to be 0"
print model_svm.predict(Vectors.dense(18.0, 15.0, 7.0, 7.0, 15.0, 0.0, 15.0, 15.0, 15.0, 15.0))
print model_svm.predict(Vectors.dense(11.0, 13.0, 7.0, 10.0, 7.0, 13.0, 7.0, 19.0, 7.0, 7.0))
print model_svm.predict(Vectors.dense(15.0, 15.0, 18.0, 7.0, 15.0, 15.0, 15.0, 18.0, 7.0, 15.0))
print model_svm.predict(Vectors.dense(15.0, 15.0, 8.0, 0.0, 0.0, 8.0, 15.0, 15.0, 15.0, 7.0))
which returns:
supposed to be 1
12.8250120159
16.0786937313
14.2139435305
16.5115589658
supposed to be 0
17.1311777004
14.075461697
20.8883372052
12.9132580999
When I set a threshold, I get either all zeros or all ones.
Does anyone know how to approach this problem?
As I said, I have checked multiple times that my dataset and feature
extraction logic are exactly the same in both cases.
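
To make the threshold behaviour concrete, here is a minimal sketch in plain Python (no Spark needed) that applies a few cut-offs to the raw scores listed above. It assumes the MLlib model labels a point 1 when its raw score is greater than the threshold and 0 otherwise. The helper name apply_threshold and the three threshold values are hypothetical, chosen only for illustration.

```python
# Raw scores copied from the Spark output above (after clearThreshold()).
expected_one = [12.8250120159, 16.0786937313, 14.2139435305, 16.5115589658]
expected_zero = [17.1311777004, 14.075461697, 20.8883372052, 12.9132580999]


def apply_threshold(scores, threshold):
    """Hypothetical helper: label 1 if the raw score exceeds the threshold, else 0.

    Assumption: this mirrors how the MLlib model turns a raw score into a label.
    """
    return [1 if s > threshold else 0 for s in scores]


# Illustrative thresholds: below all scores, inside their range, above all scores.
for t in (0.0, 15.0, 25.0):
    print("threshold=%s expected_one=%s expected_zero=%s" % (
        t, apply_threshold(expected_one, t), apply_threshold(expected_zero, t)))
```

Every score above is positive and lies between roughly 12.8 and 20.9, so a threshold below that range labels everything 1 and a threshold above it labels everything 0. A threshold inside the range (15.0 here) gives mixed labels, but the scores of the two groups overlap, so no single cut-off separates these eight points.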