I have an issue with an SVM model trained for binary classification using
Spark 2.0.0.
I have followed the same logic in scikit-learn and in MLlib, on the exact
same dataset.
For scikit-learn I have the following code:

    from sklearn.svm import SVC

    svc_model = SVC()
    svc_model.fit(X_train, y_train)

    print "supposed to be 1"
    # predict expects a 2D array: one row per sample
    print svc_model.predict([[15.0, 15.0, 0.0, 15.0, 15.0, 4.0, 12.0, 8.0, 0.0, 7.0]])
    print svc_model.predict([[15.0, 15.0, 15.0, 7.0, 7.0, 15.0, 15.0, 0.0, 12.0, 15.0]])
    print svc_model.predict([[15.0, 15.0, 7.0, 0.0, 7.0, 0.0, 15.0, 15.0, 15.0, 15.0]])
    print svc_model.predict([[7.0, 0.0, 15.0, 15.0, 15.0, 15.0, 7.0, 7.0, 15.0, 15.0]])

    print "supposed to be 0"
    print svc_model.predict([[18.0, 15.0, 7.0, 7.0, 15.0, 0.0, 15.0, 15.0, 15.0, 15.0]])
    print svc_model.predict([[11.0, 13.0, 7.0, 10.0, 7.0, 13.0, 7.0, 19.0, 7.0, 7.0]])
    print svc_model.predict([[15.0, 15.0, 18.0, 7.0, 15.0, 15.0, 15.0, 18.0, 7.0, 15.0]])
    print svc_model.predict([[15.0, 15.0, 8.0, 0.0, 0.0, 8.0, 15.0, 15.0, 15.0, 7.0]])


and it returns:

    supposed to be 1
    [0]
    [1]
    [1]
    [1]
    supposed to be 0
    [0]
    [0]
    [0]
    [0]
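
For comparison, scikit-learn can also return the raw decision value via
decision_function, which is the closest analogue to what MLlib gives after
clearThreshold() below; note, though, that SVC() defaults to an RBF kernel,
while SVMWithSGD trains a linear SVM. A quick sketch:

    # raw decision value for the first "supposed to be 1" sample
    print svc_model.decision_function([[15.0, 15.0, 0.0, 15.0, 15.0, 4.0, 12.0, 8.0, 0.0, 7.0]])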

For Spark I am doing:

    from pyspark.mllib.classification import SVMWithSGD
    from pyspark.mllib.linalg import Vectors

    model_svm = SVMWithSGD.train(trainingData, iterations=100)

    # clearThreshold() makes predict() return the raw margin rather than a 0/1 label
    model_svm.clearThreshold()

    print "supposed to be 1"
    print model_svm.predict(Vectors.dense(15.0, 15.0, 0.0, 15.0, 15.0, 4.0, 12.0, 8.0, 0.0, 7.0))
    print model_svm.predict(Vectors.dense(15.0, 15.0, 15.0, 7.0, 7.0, 15.0, 15.0, 0.0, 12.0, 15.0))
    print model_svm.predict(Vectors.dense(15.0, 15.0, 7.0, 0.0, 7.0, 0.0, 15.0, 15.0, 15.0, 15.0))
    print model_svm.predict(Vectors.dense(7.0, 0.0, 15.0, 15.0, 15.0, 15.0, 7.0, 7.0, 15.0, 15.0))

    print "supposed to be 0"
    print model_svm.predict(Vectors.dense(18.0, 15.0, 7.0, 7.0, 15.0, 0.0, 15.0, 15.0, 15.0, 15.0))
    print model_svm.predict(Vectors.dense(11.0, 13.0, 7.0, 10.0, 7.0, 13.0, 7.0, 19.0, 7.0, 7.0))
    print model_svm.predict(Vectors.dense(15.0, 15.0, 18.0, 7.0, 15.0, 15.0, 15.0, 18.0, 7.0, 15.0))
    print model_svm.predict(Vectors.dense(15.0, 15.0, 8.0, 0.0, 0.0, 8.0, 15.0, 15.0, 15.0, 7.0))

which returns:

    supposed to be 1
    12.8250120159
    16.0786937313
    14.2139435305
    16.5115589658
    supposed to be 0
    17.1311777004
    14.075461697
    20.8883372052
    12.9132580999

When I set a threshold instead, I get either all zeros or all ones, depending on the value.
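
For reference, this is roughly how I set it; setThreshold() makes predict()
return 0/1 labels again, and the cutoff 15.0 here is just an illustrative value:

    # re-enable label output with a manual cutoff on the raw margin
    model_svm.setThreshold(15.0)  # illustrative value
    print model_svm.predict(Vectors.dense(15.0, 15.0, 0.0, 15.0, 15.0, 4.0, 12.0, 8.0, 0.0, 7.0))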

Does anyone know how to approach this problem?

As I said, I have checked multiple times that my dataset and
feature-extraction logic are exactly the same in both cases.
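
If it helps, here is a minimal sketch of how I can inspect the raw margin
distribution per class on the Spark side (this assumes trainingData is an RDD
of LabeledPoint, as used in the train() call above, and that the threshold is
still cleared):

    # with the threshold cleared, predict() returns raw margins; group them by true label
    margins_by_label = trainingData.map(lambda p: (p.label, model_svm.predict(p.features)))
    for label in (0.0, 1.0):
        # bind label via a default argument so each pass filters on its own value
        scores = margins_by_label.filter(lambda pair, l=label: pair[0] == l).values()
        print "label %s: min %s, mean %s, max %s" % (label, scores.min(), scores.mean(), scores.max())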


