Well, apparently, the above Python set-up is wrong. Please consider the following set-up, which DOES use the 'linear' kernel... And the question remains the same: how should the Spark results be interpreted (or why are the Spark results NOT bounded between -1 and 1)?
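For what it's worth, unbounded scores are exactly what a linear model produces: the raw SVM output w.x + b scales with the magnitude of w, so it is a signed margin, not a calibrated score. One common way to make scores comparable across models is to divide by ||w||, giving the geometric distance to the hyperplane. A minimal, framework-free sketch (the weights, intercept, and point below are made up for illustration):

```python
# Sketch: why raw linear-SVM scores are unbounded, and one way to make
# them comparable. Pure Python; w, b and x are hypothetical values.
import math

def raw_score(w, b, x):
    """Raw decision value w.x + b -- unbounded; only its sign gives the class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def geometric_margin(w, b, x):
    """Signed distance of x to the hyperplane: (w.x + b) / ||w||.
    Dividing out the scale of w makes scores comparable across models."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return raw_score(w, b, x) / norm

w, b = [2.0, 2.0], -1.0           # hypothetical weights and intercept
x = [10.0, 10.0]
print(raw_score(w, b, x))         # 39.0 -- far outside [-1, 1]
print(geometric_margin(w, b, x))  # same point, as a scale-free distance
```

Note this still does not give a probability; for that one would need an explicit calibration step (e.g. Platt scaling) on held-out data.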
On Mon, Oct 6, 2014 at 8:35 PM, Sunny Khatri <sunny.k...@gmail.com> wrote:

> One difference I can find is that you may have different kernel functions for
> your training: in Spark you end up using a linear kernel, whereas in
> scikit-learn you are using the RBF kernel. That can explain the difference in
> the coefficients you are getting.
>
> On Mon, Oct 6, 2014 at 10:15 AM, Adamantios Corais
> <adamantios.cor...@gmail.com> wrote:
>
>> Hi again,
>>
>> Finally, I found the time to play around with your suggestions.
>> Unfortunately, I noticed some unusual behavior in the MLlib results, which
>> is more obvious when I compare them against their scikit-learn equivalents.
>> Note that I am currently using Spark 0.9.2. Long story short: I find it
>> difficult to interpret the results. The scikit-learn SVM always returns a
>> value between 0 and 1, which makes it easy for me to set a threshold in
>> order to keep only the most significant classifications (this is the case
>> for both short and long input vectors). On the other hand, Spark MLlib
>> makes it impossible to interpret the results: they are hardly ever bounded
>> between -1 and +1, and hence it is impossible to choose a good cut-off
>> value; the results are of no practical use. And here is the strangest thing
>> of all: although it seems that MLlib does NOT generate the right weights
>> and intercept, when I feed MLlib with the weights and intercept from
>> scikit-learn, the results become pretty accurate! Any ideas about what is
>> happening? Any suggestion is highly appreciated.
>>
>> PS: to make things easier I have quoted both of my implementations as well
>> as the results below.
>>
>> //////////////////////////////////////////////////
>>
>> SPARK (short input):
>>
>> training_error: Double = 0.0
>> res2: Array[Double] = Array(-1.4420684459128205E-19, -1.4420684459128205E-19,
>>   -1.4420684459128205E-19, 0.3749999999999999, 0.7499999999999998,
>>   0.7499999999999998, 0.7499999999999998)
>>
>> SPARK (long input):
>>
>> training_error: Double = 0.0
>> res2: Array[Double] = Array(-0.782207630902241, -0.782207630902241,
>>   -0.782207630902241, 0.9522394329769612, 2.6866864968561632,
>>   2.6866864968561632, 2.6866864968561632)
>>
>> PYTHON (short input):
>>
>> array([[-1.00000001],
>>        [-1.00000001],
>>        [-1.00000001],
>>        [-0.        ],
>>        [ 1.00000001],
>>        [ 1.00000001],
>>        [ 1.00000001]])
>>
>> PYTHON (long input):
>>
>> array([[-1.00000001],
>>        [-1.00000001],
>>        [-1.00000001],
>>        [-0.        ],
>>        [ 1.00000001],
>>        [ 1.00000001],
>>        [ 1.00000001]])
>>
>> //////////////////////////////////////////////////
>>
>> import analytics.MSC
>>
>> import java.util.Calendar
>> import java.text.SimpleDateFormat
>> import scala.collection.mutable
>> import scala.collection.JavaConversions._
>> import org.apache.spark.SparkContext._
>> import org.apache.spark.mllib.classification.SVMWithSGD
>> import org.apache.spark.mllib.regression.LabeledPoint
>> import org.apache.spark.mllib.optimization.L1Updater
>> import com.datastax.bdp.spark.connector.CassandraConnector
>> import com.datastax.bdp.spark.SparkContextCassandraFunctions._
>>
>> val sc = MSC.sc
>> val lg = MSC.logger
>>
>> // val s_users_double_2 = Seq(
>> //   (0.0, Seq(0.0, 0.0, 0.0)),
>> //   (0.0, Seq(0.0, 0.0, 0.0)),
>> //   (0.0, Seq(0.0, 0.0, 0.0)),
>> //   (1.0, Seq(1.0, 1.0, 1.0)),
>> //   (1.0, Seq(1.0, 1.0, 1.0)),
>> //   (1.0, Seq(1.0, 1.0, 1.0))
>> // )
>> val s_users_double_2 = Seq(
>>   (0.0, Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>             0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>             0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>             0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0,
>>             0.0, 0.0)),
>>   (0.0, Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>             0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>             0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>             0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0,
>>             0.0, 0.0)),
>>   (0.0, Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>             0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>             0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>             0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0,
>>             0.0, 0.0)),
>>   (1.0, Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>             1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>             1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>             1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0,
>>             1.0, 1.0)),
>>   (1.0, Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>             1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>             1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>             1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0,
>>             1.0, 1.0)),
>>   (1.0, Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>             1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>             1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>             1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0,
>>             1.0, 1.0))
>> )
>> val s_users_double = sc.parallelize(s_users_double_2)
>>
>> val s_users_parsed = s_users_double.map { line =>
>>   LabeledPoint(line._1, line._2.toArray)
>> }.cache()
>>
>> val iterations = 100
>>
>> val model = SVMWithSGD.train(s_users_parsed, iterations)
>>
>> val predictions1 = s_users_parsed.map { point =>
>>   (point.label, model.predict(point.features))
>> }.cache()
>>
>> val training_error = predictions1.filter(r => r._1 != r._2).count().toDouble /
>>   s_users_parsed.count()
>>
>> val TP = predictions1.map(s => s._1 == 1.0 && s._2 == 1.0).filter(t => t).count()
>> val FP = predictions1.map(s => s._1 == 0.0 && s._2 == 1.0).filter(t => t).count()
>> val TN = predictions1.map(s => s._1 == 0.0 && s._2 == 0.0).filter(t => t).count()
>> val FN = predictions1.map(s => s._1 == 1.0 && s._2 == 0.0).filter(t => t).count()
>>
>> val weights = model.weights
>>
>> val intercept = model.intercept
>>
>> // val m_users_double_2 = Seq(
>> //   Seq(0.0, 0.0, 0.0),
>> //   Seq(0.0, 0.0, 0.0),
>> //   Seq(0.0, 0.0, 0.0),
>> //   Seq(0.5, 0.5, 0.5),
>> //   Seq(1.0, 1.0, 1.0),
>> //   Seq(1.0, 1.0, 1.0),
>> //   Seq(1.0, 1.0, 1.0)
>> // )
>> val m_users_double_2 = Seq(
>>   Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0,
>>       0.0, 0.0),
>>   Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0,
>>       0.0, 0.0),
>>   Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0,
>>       0.0, 0.0),
>>   Seq(0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
>>       0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
>>       0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
>>       0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 2.0, 0.5,
>>       0.5, 0.5),
>>   Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>       1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>       1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>       1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0,
>>       1.0, 1.0),
>>   Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>       1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>       1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>       1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0,
>>       1.0, 1.0),
>>   Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>       1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>       1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>       1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0,
>>       1.0, 1.0)
>> )
>> val m_users_double = sc.parallelize(m_users_double_2)
>>
>> val predictions2 = m_users_double.map { point =>
>>   point.zip(weights).map(a => a._1 * a._2).sum + intercept
>> }.cache()
>>
>> predictions2.collect()
>>
>> //////////////////////////////////////////////////
>>
>> from sklearn import svm
>>
>> flag = 'short'  # or 'long'
>>
>> if flag == 'short':
>>     X = [
>>         [0.0, 0.0, 0.0],
>>         [0.0, 0.0, 0.0],
>>         [0.0, 0.0, 0.0],
>>         [1.0, 1.0, 1.0],
>>         [1.0, 1.0, 1.0],
>>         [1.0, 1.0, 1.0]
>>     ]
>>     Y = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
>>     T = [
>>         [0.0, 0.0, 0.0],
>>         [0.0, 0.0, 0.0],
>>         [0.0, 0.0, 0.0],
>>         [0.5, 0.5, 0.5],
>>         [1.0, 1.0, 1.0],
>>         [1.0, 1.0, 1.0],
>>         [1.0, 1.0, 1.0]
>>     ]
>>
>> if flag == 'long':
>>     X = [
>>         [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>          0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>          0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>          0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0,
>>          0.0, 0.0],
>>         [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>          0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>          0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>          0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0,
>>          0.0, 0.0],
>>         [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>          0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>          0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>          0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0,
>>          0.0, 0.0],
>>         [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>          1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>          1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>          1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0,
>>          1.0, 1.0],
>>         [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>          1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>          1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>          1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0,
>>          1.0, 1.0],
>>         [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>          1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>          1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>          1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0,
>>          1.0, 1.0]
>>     ]
>>     Y = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
>>     T = [
>>         [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>          0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>          0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>          0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0,
>>          0.0, 0.0],
>>         [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>          0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>          0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>          0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0,
>>          0.0, 0.0],
>>         [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>          0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>          0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>          0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0,
>>          0.0, 0.0],
>>         [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
>>          0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
>>          0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
>>          0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 2.0, 0.5,
>>          0.5, 0.5],
>>         [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>          1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>          1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>          1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0,
>>          1.0, 1.0],
>>         [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>          1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>          1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>          1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0,
>>          1.0, 1.0],
>>         [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>          1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>          1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>          1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0,
>>          1.0, 1.0]
>>     ]
>>
>> clf = svm.SVC()
>> clf.fit(X, Y)
>> svm.SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
>>         gamma=0.0, kernel='rbf', max_iter=-1, probability=False,
>>         random_state=None, shrinking=True, tol=0.001, verbose=False)
>> clf.decision_function(T)
>>
>> ///////////////////////////////////////////////////
>>
>> On Thu, Sep 25, 2014 at 2:25 AM, Sunny Khatri <sunny.k...@gmail.com> wrote:
>>
>>> For multi-class classification you can use the same SVMWithSGD (which does
>>> binary classification) with a one-vs-all approach: construct the respective
>>> training corpora with class i as the positive samples and the rest of the
>>> classes as the negative ones, and then use the same method provided by Aris
>>> as a measure of how far class i is from the decision boundary.
>>>
>>> On Wed, Sep 24, 2014 at 4:06 PM, Aris <arisofala...@gmail.com> wrote:
>>>
>>>> Greetings, Adamantios Corais... if that is indeed your name...
>>>>
>>>> Just to follow up on Liquan: you might be interested in removing the
>>>> threshold and then treating the predictions as a probability from 0..1
>>>> inclusive. SVM with the linear kernel is a straightforward linear
>>>> classifier, so with model.clearThreshold() you can get the raw predicted
>>>> scores, removing the threshold that simply translates them into a
>>>> positive/negative class.
>>>>
>>>> The API is here:
>>>> http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.mllib.classification.SVMModel
>>>>
>>>> Enjoy!
>>>> Aris
>>>>
>>>> On Sun, Sep 21, 2014 at 11:50 PM, Liquan Pei <liquan...@gmail.com> wrote:
>>>>
>>>>> Hi Adamantios,
>>>>>
>>>>> For your first question: after you train the SVM, you get a model with a
>>>>> vector of weights w and an intercept b; points x such that w.dot(x) + b = 1
>>>>> or w.dot(x) + b = -1 are points that lie on the decision boundary. The
>>>>> quantity w.dot(x) + b for a point x is a confidence measure of the
>>>>> classification.
>>>>>
>>>>> Code-wise, suppose you trained your model via
>>>>>
>>>>>     val model = SVMWithSGD.train(...)
>>>>>
>>>>> You can then set a threshold by calling
>>>>>
>>>>>     model.setThreshold(yourThresholdHere)
>>>>>
>>>>> to set the threshold that separates positive predictions from negative
>>>>> predictions.
>>>>>
>>>>> For more info, please take a look at
>>>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.classification.SVMModel
>>>>>
>>>>> For your second question: SVMWithSGD only supports binary classification.
>>>>>
>>>>> Hope this helps,
>>>>>
>>>>> Liquan
>>>>>
>>>>> On Sun, Sep 21, 2014 at 11:22 PM, Adamantios Corais
>>>>> <adamantios.cor...@gmail.com> wrote:
>>>>>
>>>>>> Nobody?
>>>>>>
>>>>>> If that's not supported already, can you please, at least, give me a few
>>>>>> hints on how to implement it?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> On Fri, Sep 19, 2014 at 7:43 PM, Adamantios Corais
>>>>>> <adamantios.cor...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am working with the SVMWithSGD classification algorithm on Spark.
>>>>>>> It works fine for me; however, I would like to distinguish the instances
>>>>>>> that are classified with high confidence from those with low confidence.
>>>>>>> How do we define the threshold here? Ultimately, I want to keep only
>>>>>>> those for which the algorithm is very *very* certain about its decision!
>>>>>>> How to do that? Is this feature already supported by any MLlib
>>>>>>> algorithm? What if I had multiple categories?
>>>>>>>
>>>>>>> Any input is highly appreciated!
>>>>>
>>>>> --
>>>>> Liquan Pei
>>>>> Department of Physics
>>>>> University of Massachusetts Amherst
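Putting Sunny's one-vs-all suggestion together with the threshold discussion above: train one binary scorer per class, predict the class with the largest raw margin w.x + b, and reject the prediction when no margin clears a confidence threshold. A minimal framework-free sketch (the class labels, per-class weights, and threshold below are invented for illustration, not taken from the thread):

```python
# Sketch of one-vs-all classification with a confidence cut-off.
# Each class gets its own hypothetical (weights, intercept) pair; the
# predicted class is the one with the largest raw margin, and we abstain
# when even the winning margin is below the threshold.

def margin(w, b, x):
    # Raw decision value w.x + b of one binary scorer.
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict_ova(models, x, threshold=0.0):
    """models: {class_label: (weights, intercept)}.
    Returns (best_class, best_margin), or (None, best_margin) when the
    winning margin does not clear the threshold (a low-confidence reject)."""
    scores = {c: margin(w, b, x) for c, (w, b) in models.items()}
    best = max(scores, key=scores.get)
    if scores[best] < threshold:
        return None, scores[best]
    return best, scores[best]

models = {                           # hypothetical per-class models
    "a": ([1.0, 0.0], 0.0),
    "b": ([0.0, 1.0], 0.0),
}
print(predict_ova(models, [2.0, 0.5], threshold=1.0))  # ('a', 2.0)
print(predict_ova(models, [0.1, 0.2], threshold=1.0))  # (None, 0.2)
```

In MLlib terms, each per-class scorer would correspond to an SVMWithSGD model trained with clearThreshold() applied, so that predict returns the raw margin rather than a 0/1 label.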