Hi all,

This is just getting stranger… After playing around for a while, it seems that if I have a
vector whose values are all 0 (i.e. all zeros), it classifies it as -1.0. Any
other value in the vector causes it to classify as 1.0:

============================
=== Predictions
============================
(DenseVector(0.0, 0.0, 0.0),-1.0)
(DenseVector(0.0, 0.5, 0.0),1.0)
(DenseVector(1.0, 1.0, 1.0),1.0)
(DenseVector(0.0, 0.0, 0.0),-1.0)
(DenseVector(0.0, 0.5, 1.0),1.0)

So it seems that my values need to be binary for this prediction to work, which
of course does not make sense and doesn’t match the data from the example on
the Flink website. It gives me the impression that it is using the vector itself
as the label instead of the label value…
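
One way this pattern could arise (a minimal sketch, assuming the decision rule
is just the dot product w · x thresholded at 0.0, which is my guess rather than
FlinkML’s actual internals):

// If the learned weights all ended up non-negative, any vector with a
// positive component scores above the threshold (+1.0), while the all-zero
// vector scores exactly 0 and falls through to -1.0.
def predict(weights: Array[Double], x: Array[Double]): Double = {
    val rawScore = weights.zip(x).map { case (w, v) => w * v }.sum
    if (rawScore > 0.0) 1.0 else -1.0
}

predict(Array(0.3, 0.7, 0.2), Array(0.0, 0.0, 0.0)) // -1.0: score is exactly 0
predict(Array(0.3, 0.7, 0.2), Array(0.0, 0.5, 0.0)) //  1.0: any positive feature

That would match the output above, so maybe the learned weights are degenerate
rather than the vector being misused as the label.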

Any insights?

— Mano

On 25 Jun 2018, at 11:40, Mano Swerts 
<mano.swe...@ixxus.com> wrote:

Hi Rong,

As you can see in my test data example, I did change the labels to 8 and
16 instead of 1 and 0.

If SVM always returns +1.0 or -1.0, that would indeed explain where the
1.0 is coming from. But it never gives me -1.0, so something is still
wrong, as it classifies everything under the same label.

Thanks.

— Mano

On 23 Jun 2018, at 20:50, Rong Rong 
<walter...@gmail.com> wrote:

Hi Mano,

Regarding the always-positive prediction result: I think the standard svmguide
data [1] labels records as 0.0 and 1.0 instead of -1.0 and +1.0. Maybe
correcting that will make it work for your case.
Regarding the evaluation pairs: I think SVM in FlinkML will always return
+1.0 or -1.0 when you use it this way as a binary classifier.
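
If it helps, the remapping could look something like this (a minimal sketch,
assuming the data was read via MLUtils.readLibSVM as in your code):

val relabeled: DataSet[LabeledVector] = trainingData.map { lv =>
    // Map the raw 0.0/1.0 labels onto the -1.0/+1.0 that the classifier expects.
    LabeledVector(if (lv.label > 0.0) 1.0 else -1.0, lv.vector)
}
svm.fit(relabeled)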

Thanks,
Rong

[1] https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/svmguide1

On Fri, Jun 22, 2018 at 6:49 AM Mano Swerts 
<mano.swe...@ixxus.com> wrote:

Hi guys,

Here I am again. I am playing with Flink ML and was just trying to get the
example used in the documentation to work:
https://ci.apache.org/projects/flink/flink-docs-release-1.5/dev/libs/ml/quickstart.html#loading-data
(the one using the astroparticle LibSVM data).

My code is basically what you see in the example, with some more output
for verification:


import org.apache.flink.api.scala._
import org.apache.flink.ml.MLUtils
import org.apache.flink.ml.classification.SVM
import org.apache.flink.ml.common.LabeledVector

object LearnDocumentEntityRelationship {

  val trainingDataPath = "/data/svmguide1.training.txt"
  val testDataPath = "/data/svmguide1.test.txt"

  def main(args: Array[String]) {
      val env = ExecutionEnvironment.getExecutionEnvironment

      // Training data in LibSVM format.
      val trainingData: DataSet[LabeledVector] =
          MLUtils.readLibSVM(env, trainingDataPath)

      println("============================")
      println("=== Training Data")
      println("============================")
      trainingData.print()

      // evaluate() expects (vector, truth) pairs.
      val testData = MLUtils.readLibSVM(env, testDataPath)
          .map(x => (x.vector, x.label))

      println("============================")
      println("=== Test Data")
      println("============================")
      testData.print()

      val svm = SVM()
          .setBlocks(env.getParallelism)
          .setIterations(100)
          .setRegularization(0.001)
          .setStepsize(0.1)
          .setSeed(42)

      svm.fit(trainingData)

      // Produces (truth, prediction) pairs.
      val evaluationPairs: DataSet[(Double, Double)] = svm.evaluate(testData)

      println("============================")
      println("=== Evaluation Pairs")
      println("============================")
      evaluationPairs.print()

      // Unlabeled vectors for prediction.
      val realData = MLUtils.readLibSVM(env, testDataPath).map(x => x.vector)

      val predictionDS = svm.predict(realData)

      println("============================")
      println("=== Predictions")
      println("============================")
      predictionDS.print()

      println("=== End")

      // Note: print() on a DataSet executes eagerly, so this explicit
      // execute() is only needed if other (lazy) sinks are defined.
      env.execute("Learn Document Entity Relationship Job")
  }
}


The issue is that the predictions (from both the evaluation pairs and the
prediction dataset) are always equal to “1.0”. When I changed the labels in
the data files to 16 and 8 (so 1 is not a valid label anymore), it still
keeps predicting “1.0” for every single record. I also tried with some
other custom datasets, but I always get the same result.
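
To make that concrete, a quick count of the distinct prediction values (a
sketch reusing the evaluationPairs DataSet from the code above) would show
only a single entry for 1.0, given the behaviour described:

val predictionCounts = evaluationPairs
    .map { case (_, prediction) => (prediction, 1) }
    .groupBy(0)   // group by the prediction value
    .sum(1)       // count occurrences per distinct prediction
predictionCounts.print() // e.g. a single (1.0,<count>) entry if everything is classified as 1.0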

This is a small excerpt of the output (as the data contains too many records
to include here):

============================
=== Test Data
============================
(SparseVector((0,4.236298), (1,21.9821), (2,-0.3503797), (3,97.52163)),16.0)
(SparseVector((0,4.236298), (1,21.9821), (2,-0.3503797), (3,97.52163)),16.0)
(SparseVector((0,77.948), (1,193.678), (2,0.1584834), (3,122.2632)),8.0)
(SparseVector((0,50.24301), (1,312.111), (2,-0.166669), (3,179.9808)),8.0)

============================
=== Evaluation Pairs
============================
(16.0,1.0)
(16.0,1.0)
(8.0,1.0)
(8.0,1.0)

============================
=== Predictions
============================
(SparseVector((0,4.236298), (1,21.9821), (2,-0.3503797), (3,97.52163)),1.0)
(SparseVector((0,4.236298), (1,21.9821), (2,-0.3503797), (3,97.52163)),1.0)
(SparseVector((0,77.948), (1,193.678), (2,0.1584834), (3,122.2632)),1.0)
(SparseVector((0,50.24301), (1,312.111), (2,-0.166669), (3,179.9808)),1.0)


Am I doing something wrong?

Any pointers are greatly appreciated. Thanks!

— Mano
