Hi,

It appears that the step size is so high that the model diverges once the 
noise is added. 
Could you try setting the step size to 0.1 or 0.01?
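
To illustrate the divergence, here is a minimal sketch in plain Python (no Spark needed). It assumes full-batch gradient descent on the squared loss without an intercept, with the step shrinking as step / sqrt(t) at iteration t. That schedule and the factor of 2 in the gradient are assumptions about what the optimizer does, and fit_no_intercept is a made-up helper, not an MLlib API.

```python
import math


def fit_no_intercept(xs, ys, step, iterations=100, w0=1.0):
    """Full-batch gradient descent for y ~ w * x (no intercept).

    Hypothetical helper for illustration only; the step / sqrt(t) decay
    and the factor of 2 in the gradient are assumptions, not MLlib code.
    """
    n = len(xs)
    w = w0
    for t in range(1, iterations + 1):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= (step / math.sqrt(t)) * grad
    return w


# Data from case (b) in the quoted message.
xs = [0.0, 10.0, 20.0, 30.0]
ys = [0.0, 9.0, 22.0, 32.0]

# Best no-intercept slope in closed form: sum(x*y) / sum(x*x).
w_best = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

w_large_step = fit_no_intercept(xs, ys, step=1.0)
w_small_step = fit_no_intercept(xs, ys, step=0.01)

print("closed form: %r" % w_best)
print("step=1.0:    %r" % w_large_step)
print("step=0.01:   %r" % w_small_step)
```

In this sketch the features have a mean square of 350, so with a step of 1.0 every update overshoots the minimum by far more than it corrects and the weight grows in magnitude at each iteration. With step=0.01 the updates eventually contract toward the closed-form slope. In PySpark the corresponding setting should be the step keyword of LinearRegressionWithSGD.train (for example step=0.01); please check the signature in your version.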

Best,
Burak

----- Original Message -----
From: "Krishna Sankar" <ksanka...@gmail.com>
To: user@spark.apache.org
Sent: Wednesday, October 1, 2014 12:43:20 PM
Subject: MLlib Linear Regression Mismatch

Guys,
   Obviously I am doing something wrong. Maybe 4 points are too small a
dataset.
Can you help me figure out why the following doesn't work?
a) This works :

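from numpy import array
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

# sc is the SparkContext provided by the pyspark shell
data = [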
   LabeledPoint(0.0, [0.0]),
   LabeledPoint(10.0, [10.0]),
   LabeledPoint(20.0, [20.0]),
   LabeledPoint(30.0, [30.0])
]
lrm = LinearRegressionWithSGD.train(sc.parallelize(data),
                                    initialWeights=array([1.0]))
print lrm
print lrm.weights
print lrm.intercept
lrm.predict([40])

output:
<pyspark.mllib.regression.LinearRegressionModel object at 0x109813d50>

[ 1.]
0.0

40.0

b) After perturbing y a little bit, the model gives wrong results:

data = [
   LabeledPoint(0.0, [0.0]),
   LabeledPoint(9.0, [10.0]),
   LabeledPoint(22.0, [20.0]),
   LabeledPoint(32.0, [30.0])
]
lrm = LinearRegressionWithSGD.train(sc.parallelize(data),
                                    initialWeights=array([1.0]))  # should be 1.09x - 0.60
print lrm
print lrm.weights
print lrm.intercept
lrm.predict([40])

Output:
<pyspark.mllib.regression.LinearRegressionModel object at 0x109666590>

[ -8.20487463e+203]
0.0

-3.2819498532740317e+205

c) Same story here - wrong results. Actually nan:

data = [
   LabeledPoint(18.9, [3910.0]),
   LabeledPoint(17.0, [3860.0]),
   LabeledPoint(20.0, [4200.0]),
   LabeledPoint(16.6, [3660.0])
]
lrm = LinearRegressionWithSGD.train(sc.parallelize(data),
                                    initialWeights=array([1.0]))  # should be ~ 0.006582x - 7.595170
print lrm
print lrm.weights
print lrm.intercept
lrm.predict([4000])

Output:
<pyspark.mllib.regression.LinearRegressionModel object at 0x109666b90>

[ nan]
0.0

nan
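
The expected fits quoted in the comments of (b) and (c) can be checked without Spark using the closed-form ordinary least squares solution with an intercept. The sketch below is plain Python, and ols_fit is a hypothetical helper name. Note that the expected lines include an intercept, while the printed models above all report an intercept of 0.0.

```python
def ols_fit(xs, ys):
    """Closed-form simple linear regression: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Centered sums of cross products and of squares.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Case (b): perturbed labels.
slope_b, intercept_b = ols_fit([0.0, 10.0, 20.0, 30.0],
                               [0.0, 9.0, 22.0, 32.0])

# Case (c): features in the thousands.
slope_c, intercept_c = ols_fit([3910.0, 3860.0, 4200.0, 3660.0],
                               [18.9, 17.0, 20.0, 16.6])

print("(b) y = %.2fx + (%.2f)" % (slope_b, intercept_b))
print("(c) y = %.6fx + (%.4f)" % (slope_c, intercept_c))
```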

Cheers & Thanks
<k/>

