Guys,
   Obviously I am doing something wrong. Maybe four points are too small a
dataset.
Can you help me figure out why the following doesn't work?
a) This works:

from numpy import array
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

data = [
   LabeledPoint(0.0, [0.0]),
   LabeledPoint(10.0, [10.0]),
   LabeledPoint(20.0, [20.0]),
   LabeledPoint(30.0, [30.0])
]
lrm = LinearRegressionWithSGD.train(sc.parallelize(data),
initialWeights=array([1.0]))
print lrm
print lrm.weights
print lrm.intercept
lrm.predict([40])

Output:
<pyspark.mllib.regression.LinearRegressionModel object at 0x109813d50>

[ 1.]
0.0

40.0
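As a sanity check, I think (a) succeeds regardless of the optimizer: the data lies exactly on y = x, so with initialWeights=[1.0] the squared-error gradient is already zero and the weight never has to move. A plain-NumPy check of that gradient (my own computation, not MLlib code):

```python
import numpy as np

# Data from (a): y = x exactly; no intercept is fitted (MLlib's default).
x = np.array([0.0, 10.0, 20.0, 30.0])
y = x.copy()

w = 1.0  # same starting point as initialWeights=array([1.0])
# Gradient of the mean squared error with respect to w.
grad = (2.0 / len(x)) * np.sum((w * x - y) * x)
print(grad)  # 0.0 -- the initial weight is already optimal
```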

b) After perturbing y a little, the model gives wrong results:

data = [
   LabeledPoint(0.0, [0.0]),
   LabeledPoint(9.0, [10.0]),
   LabeledPoint(22.0, [20.0]),
   LabeledPoint(32.0, [30.0])
]
lrm = LinearRegressionWithSGD.train(sc.parallelize(data),
initialWeights=array([1.0])) # should be 1.09x -0.60
print lrm
print lrm.weights
print lrm.intercept
lrm.predict([40])

Output:
<pyspark.mllib.regression.LinearRegressionModel object at 0x109666590>

[ -8.20487463e+203]
0.0

-3.2819498532740317e+205
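To see where a weight that huge could come from, I imitated fixed-step, full-batch gradient descent in plain NumPy (my assumption of roughly what the trainer does here: 100 iterations, no intercept; MLlib's actual update also decays the step, so the numbers won't match exactly):

```python
import numpy as np

# Data from (b).
x = np.array([0.0, 10.0, 20.0, 30.0])
y = np.array([0.0, 9.0, 22.0, 32.0])

def gd_weight(step, iterations=100):
    """Fixed-step gradient descent on mean squared error, no intercept."""
    w = 1.0  # same starting point as initialWeights=array([1.0])
    for _ in range(iterations):
        grad = (2.0 / len(x)) * np.sum((w * x - y) * x)
        w -= step * grad
    return w

# With step 1.0 each iteration multiplies the error by
# 1 - step * (2/n) * sum(x**2) = 1 - 700 = -699, so the weight explodes.
print(gd_weight(1.0))
# A small step converges to the no-intercept least-squares slope,
# sum(x*y) / sum(x**2) = 1490/1400.
print(gd_weight(0.001))
```

If that mirrors what happens inside MLlib, passing a smaller step (and/or more iterations) to LinearRegressionWithSGD.train might fix (b) and (c).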

c) Same story here, wrong results (nan this time):

data = [
   LabeledPoint(18.9, [3910.0]),
   LabeledPoint(17.0, [3860.0]),
   LabeledPoint(20.0, [4200.0]),
   LabeledPoint(16.6, [3660.0])
]
lrm = LinearRegressionWithSGD.train(sc.parallelize(data),
initialWeights=array([1.0])) # should be ~ 0.006582x -7.595170
print lrm
print lrm.weights
print lrm.intercept
lrm.predict([4000])

Output:
<pyspark.mllib.regression.LinearRegressionModel object at 0x109666b90>

[ nan]
0.0

nan
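To double-check my hand-computed target for (c), here is an ordinary least-squares fit in plain NumPy (numpy.polyfit, nothing to do with MLlib). My guess is the nan comes from the unscaled feature: with x around 4000, a fixed-step update gets amplified by roughly (2/n)*sum(x**2) ~ 3e7 per iteration and overflows, so scaling the feature (e.g. x/1000) or shrinking the step seems worth trying:

```python
import numpy as np

# Data from (c).
x = np.array([3910.0, 3860.0, 4200.0, 3660.0])
y = np.array([18.9, 17.0, 20.0, 16.6])

slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)  # ~0.006582, ~-7.5952
```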

Cheers & Thanks
<k/>
