Guys, obviously I am doing something wrong. Maybe 4 points is too small a dataset. Can you help me figure out why the following doesn't work? a) This works:

    data = [
        LabeledPoint(0.0, [0.0]),
        LabeledPoint(10.0, [10.0]),
        LabeledPoint(20.0, [20.0]),
        LabeledPoint(30.0, [30.0])
    ]
    lrm = LinearRegressionWithSGD.train(sc.parallelize(data), initialWeights=array([1.0]))
    print lrm
    print lrm.weights
    print lrm.intercept
    lrm.predict([40])

Output:

    <pyspark.mllib.regression.LinearRegressionModel object at 0x109813d50>
    [ 1.]
    0.0
    40.0

b) By perturbing the y a little bit, the model gives wrong results:

    data = [
        LabeledPoint(0.0, [0.0]),
        LabeledPoint(9.0, [10.0]),
        LabeledPoint(22.0, [20.0]),
        LabeledPoint(32.0, [30.0])
    ]
    lrm = LinearRegressionWithSGD.train(sc.parallelize(data), initialWeights=array([1.0]))  # should be 1.09x -0.60
    print lrm
    print lrm.weights
    print lrm.intercept
    lrm.predict([40])

Output:

    <pyspark.mllib.regression.LinearRegressionModel object at 0x109666590>
    [ -8.20487463e+203]
    0.0
    -3.2819498532740317e+205

c) Same story here - wrong results. Actually nan:

    data = [
        LabeledPoint(18.9, [3910.0]),
        LabeledPoint(17.0, [3860.0]),
        LabeledPoint(20.0, [4200.0]),
        LabeledPoint(16.6, [3660.0])
    ]
    lrm = LinearRegressionWithSGD.train(sc.parallelize(data), initialWeights=array([1.0]))  # should be ~ 0.006582x -7.595170
    print lrm
    print lrm.weights
    print lrm.intercept
    lrm.predict([4000])

Output:

    <pyspark.mllib.regression.LinearRegressionModel object at 0x109666b90>
    [ nan]
    0.0
    nan

Cheers & Thanks
<k/>
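
For reference, the "should be" values in the comments for b) and c) are what an ordinary least-squares fit with an intercept gives on those four points. Below is a minimal pure-Python sketch of that check. It does not use Spark, and the helper name `ols_line` is only for illustration (it is not part of MLlib).

```python
def ols_line(xs, ys):
    """Closed-form least-squares line through (xs, ys): returns (slope, intercept)."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of the x/y cross deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Case b): the perturbed points.
slope_b, intercept_b = ols_line([0.0, 10.0, 20.0, 30.0], [0.0, 9.0, 22.0, 32.0])

# Case c): the unscaled features around 4000.
slope_c, intercept_c = ols_line([3910.0, 3860.0, 4200.0, 3660.0],
                                [18.9, 17.0, 20.0, 16.6])

print("b) y = %.2fx + (%.2f)" % (slope_b, intercept_b))
print("c) y = %.6fx + (%.6f)" % (slope_c, intercept_c))
```

So four points are enough for a closed-form fit to recover the expected lines; the question is why the SGD-based trainer does not get near them.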