Hi, It appears that the step size is too high that the model is diverging with the added noise. Could you try by setting the step size to be 0.1 or 0.01?
Best, Burak ----- Original Message ----- From: "Krishna Sankar" <ksanka...@gmail.com> To: user@spark.apache.org Sent: Wednesday, October 1, 2014 12:43:20 PM Subject: MLlib Linear Regression Mismatch Guys, Obviously I am doing something wrong. May be 4 points are too small a dataset. Can you help me to figure out why the following doesn't work ? a) This works : data = [ LabeledPoint(0.0, [0.0]), LabeledPoint(10.0, [10.0]), LabeledPoint(20.0, [20.0]), LabeledPoint(30.0, [30.0]) ] lrm = LinearRegressionWithSGD.train(sc.parallelize(data), initialWeights=array([1.0])) print lrm print lrm.weights print lrm.intercept lrm.predict([40]) output: <pyspark.mllib.regression.LinearRegressionModel object at 0x109813d50> [ 1.] 0.0 40.0 b) By perturbing the y a little bit, the model gives wrong results: data = [ LabeledPoint(0.0, [0.0]), LabeledPoint(9.0, [10.0]), LabeledPoint(22.0, [20.0]), LabeledPoint(32.0, [30.0]) ] lrm = LinearRegressionWithSGD.train(sc.parallelize(data), initialWeights=array([1.0])) # should be 1.09x -0.60 print lrm print lrm.weights print lrm.intercept lrm.predict([40]) Output: <pyspark.mllib.regression.LinearRegressionModel object at 0x109666590> [ -8.20487463e+203] 0.0 -3.2819498532740317e+205 c) Same story here - wrong results. Actually nan: data = [ LabeledPoint(18.9, [3910.0]), LabeledPoint(17.0, [3860.0]), LabeledPoint(20.0, [4200.0]), LabeledPoint(16.6, [3660.0]) ] lrm = LinearRegressionWithSGD.train(sc.parallelize(data), initialWeights=array([1.0])) # should be ~ 0.006582x -7.595170 print lrm print lrm.weights print lrm.intercept lrm.predict([4000]) Output:<pyspark.mllib.regression.LinearRegressionModel object at 0x109666b90> [ nan] 0.0 nan Cheers & Thanks <k/> --------------------------------------------------------------------- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org