Thanks Burak. Step size 0.01 worked for b) and step=0.00000001 for c)!

Cheers
<k/>
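
For anyone finding this thread later, here is a minimal plain-Python sketch of why the step size matters for cases a) and b). It is not MLlib's code: `fit_slope` is a hypothetical helper, it uses full-batch gradient descent with no intercept (matching the printed intercept of 0.0), and the step / sqrt(i) decay is an assumption about how the step shrinks per iteration. Because there is no intercept, the sketch's target slope for b) is about 1.064, not the 1.09 of the fit with an intercept.

```python
from math import sqrt


def fit_slope(points, step, iterations=100, w0=1.0):
    """Full-batch gradient descent for y ~ w * x (no intercept).

    Hypothetical helper for illustration only. The step / sqrt(i)
    decay is an assumption, not a copy of MLlib's optimizer.
    """
    w = w0
    n = len(points)
    for i in range(1, iterations + 1):
        # Gradient of the mean of 0.5 * (w*x - y)^2 with respect to w.
        grad = sum((w * x - y) * x for y, x in points) / n
        w -= (step / sqrt(i)) * grad
    return w


# (label, feature) pairs, in the same order as LabeledPoint(label, [feature]).
exact = [(0.0, 0.0), (10.0, 10.0), (20.0, 20.0), (30.0, 30.0)]  # case a)
noisy = [(0.0, 0.0), (9.0, 10.0), (22.0, 20.0), (32.0, 30.0)]   # case b)

w_a = fit_slope(exact, step=1.0)         # gradient is zero at w0, so w never moves
w_b_big = fit_slope(noisy, step=1.0)     # blows up to a huge magnitude
w_b_small = fit_slope(noisy, step=0.01)  # settles near the least-squares slope

print(w_a, w_b_big, w_b_small)
```

In this sketch, case a) only looks fine because the initial weight 1.0 is already the exact answer, so the gradient is zero and the step size never comes into play. For b) the mean of x^2 is 350, so an update is stable only while (step / sqrt(i)) * 350 stays below 2. With step=1.0 that never happens within 100 iterations, while step=0.01 gets there after a few iterations. For c) the features are around 4000, so the mean of x^2 is roughly 1.5e7, which is consistent with needing a step several orders of magnitude smaller.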
On Wed, Oct 1, 2014 at 3:00 PM, Burak Yavuz <bya...@stanford.edu> wrote:

> Hi,
>
> It appears that the step size is too high that the model is diverging with
> the added noise.
> Could you try by setting the step size to be 0.1 or 0.01?
>
> Best,
> Burak
>
> ----- Original Message -----
> From: "Krishna Sankar" <ksanka...@gmail.com>
> To: user@spark.apache.org
> Sent: Wednesday, October 1, 2014 12:43:20 PM
> Subject: MLlib Linear Regression Mismatch
>
> Guys,
> Obviously I am doing something wrong. May be 4 points are too small a
> dataset.
> Can you help me to figure out why the following doesn't work ?
> a) This works :
>
> data = [
> LabeledPoint(0.0, [0.0]),
> LabeledPoint(10.0, [10.0]),
> LabeledPoint(20.0, [20.0]),
> LabeledPoint(30.0, [30.0])
> ]
> lrm = LinearRegressionWithSGD.train(sc.parallelize(data),
> initialWeights=array([1.0]))
> print lrm
> print lrm.weights
> print lrm.intercept
> lrm.predict([40])
>
> output:
> <pyspark.mllib.regression.LinearRegressionModel object at 0x109813d50>
>
> [ 1.]
> 0.0
>
> 40.0
>
> b) By perturbing the y a little bit, the model gives wrong results:
>
> data = [
> LabeledPoint(0.0, [0.0]),
> LabeledPoint(9.0, [10.0]),
> LabeledPoint(22.0, [20.0]),
> LabeledPoint(32.0, [30.0])
> ]
> lrm = LinearRegressionWithSGD.train(sc.parallelize(data),
> initialWeights=array([1.0])) # should be 1.09x -0.60
> print lrm
> print lrm.weights
> print lrm.intercept
> lrm.predict([40])
>
> Output:
> <pyspark.mllib.regression.LinearRegressionModel object at 0x109666590>
>
> [ -8.20487463e+203]
> 0.0
>
> -3.2819498532740317e+205
>
> c) Same story here - wrong results. Actually nan:
>
> data = [
> LabeledPoint(18.9, [3910.0]),
> LabeledPoint(17.0, [3860.0]),
> LabeledPoint(20.0, [4200.0]),
> LabeledPoint(16.6, [3660.0])
> ]
> lrm = LinearRegressionWithSGD.train(sc.parallelize(data),
> initialWeights=array([1.0])) # should be ~ 0.006582x -7.595170
> print lrm
> print lrm.weights
> print lrm.intercept
> lrm.predict([4000])
>
> Output:<pyspark.mllib.regression.LinearRegressionModel object at
> 0x109666b90>
>
> [ nan]
> 0.0
>
> nan
>
> Cheers & Thanks
> <k/>
>