Thanks Sean. Our training MSE is really large. We definitely need better predictor variables.
Training Mean Squared Error = 7.72E8

Thanks,
Manish

On Mon, Mar 6, 2017 at 4:45 PM, Sean Owen <so...@cloudera.com> wrote:

> There's nothing unusual about negative values from a linear regression.
> If, generally, your predicted values are far from your actual values, then
> your model hasn't fit well. You may have a bug somewhere in your pipeline
> or you may have data without much linear relationship. Most of this isn't a
> Spark problem.
>
> On Mon, Mar 6, 2017 at 8:05 AM Manish Maheshwari <mylogi...@gmail.com>
> wrote:
>
>> Hi All,
>>
>> We are using a LinearRegressionModel in Scala. We are using a standard
>> StandardScaler to normalize the data before modelling. The code snippet
>> looks like this -
>>
>> *Modelling - *
>> val labeledPointsRDD = tableRecords.map(row => {
>>   val filtered = row.toSeq.filter({
>>     case s: String => false
>>     case _ => true
>>   })
>>   val converted = filtered.map({
>>     case i: Int => i.toDouble
>>     case l: Long => l.toDouble
>>     case d: Double => d
>>     case _ => 0.0
>>   })
>>   val features = Vectors.dense(converted.slice(1, converted.length).toArray)
>>   LabeledPoint(converted(0), features)
>> })
>> val scaler1 = new StandardScaler().fit(labeledPointsRDD.map(x => x.features))
>> save(sc, scalarModelOutputPath, scaler1)
>> val normalizedData = labeledPointsRDD.map(lp =>
>>   LabeledPoint(lp.label, scaler1.transform(lp.features)))
>> val splits = normalizedData.randomSplit(Array(0.8, 0.2))
>> val trainingData = splits(0)
>> val testingData = splits(1)
>> trainingData.cache()
>> var regression = new LinearRegressionWithSGD().setIntercept(true)
>> regression.optimizer.setStepSize(0.01)
>> val model = regression.run(trainingData)
>> model.save(sc, modelOutputPath)
>>
>> After that, when we score the model on the same data that it was trained
>> on, using the snippet below, we see this -
>>
>> *Scoring - *
>> val labeledPointsRDD = tableRecords.map(row => {
>>   val filtered = row.toSeq.filter({
>>     case s: String => false
>>     case _ => true
>>   })
>>   val converted = filtered.map({
>>     case i: Int => i.toDouble
>>     case l: Long => l.toDouble
>>     case d: Double => d
>>     case _ => 0.0
>>   })
>>   val features = Vectors.dense(converted.toArray)
>>   (row(0), features)
>> })
>> val scaler1 = read(sc, scalarModelOutputPath)
>> val normalizedData = labeledPointsRDD.map(p => (p._1, scaler1.transform(p._2)))
>> normalizedData.cache()
>> val model = LinearRegressionModel.load(sc, modelOutputPath)
>> val valuesAndPreds = normalizedData.map(p =>
>>   (p._1.toString(), model.predict(p._2)))
>>
>> However, a lot of the predicted values are negative. The input data has no
>> negative values, so we are unable to understand this behaviour.
>> Further, the order and sequence of all the variables remain the same in
>> the modelling and testing data frames.
>>
>> Any ideas?
>>
>> Thanks,
>> Manish
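For anyone reading this thread later, here is a minimal sketch of the point that a linear regression can return negative predictions even when no label is negative. It is plain Scala with no Spark dependency. The object name, the helper names and the data values are all made up for illustration and are not taken from the pipeline above. It fits a one-feature least-squares line in closed form, which is not how LinearRegressionWithSGD optimizes, and it also shows how a training MSE of the kind reported in this thread is computed.

```scala
// Illustrative sketch only: hypothetical names and made-up data, no Spark.
object NegativePredictionSketch {

  // Closed-form ordinary least squares for a single feature.
  // Returns (intercept, slope).
  def fit(xs: Seq[Double], ys: Seq[Double]): (Double, Double) = {
    require(xs.nonEmpty && xs.length == ys.length, "inputs must be non-empty and equally sized")
    val meanX = xs.sum / xs.length
    val meanY = ys.sum / ys.length
    val sxy = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val sxx = xs.map(x => (x - meanX) * (x - meanX)).sum
    require(sxx > 0.0, "the feature must not be constant")
    val slope = sxy / sxx
    (meanY - slope * meanX, slope)
  }

  // Prediction of the fitted line at x.
  def predict(model: (Double, Double), x: Double): Double =
    model._1 + model._2 * x

  // Mean squared error of the fitted line on the given points.
  def mse(model: (Double, Double), xs: Seq[Double], ys: Seq[Double]): Double =
    xs.zip(ys).map { case (x, y) =>
      val err = predict(model, x) - y
      err * err
    }.sum / xs.length

  def main(args: Array[String]): Unit = {
    // All labels are >= 0, but they grow faster than linearly in x,
    // so the best-fitting line dips below zero for small x.
    val xs = Seq(1.0, 2.0, 3.0, 4.0, 5.0)
    val ys = Seq(0.0, 0.0, 1.0, 5.0, 9.0)
    val model = fit(xs, ys)
    println(s"intercept = ${model._1}, slope = ${model._2}")
    println(s"prediction at x = 1.0: ${predict(model, 1.0)}")
    println(s"training MSE = ${mse(model, xs, ys)}")
  }
}
```

In this toy data the fitted intercept is negative, so the prediction at the smallest x is below zero although every label is non-negative. Nothing in a linear model constrains its output range. If non-negative predictions are required, one common option is to model a transformed label such as its logarithm, which is a change of technique rather than a fix to the code in this thread.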