There's nothing unusual about negative values from a linear regression. If your predicted values are generally far from your actual values, then your model hasn't fit well; you may have a bug somewhere in your pipeline, or your data may not have much of a linear relationship. Most of this isn't a Spark problem.
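
As a rough, untested sketch of how to check the fit (reusing the testingData and model names from your training snippet below, mllib RDD API), something like this gives a quick read:

// Compare predictions to the true labels on the held-out split.
// testingData and model are the names from your snippet; adapt as needed.
val labelsAndPreds = testingData.map { lp =>
  (lp.label, model.predict(lp.features))
}

// Root mean squared error of the fit.
val rmse = math.sqrt(
  labelsAndPreds.map { case (l, p) => (l - p) * (l - p) }.mean())

// Spread of the labels themselves, for comparison.
val labelStdDev = testingData.map(_.label).stdev()
println(s"RMSE = $rmse, label std dev = $labelStdDev")

If the RMSE is on the order of the label's standard deviation, the model isn't explaining much of the variance, and predictions will scatter around the mean, including below zero, even when every label is non-negative.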
On Mon, Mar 6, 2017 at 8:05 AM Manish Maheshwari <mylogi...@gmail.com> wrote:

> Hi All,
>
> We are using a LinearRegressionModel in Scala. We are using a standard
> StandardScaler to normalize the data before modelling. The code snippet
> looks like this -
>
> *Modelling -*
> val labeledPointsRDD = tableRecords.map(row => {
>   val filtered = row.toSeq.filter({ case s: String => false case _ => true })
>   val converted = filtered.map({ case i: Int => i.toDouble case l: Long => l.toDouble case d: Double => d case _ => 0.0 })
>   val features = Vectors.dense(converted.slice(1, converted.length).toArray)
>   LabeledPoint(converted(0), features)
> })
> val scaler1 = new StandardScaler().fit(labeledPointsRDD.map(x => x.features))
> save(sc, scalarModelOutputPath, scaler1)
> val normalizedData = labeledPointsRDD.map(lp => {LabeledPoint(lp.label, scaler1.transform(lp.features))})
> val splits = normalizedData.randomSplit(Array(0.8, 0.2))
> val trainingData = splits(0)
> val testingData = splits(1)
> trainingData.cache()
> var regression = new LinearRegressionWithSGD().setIntercept(true)
> regression.optimizer.setStepSize(0.01)
> val model = regression.run(trainingData)
> model.save(sc, modelOutputPath)
>
> Post that, when we score the model on the same data that it was trained on
> using the below snippet, we see this -
>
> *Scoring -*
> val labeledPointsRDD = tableRecords.map(row => {
>   val filtered = row.toSeq.filter({ case s: String => false case _ => true })
>   val converted = filtered.map({ case i: Int => i.toDouble case l: Long => l.toDouble case d: Double => d case _ => 0.0 })
>   val features = Vectors.dense(converted.toArray)
>   (row(0), features)
> })
> val scaler1 = read(sc, scalarModelOutputPath)
> val normalizedData = labeledPointsRDD.map(p => (p._1, scaler1.transform(p._2)))
> normalizedData.cache()
> val model = LinearRegressionModel.load(sc, modelOutputPath)
> val valuesAndPreds = normalizedData.map(p => (p._1.toString(), model.predict(p._2)))
>
> However, a lot of the predicted values are negative. The input data has no
> negative values, so we are unable to understand this behaviour.
> Further, the order and sequence of all the variables remains the same in
> the modelling and testing data frames.
>
> Any ideas?
>
> Thanks,
> Manish
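
One pipeline detail that may be worth double-checking (only a guess from the snippets above): the modelling code builds features from converted.slice(1, converted.length), dropping the label column, while the scoring code passes converted.toArray straight into Vectors.dense, which would include the label as a feature and shift every column by one. A minimal sketch of sharing one conversion between the two paths, assuming tableRecords is an RDD[Row] with the label in column 0 as in the quoted code:

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.Row

// Shared conversion so modelling and scoring build identical feature vectors.
// Assumes column 0 is the label, as in the quoted snippets.
def toLabelAndFeatures(row: Row): (Double, Vector) = {
  val filtered = row.toSeq.filter { case _: String => false; case _ => true }
  val converted = filtered.map {
    case i: Int  => i.toDouble
    case l: Long => l.toDouble
    case d: Double => d
    case _ => 0.0
  }
  (converted.head, Vectors.dense(converted.drop(1).toArray))
}

Both the modelling map and the scoring map could then call toLabelAndFeatures and differ only in what they do with the returned label.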