Hi Xin, 2 suggestions:
1) Feature scaling: spark.mllib's LogisticRegressionWithLBFGS uses feature scaling, which scales feature values to have unit standard deviation. That improves optimization behavior, and it often improves statistical estimation (though maybe not for your dataset). However, it effectively changes the model being learned, so you should expect different results from other libraries like R. You could instead use LogisticRegressionWithSGD, which does not do feature scaling. With SGD, you may need to experiment more with the stepSize to get it to converge, but it should be able to learn exactly the same model as R. (There is a rough sketch of this at the bottom of this message.)

2) Convergence: I'd do a sanity check and make sure the algorithm is converging. (Compare with running for more iterations or using a lower convergenceTol; see the second sketch at the bottom of this message.)

Note: If you can use the Spark master branch (or wait for Spark 1.4), then the spark.ml Pipelines API will be a good option. It now has LogisticRegression, which does not do feature scaling, and it uses LBFGS or OWLQN (depending on the regularization type) for optimization. It has also been compared with R in unit tests. (A sketch of that API is at the bottom of this message as well.)

Good luck!
Joseph

On Wed, May 20, 2015 at 3:42 PM, Xin Liu <liuxin...@gmail.com> wrote:
> Hi,
>
> I have tried a few models in MLlib to train a LogisticRegression model.
> However, I consistently get much better results using other libraries such
> as statsmodel (which gives results similar to R's) in terms of AUC. For
> illustration purposes, I used a small dataset (I have tried much bigger data),
> http://www.ats.ucla.edu/stat/data/binary.csv, which is used in
> http://www.ats.ucla.edu/stat/r/dae/logit.htm
>
> Here is the snippet of my usage of LogisticRegressionWithLBFGS.
>
> val algorithm = new LogisticRegressionWithLBFGS
> algorithm.setIntercept(true)
> algorithm.optimizer
>   .setNumIterations(100)
>   .setRegParam(0.01)
>   .setConvergenceTol(1e-5)
> val model = algorithm.run(training)
> model.clearThreshold()
> val scoreAndLabels = test.map { point =>
>   val score = model.predict(point.features)
>   (score, point.label)
> }
> val metrics = new BinaryClassificationMetrics(scoreAndLabels)
> val auROC = metrics.areaUnderROC()
>
> I did a (0.6, 0.4) split for training/test. The response is "admit" and the
> features are "GRE score", "GPA", and "college rank".
>
> Spark:
> Weights (GRE, GPA, Rank):
> [0.0011576276331509304,0.048544858567336854,-0.394202150286076]
> Intercept: -0.6488972641282202
> Area under ROC: 0.6294070512820512
>
> StatsModel:
> Weights [0.0018, 0.7220, -0.3148]
> Intercept: -3.5913
> Area under ROC: 0.69
>
> The weights from statsmodel seem more reasonable if you consider that, for a
> one-unit increase in GPA, the log odds of being admitted to graduate school
> increase by 0.72 in statsmodel versus 0.04 in Spark.
>
> I have seen much bigger differences with other data. So my question is: has
> anyone compared the results with other libraries, and is anything wrong with
> how my code invokes LogisticRegressionWithLBFGS?
>
> As the real data I am processing is pretty big, I really want to use
> Spark to get this to work. Please let me know if you have similar
> experience and how you resolved it.
>
> Thanks,
> Xin
>
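
P.S. Here is a rough sketch of the SGD variant from (1). It reuses the `training` RDD from your snippet; the numIterations and stepSize values are untuned placeholders, and regParam is 0.0 only because statsmodel / R glm fit an unregularized model:

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

// Sketch only: LogisticRegressionWithSGD does not scale features, so with a
// well-chosen stepSize it can recover the same coefficients as an
// unregularized R / statsmodel fit.
val sgd = new LogisticRegressionWithSGD()
sgd.setIntercept(true)
sgd.optimizer
  .setNumIterations(1000)   // SGD typically needs many more iterations than LBFGS
  .setStepSize(0.1)         // placeholder; try a small grid such as 0.01, 0.1, 1.0
  .setRegParam(0.0)         // unregularized, to match statsmodel / R glm
val sgdModel = sgd.run(training)
sgdModel.clearThreshold()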
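
For the convergence check in (2), one simple approach (again just a sketch, reusing the `model` and `training` names from your snippet) is to re-run LBFGS with a larger iteration budget and a tighter tolerance and compare the weights:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

// If these weights differ noticeably from the first run, the original
// settings (100 iterations, tol 1e-5) had not converged yet.
val check = new LogisticRegressionWithLBFGS
check.setIntercept(true)
check.optimizer
  .setNumIterations(500)
  .setRegParam(0.01)
  .setConvergenceTol(1e-8)
val checkModel = check.run(training)
println(s"first run: weights=${model.weights} intercept=${model.intercept}")
println(s"re-run:    weights=${checkModel.weights} intercept=${checkModel.intercept}")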
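
And for the spark.ml Pipelines option in the note, a sketch against the 1.4-era API. It assumes you build a DataFrame (here called trainingDF, a name I made up) with the usual "label" and "features" columns; the parameter values are again illustrative:

import org.apache.spark.ml.classification.LogisticRegression

// trainingDF: a DataFrame with "label" (Double) and "features" (Vector) columns.
val lr = new LogisticRegression()
  .setMaxIter(100)
  .setRegParam(0.0)        // 0.0 to mimic an unregularized R / statsmodel fit
  .setFitIntercept(true)
val lrModel = lr.fit(trainingDF)
println(s"weights=${lrModel.weights} intercept=${lrModel.intercept}")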