Hi Xin, 2 suggestions:
1) Feature scaling: spark.mllib's LogisticRegressionWithLBFGS uses feature scaling, which scales feature values to have unit standard deviation. That improves optimization behavior, and it often improves statistical estimation (though maybe not for your dataset). However, it effectively changes the model being learned, so you should expect different results from other libraries like R. You could instead use LogisticRegressionWithSGD, which does not do feature scaling. With SGD, you may need to experiment more with the stepSize to get it to converge, but it should be able to learn exactly the same model as R. (There is a rough sketch of this at the bottom of this message.)

2) Convergence: I'd do a sanity check and make sure the algorithm is converging. (Compare with running for more iterations or using a lower convergenceTol; see the second sketch at the bottom of this message.)

Note: If you can use the Spark master branch (or wait for Spark 1.4), then the spark.ml Pipelines API will be a good option. It now has LogisticRegression, which does not do feature scaling, and it uses LBFGS or OWLQN (depending on the regularization type) for optimization. It has also been compared with R in unit tests. (A sketch of that API is at the bottom of this message as well.)

Good luck!
Joseph

On Wed, May 20, 2015 at 3:42 PM, Xin Liu <liuxin...@gmail.com> wrote:
> Hi,
>
> I have tried a few models in MLlib to train a LogisticRegression model.
> However, I consistently get much better results using other libraries such
> as statsmodel (which gives results similar to R's) in terms of AUC. For
> illustration purposes, I used a small dataset (I have tried much bigger data),
> http://www.ats.ucla.edu/stat/data/binary.csv, which is used in
> http://www.ats.ucla.edu/stat/r/dae/logit.htm
>
> Here is the snippet of my usage of LogisticRegressionWithLBFGS.
>
> val algorithm = new LogisticRegressionWithLBFGS
> algorithm.setIntercept(true)
> algorithm.optimizer
>   .setNumIterations(100)
>   .setRegParam(0.01)
>   .setConvergenceTol(1e-5)
> val model = algorithm.run(training)
> model.clearThreshold()
> val scoreAndLabels = test.map { point =>
>   val score = model.predict(point.features)
>   (score, point.label)
> }
> val metrics = new BinaryClassificationMetrics(scoreAndLabels)
> val auROC = metrics.areaUnderROC()
>
> I did a (0.6, 0.4) split for training/test. The response is "admit" and the
> features are "GRE score", "GPA", and "college rank".
>
> Spark:
> Weights (GRE, GPA, Rank):
> [0.0011576276331509304,0.048544858567336854,-0.394202150286076]
> Intercept: -0.6488972641282202
> Area under ROC: 0.6294070512820512
>
> StatsModel:
> Weights [0.0018, 0.7220, -0.3148]
> Intercept: -3.5913
> Area under ROC: 0.69
>
> The weights from statsmodel seem more reasonable if you consider that, for a
> one-unit increase in GPA, the log odds of being admitted to graduate school
> increase by 0.72 in statsmodel versus 0.04 in Spark.
>
> I have seen much bigger differences with other data. So my question is: has
> anyone compared the results with other libraries, and is anything wrong with
> how my code invokes LogisticRegressionWithLBFGS?
>
> As the real data I am processing is pretty big, I really want to use
> Spark to get this to work. Please let me know if you have similar
> experience and how you resolved it.
>
> Thanks,
> Xin
>
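
P.S. Here is a rough sketch of the SGD variant from (1). It reuses the `training` RDD from your snippet; the numIterations and stepSize values are untuned placeholders, and regParam is 0.0 only because statsmodel / R glm fit an unregularized model:

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

// Sketch only: LogisticRegressionWithSGD does not scale features, so with a
// well-chosen stepSize it can recover the same coefficients as an
// unregularized R / statsmodel fit.
val sgd = new LogisticRegressionWithSGD()
sgd.setIntercept(true)
sgd.optimizer
  .setNumIterations(1000)   // SGD typically needs many more iterations than LBFGS
  .setStepSize(0.1)         // placeholder; try a small grid such as 0.01, 0.1, 1.0
  .setRegParam(0.0)         // unregularized, to match statsmodel / R glm
val sgdModel = sgd.run(training)
sgdModel.clearThreshold()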
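
For the convergence check in (2), one simple approach (again just a sketch, reusing the `model` and `training` names from your snippet) is to re-run LBFGS with a larger iteration budget and a tighter tolerance and compare the weights:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

// If these weights differ noticeably from the first run, the original
// settings (100 iterations, tol 1e-5) had not converged yet.
val check = new LogisticRegressionWithLBFGS
check.setIntercept(true)
check.optimizer
  .setNumIterations(500)
  .setRegParam(0.01)
  .setConvergenceTol(1e-8)
val checkModel = check.run(training)
println(s"first run: weights=${model.weights} intercept=${model.intercept}")
println(s"re-run:    weights=${checkModel.weights} intercept=${checkModel.intercept}")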
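
And for the spark.ml Pipelines option in the note, a sketch against the 1.4-era API. It assumes you build a DataFrame (here called trainingDF, a name I made up) with the usual "label" and "features" columns; the parameter values are again illustrative:

import org.apache.spark.ml.classification.LogisticRegression

// trainingDF: a DataFrame with "label" (Double) and "features" (Vector) columns.
val lr = new LogisticRegression()
  .setMaxIter(100)
  .setRegParam(0.0)        // 0.0 to mimic an unregularized R / statsmodel fit
  .setFitIntercept(true)
val lrModel = lr.fit(trainingDF)
println(s"weights=${lrModel.weights} intercept=${lrModel.intercept}")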