Great to see results comparable to R in the new ML implementation.
Since the majority of users will still use the old MLlib API, we plan
to call the ML implementation from MLlib so that the intercept is
handled correctly under regularization.

A JIRA has been created:
https://issues.apache.org/jira/browse/SPARK-7780

Sincerely,

DB Tsai
-------------------------------------------------------
Blog: https://www.dbtsai.com


On Fri, May 22, 2015 at 10:45 AM, Xin Liu <liuxin...@gmail.com> wrote:
> Thank you guys for the prompt help.
>
> I ended up building Spark master and verified what DB suggested.
>
> import org.apache.spark.ml.classification.{LogisticRegression => MlLogisticRegression}
> import org.apache.spark.mllib.linalg.Vector
> import org.apache.spark.sql.Row
>
> val lr = new MlLogisticRegression()
>   .setFitIntercept(true)
>   .setMaxIter(35)
>
> val model = lr.fit(sqlContext.createDataFrame(training))
> val scoreAndLabels = model.transform(sqlContext.createDataFrame(test))
>   .select("probability", "label")
>   .map { case Row(probability: Vector, label: Double) =>
>     (probability(1), label) // probability of the positive class
>   }
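>
> The AUC below can then be computed from scoreAndLabels the same way as
> in the mllib snippet further down the thread (a minimal sketch):
>
> import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
>
> // area under the ROC curve from the (score, label) pairs
> val metrics = new BinaryClassificationMetrics(scoreAndLabels)
> println(s"Area under ROC: ${metrics.areaUnderROC()}")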
>
> Without much tuning, the above generates:
>
> Weights: [0.0013971323020715888,0.8559779783186241,-0.5052275562089914]
> Intercept: -3.3076806966913006
> Area under ROC: 0.7033511043412033
>
> I also tried it on a much bigger dataset I have, and its results are
> close to what I get from statsmodels.
>
> Now eagerly awaiting the 1.4 release.
>
> Thanks,
> Xin
>
>
>
> On Wed, May 20, 2015 at 9:37 PM, Chris Gore <cdg...@cdgore.com> wrote:
>>
>> I tried running this data set as described with my own implementation of
>> L2-regularized logistic regression using L-BFGS, to compare:
>> https://github.com/cdgore/fitbox
>>
>> Intercept: -0.886745823033
>> Weights (['gre', 'gpa', 'rank']):[ 0.28862268  0.19402388 -0.36637964]
>> Area under ROC: 0.724056603774
>>
>> The difference could be from the feature preprocessing, as mentioned. I
>> normalized the features around 0, using the training set's mean and
>> standard deviation for both sets:
>>
>> # standardize both sets with statistics computed on the training set only
>> binary_train_normalized = (binary_train - binary_train.mean()) / binary_train.std()
>> binary_test_normalized = (binary_test - binary_train.mean()) / binary_train.std()
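>>
>> In Spark, a rough equivalent (a sketch, assuming training and test are
>> RDD[LabeledPoint] as in your snippet) would use mllib's StandardScaler:
>>
>> import org.apache.spark.mllib.feature.StandardScaler
>> import org.apache.spark.mllib.regression.LabeledPoint
>>
>> // fit the scaler on the training features only, then apply it to both sets
>> val scaler = new StandardScaler(withMean = true, withStd = true)
>>   .fit(training.map(_.features))
>> val trainingScaled = training.map(p => LabeledPoint(p.label, scaler.transform(p.features)))
>> val testScaled = test.map(p => LabeledPoint(p.label, scaler.transform(p.features)))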
>>
>> On a data set this small, the difference in models could also be the
>> result of how the training/test sets were split.
>>
>> Have you tried running k-fold cross-validation on a larger data set?
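>>
>> (Something like this, sketched with MLUtils.kFold and the algorithm/data
>> names from your snippet; untested:)
>>
>> import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
>> import org.apache.spark.mllib.util.MLUtils
>>
>> // 5-fold cross-validation: kFold returns (training, validation) RDD pairs
>> val aucs = MLUtils.kFold(data, numFolds = 5, seed = 42).map {
>>   case (train, validation) =>
>>     val model = algorithm.run(train)
>>     model.clearThreshold()
>>     val scores = validation.map(p => (model.predict(p.features), p.label))
>>     new BinaryClassificationMetrics(scores).areaUnderROC()
>> }
>> println(s"Mean AUC over folds: ${aucs.sum / aucs.length}")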
>>
>> Chris
>>
>> On May 20, 2015, at 6:15 PM, DB Tsai <d...@netflix.com.INVALID> wrote:
>>
>> Hi Xin,
>>
>> If you look at the model you trained, the intercept from Spark is
>> significantly smaller than the one from StatsModels. The intercept
>> represents a prior on the categories in logistic regression, so
>> shrinking it causes the low accuracy of the Spark implementation. In
>> LogisticRegressionWithLBFGS, the intercept is regularized due to the
>> implementation of Updater, but the intercept should not be regularized.
>>
>> In the new pipeline API, a logistic regression with elastic-net
>> regularization is implemented, and the intercept is handled properly:
>>
>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
>>
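>> (For example, a minimal sketch of the new API with elastic net; the
>> parameter names are from the class linked above:)
>>
>> import org.apache.spark.ml.classification.LogisticRegression
>>
>> val lor = new LogisticRegression()
>>   .setFitIntercept(true)     // the intercept is excluded from the penalty
>>   .setRegParam(0.01)         // overall regularization strength (lambda)
>>   .setElasticNetParam(0.5)   // mixing parameter: 0.0 = pure L2, 1.0 = pure L1
>> val model = lor.fit(sqlContext.createDataFrame(training))
>>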
>> As you can see in the tests,
>>
>> https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
>> the results now exactly match those from R.
>>
>> BTW, in both versions the features are scaled before training: we train
>> the model in the scaled space and then transform the weights back to the
>> original space. The only difference is that in the mllib version,
>> LogisticRegressionWithLBFGS regularizes the intercept, while in the ml
>> version the intercept is excluded from regularization. As a result, if
>> lambda is zero, the models should be the same.
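>>
>> (A quick way to check this, sketched and untested: fit both with the
>> regularization parameter set to zero and compare the weights.)
>>
>> import org.apache.spark.ml.classification.{LogisticRegression => MlLogisticRegression}
>> import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
>>
>> // mllib version: with lambda = 0 the intercept penalty has no effect
>> val mllibAlg = new LogisticRegressionWithLBFGS
>> mllibAlg.setIntercept(true)
>> mllibAlg.optimizer.setRegParam(0.0)
>> val mllibModel = mllibAlg.run(training)
>>
>> // ml version with the same settings
>> val mlModel = new MlLogisticRegression()
>>   .setFitIntercept(true)
>>   .setRegParam(0.0)
>>   .fit(sqlContext.createDataFrame(training))
>>
>> // the two weight vectors should agree up to convergence tolerance
>> println(mllibModel.weights)
>> println(mlModel.weights)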
>>
>>
>>
>> On Wed, May 20, 2015 at 3:42 PM, Xin Liu <liuxin...@gmail.com> wrote:
>>
>> Hi,
>>
>> I have tried a few models in MLlib to train a logistic regression model.
>> However, I consistently get much better results, in terms of AUC, from
>> other libraries such as statsmodels (which gives results similar to R).
>> For illustration purposes, I used a small data set (I have also tried
>> much bigger data): http://www.ats.ucla.edu/stat/data/binary.csv from
>> http://www.ats.ucla.edu/stat/r/dae/logit.htm
>>
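>> (For reference, a sketch of how the data can be loaded, assuming the
>> column order admit,gre,gpa,rank from the UCLA page and skipping the
>> header row:)
>>
>> import org.apache.spark.mllib.linalg.Vectors
>> import org.apache.spark.mllib.regression.LabeledPoint
>>
>> val raw = sc.textFile("binary.csv")
>> val header = raw.first()
>> val data = raw.filter(_ != header).map { line =>
>>   val cols = line.split(',').map(_.toDouble)
>>   LabeledPoint(cols(0), Vectors.dense(cols(1), cols(2), cols(3))) // label = admit
>> }
>> val Array(training, test) = data.randomSplit(Array(0.6, 0.4), seed = 11L)
>>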
>> Here is the snippet of my usage of LogisticRegressionWithLBFGS.
>>
>> import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
>> import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
>>
>> val algorithm = new LogisticRegressionWithLBFGS
>> algorithm.setIntercept(true)
>> algorithm.optimizer
>>   .setNumIterations(100)
>>   .setRegParam(0.01)
>>   .setConvergenceTol(1e-5)
>> val model = algorithm.run(training)
>> model.clearThreshold() // emit raw probabilities instead of 0/1 predictions
>> val scoreAndLabels = test.map { point =>
>>   val score = model.predict(point.features)
>>   (score, point.label)
>> }
>> val metrics = new BinaryClassificationMetrics(scoreAndLabels)
>> val auROC = metrics.areaUnderROC()
>>
>> I did a (0.6, 0.4) split for training/test. The response is "admit" and
>> the features are "GRE score", "GPA", and "college rank".
>>
>> Spark:
>> Weights (GRE, GPA, Rank):
>> [0.0011576276331509304,0.048544858567336854,-0.394202150286076]
>> Intercept: -0.6488972641282202
>> Area under ROC: 0.6294070512820512
>>
>> StatsModel:
>> Weights [0.0018, 0.7220, -0.3148]
>> Intercept: -3.5913
>> Area under ROC: 0.69
>>
>> The weights from statsmodels seem more reasonable: for a one-unit
>> increase in GPA, the log odds of being admitted to graduate school
>> increase by 0.72 according to statsmodels (an odds ratio of exp(0.72),
>> about 2.1), versus only 0.04 in Spark.
>>
>> I have seen much bigger differences with other data. So my question is:
>> has anyone compared the results with other libraries, and is anything
>> wrong with how my code invokes LogisticRegressionWithLBFGS?
>>
>> The real data I am processing is pretty big, and I really want to use
>> Spark to make this work. Please let me know if you have had a similar
>> experience and how you resolved it.
>>
>> Thanks,
>> Xin
>>
>>
>>
>
