regParam=1.0 may penalize too much, because we use the average loss
instead of total loss. I just sent a PR to lower the default:
https://github.com/apache/spark/pull/3232
You can try LogisticRegressionWithLBFGS (and configure parameters
through its optimizer), which should converge faster than SGD.
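Roughly, configuring through the optimizer looks like this. This is only a minimal sketch: `training` is a placeholder for your RDD[LabeledPoint], and the regParam of 0.01 is just an illustrative value, not necessarily what the PR changes the default to.

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def trainLBFGS(training: RDD[LabeledPoint]) = {
  val lr = new LogisticRegressionWithLBFGS()
  // Regularization and iteration count are set on the underlying LBFGS optimizer.
  lr.optimizer
    .setRegParam(0.01)
    .setNumIterations(100)
  lr.run(training)
}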
OK, it's not class imbalance. Yes, 100 iterations.
My other guess is that the stepSize of 1 is way too big for your data.
I'd suggest you look at the weights / intercept of the resulting model to
see if it makes any sense.
You can call clearThreshold on the model, and then it will 'predict' the
raw score instead of a 0/1 label.
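Something along these lines, as a rough sketch only: the stepSize of 0.1 and the RDD names `training` / `test` are stand-ins for your own data, not recommended settings.

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def inspectModel(training: RDD[LabeledPoint], test: RDD[Vector]): Unit = {
  val lr = new LogisticRegressionWithSGD()
  lr.optimizer
    .setNumIterations(100)
    .setStepSize(0.1)   // try something smaller than 1
  val model = lr.run(training)

  // Sanity-check the fitted coefficients.
  println(s"weights = ${model.weights}, intercept = ${model.intercept}")

  // With the threshold cleared, predict returns raw scores rather than 0/1 labels.
  model.clearThreshold()
  test.take(5).foreach(v => println(model.predict(v)))
}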
Sean,
Thanks a lot for your reply!
A few follow up questions:
1. numIterations should be 100, not 100*trainingSetSize, right?
2. My training set has 90k positive data points (with label 1) and 60k
negative data points (with label 0).
I set numIterations to 100 (the default). I still got the same result.
I think you need to use setIntercept(true) to get it to allow a non-zero
intercept. I also kind of agree that's neither obvious nor the intuitive default.
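Concretely, something like this (sketch only; `training` stands in for your RDD[LabeledPoint]):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def trainWithIntercept(training: RDD[LabeledPoint]) = {
  val lr = new LogisticRegressionWithSGD()
  lr.setIntercept(true)   // intercept fitting is off unless explicitly enabled
  lr.run(training)
}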
Is your data set highly imbalanced, with lots of positive examples? That
could explain why predictions are heavily skewed.
Iterations should definitely