Re: SVMWithSGD default threshold

2014-11-12 Thread Xiangrui Meng
regParam=1.0 may penalize too much, because we use the average loss instead of the total loss. I just sent a PR to lower the default: https://github.com/apache/spark/pull/3232 You can try LogisticRegressionWithLBFGS (and configure parameters through its optimizer), which should converge faster than SGD.
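The point about average vs. total loss can be seen with a toy model. The following is a minimal sketch (plain Python, not Spark code, and with the regularizer simplified to `reg * w^2`): 1-D ridge regression has a closed form, so it is easy to see that the same `reg` value shrinks the weight much harder when the data term is averaged over n points than when it is summed.

```python
# Objectives compared:
#   average-loss: (1/n) * sum((w*x_i - y_i)^2) + reg * w^2
#   total-loss:          sum((w*x_i - y_i)^2) + reg * w^2

def ridge_avg(xs, ys, reg):
    """Closed-form minimizer of the averaged objective."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return (sxy / n) / (sxx / n + reg)

def ridge_total(xs, ys, reg):
    """Closed-form minimizer of the summed objective."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + reg)

xs = [0.5, 1.0, 1.5, 2.0]
ys = [2 * x for x in xs]          # noiseless data, true weight is 2.0

print(ridge_avg(xs, ys, 1.0))     # ~1.30: reg=1.0 shrinks w strongly
print(ridge_total(xs, ys, 1.0))   # ~1.76: same reg value penalizes less
print(ridge_avg(xs, ys, 0.01))    # ~1.99: a small default barely biases w
```

With the averaged objective the regularizer's relative weight does not shrink as n grows, which is why a default of 1.0 can dominate the fit on any dataset size.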

Re: SVMWithSGD default threshold

2014-11-12 Thread Sean Owen
OK, it's not class imbalance. Yes, 100 iterations. My other guess is that the stepSize of 1 is way too big for your data. I'd suggest you look at the weights / intercept of the resulting model to see if it makes any sense. You can call clearThreshold on the model, and then it will 'predict' the raw SVM margin instead of a 0/1 label.
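To see why a too-large step size produces garbage, here is a minimal sketch (plain Python, not Spark's GradientDescent) of fixed-step gradient descent on the toy objective f(w) = (w - 3)^2, where any step above 1.0 makes the iterates diverge:

```python
def gradient_descent(step, iters=50, w=0.0):
    """Fixed-step gradient descent on f(w) = (w - 3)^2."""
    for _ in range(iters):
        w -= step * 2 * (w - 3)   # f'(w) = 2*(w - 3)
    return w

w_small = gradient_descent(step=0.1)   # converges to the minimum at 3
w_large = gradient_descent(step=1.1)   # |w - 3| grows by 1.2x every step

print(w_small)            # ~3.0
print(abs(w_large - 3))   # enormous: the iterates blow up
```

The same failure mode on real data shows up as weights that look nonsensical or predictions stuck on one class, which is why inspecting the fitted weights and intercept is a good first diagnostic.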

Re: SVMWithSGD default threshold

2014-11-12 Thread Caron
Sean, thanks a lot for your reply! A few follow-up questions: 1. numIterations should be 100, not 100*trainingSetSize, right? 2. My training set has 90k positive data points (label 1) and 60k negative data points (label 0). I set numIterations to 100, the default. I still got the same result.

Re: SVMWithSGD default threshold

2014-11-11 Thread Sean Owen
I think you need to use setIntercept(true) to get it to allow a non-zero intercept. I also agree that's not an obvious or intuitive default. Is your data set highly imbalanced, with lots of positive examples? That could explain why predictions are heavily skewed. Iterations should definitely be on the order of 100; you shouldn't need to scale it with the training set size.
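The intercept point has a simple geometric reading: without setIntercept(true) the decision boundary is forced through the origin, which badly skews any model whose data is offset from zero. A minimal sketch (plain 1-D least squares in Python, not Spark code; the same geometry applies to an SVM hyperplane forced through the origin):

```python
def fit_no_intercept(xs, ys):
    """Best line through the origin (slope only)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / sxx

def fit_with_intercept(xs, ys):
    """Ordinary least squares: slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

xs = [1.0, 2.0, 3.0, 4.0]
ys = [x + 5 for x in xs]            # y = x + 5: data offset from the origin

def sse(preds):
    return sum((p - y) ** 2 for p, y in zip(preds, ys))

w = fit_no_intercept(xs, ys)        # ~2.67: slope distorted to reach the data
slope, b = fit_with_intercept(xs, ys)  # 1.0 and 5.0: exact recovery

print(sse([w * x for x in xs]))            # large residual error
print(sse([slope * x + b for x in xs]))    # ~0.0
```

The distorted slope in the no-intercept fit is the same effect that makes a through-the-origin SVM classify almost everything on one side.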