Hi all,

As I understand it, with feature scaling the optimization algorithm converges faster. I have a question about applying scaling multiple times. I know that applying standard scaling repeatedly makes no difference, but if I want to try MinMax scaling, would it be weird to apply standard scaling again on top of it before training?
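To make the question concrete, here is a small plain-Scala sketch (not an MLlib API; the column values are made up) of the two transforms applied in sequence to one feature column:

```scala
// Plain-Scala sketch (not an MLlib API) of min-max scaling followed by standard
// scaling on a single feature column.  Values are made up for illustration.
object ScalingSketch {
  // Rescale to [0, 1]; assumes the column is not constant (max > min).
  def minMax(xs: Array[Double]): Array[Double] = {
    val (lo, hi) = (xs.min, xs.max)
    xs.map(x => (x - lo) / (hi - lo))
  }

  // Subtract the mean and divide by the (population) standard deviation; assumes std > 0.
  def standardize(xs: Array[Double]): Array[Double] = {
    val mean = xs.sum / xs.length
    val std  = math.sqrt(xs.map(x => (x - mean) * (x - mean)).sum / xs.length)
    xs.map(x => (x - mean) / std)
  }

  def main(args: Array[String]): Unit = {
    val col = Array(1.0, 5.0, 9.0, 13.0)
    // Min-max scaling is an affine map, so standardizing afterwards prints the
    // same values as standardizing the raw column directly.
    println(standardize(col).mkString(", "))
    println(standardize(minMax(col)).mkString(", "))
  }
}
```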
Another small issue we hit in Spark 1.0 is that the loss computation in LogisticGradient might overflow (e.g. log1p(exp(margin)) can become infinity). The issue disappears when the features are scaled, so if we decide to accept unscaled input data, it should be handled properly. We have been using a softmax-style formulation instead of the naive log1p to fix it (a sketch of one stable form, for illustration, is appended after the quoted thread below).

By the way, I can contribute our MinMaxScaler implementation if it is useful for others.

Best,
Shaocun Tian

On Thu, Nov 27, 2014 at 8:08 AM, DB Tsai wrote:

> Hi Yanbo,
>
> As Xiangrui said, the feature scaling in the training step is transparent to users,
> and in theory, with or without feature scaling, the optimization should converge to
> the same solution after transforming back to the original space.
>
> In short, we do the training in the scaled space and get the weights in the scaled
> space. Then we transform the weights back to the original space, so it's transparent
> to users.
>
> The GLMNET package in R does the same thing, and I think we should do it instead of
> asking users to do it with the pipeline API, since not all users know this stuff.
>
> Also, in the GLMNET package there are different strategies for feature scaling in
> linear regression and logistic regression; as a result, we don't want to naively make
> it a public API without addressing the different use cases.
>
> Sincerely,
>
> DB Tsai
> -------------------------------------------------------
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Wed, Nov 26, 2014 at 12:06 PM, Xiangrui Meng <[hidden email]> wrote:
> > Hi Yanbo,
> >
> > We scale the model coefficients back after training, so scaling in prediction is
> > not necessary.
> >
> > We had some discussion about this. I'd like to treat feature scaling as part of
> > the feature transformation and recommend that users apply feature scaling before
> > training. It is a cleaner solution to me, and it is easy with the new pipeline API.
> > DB (cc'ed) recommends embedding feature scaling in the linear methods because it
> > generally leads to better conditioning, which is also valid. Feel free to create a
> > JIRA and we can have the discussion there.
> >
> > Best,
> > Xiangrui
> >
> > On Wed, Nov 26, 2014 at 1:39 AM, Yanbo Liang <[hidden email]> wrote:
> >> Hi All,
> >>
> >> LogisticRegressionWithLBFGS sets useFeatureScaling to true by default, which can
> >> improve convergence during optimization. However, other training methods such as
> >> LogisticRegressionWithSGD do not set useFeatureScaling to true by default, and the
> >> corresponding setter is private to the mllib scope, so users cannot set it.
> >>
> >> The default configuration will cause a mismatch between training and prediction.
> >> Suppose users prepare input data for the training set and the prediction set in
> >> the same format, then run model training with LogisticRegressionWithLBFGS followed
> >> by prediction. They do not know that feature scaling is applied in the training
> >> step but not in the prediction step, so at prediction time the model is applied to
> >> a dataset whose scale is not consistent with the training step.
> >>
> >> Should we make the setFeatureScaling function public and change the default value
> >> to false?
> >> I think it is clearer and more comprehensive to do feature scaling and
> >> normalization in a preprocessing step of the machine learning pipeline.
> >> If this proposal is OK, I will file a JIRA to track it.
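As noted above, here is a sketch for illustration only (plain Scala, not the actual LogisticGradient code) of how the naive log1p(exp(margin)) term overflows for large margins, and a log-sum-exp style rewrite of the same quantity that stays finite:

```scala
// Sketch only (not the actual LogisticGradient code): the naive binary log-loss term
// log(1 + exp(margin)) versus a numerically stable rewrite of the same quantity.
object StableLogLoss {
  // Naive form: exp(margin) overflows to Infinity once margin exceeds roughly 709.
  def naiveLoss(margin: Double): Double = math.log1p(math.exp(margin))

  // Stable form, using the identity log(1 + exp(m)) = max(m, 0) + log1p(exp(-|m|)).
  def stableLoss(margin: Double): Double =
    math.max(margin, 0.0) + math.log1p(math.exp(-math.abs(margin)))

  def main(args: Array[String]): Unit = {
    val margin = 1000.0            // the kind of margin an unscaled feature can produce
    println(naiveLoss(margin))     // Infinity
    println(stableLoss(margin))    // 1000.0
  }
}
```

With unscaled features the margin can easily exceed the ~709 threshold at which exp overflows a double, which is where we saw the infinite loss.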
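Regarding the back-transformation described in the quoted message above, a sketch for illustration (not the actual MLlib code), assuming standardization of the form x_i' = (x_i - mu_i) / sigma_i with sigma_i > 0:

```scala
// Hedged illustration: map weights learned on standardized features back to the
// original feature space.  If the model in the scaled space is  w' . x' + b'  with
// x_i' = (x_i - mu_i) / sigma_i, the equivalent original-space model is
//   w_i = w_i' / sigma_i   and   b = b' - sum_i w_i' * mu_i / sigma_i
// because  w' . x' + b' = sum_i w_i' * (x_i - mu_i) / sigma_i + b' = w . x + b.
object UnscaleWeights {
  def unscale(wScaled: Array[Double], intercept: Double,
              mu: Array[Double], sigma: Array[Double]): (Array[Double], Double) = {
    val w = wScaled.indices.map(i => wScaled(i) / sigma(i)).toArray
    val b = intercept - wScaled.indices.map(i => wScaled(i) * mu(i) / sigma(i)).sum
    (w, b)
  }

  def main(args: Array[String]): Unit = {
    // Made-up weights, intercept, means, and standard deviations for illustration.
    val (w, b) = unscale(Array(0.5, -1.2), 0.3, Array(10.0, 2.0), Array(4.0, 0.5))
    println(w.mkString(", ") + "  intercept: " + b)
  }
}
```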