Hi Dhar,

Disabling the `standardization` feature has just been merged into master.
https://github.com/apache/spark/commit/57221934e0376e5bb8421dc35d4bf91db4deeca1

Let us know your feedback. Thanks.

Sincerely,

DB Tsai
----------------------------------------------------------
Blog: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D

On Tue, Jun 16, 2015 at 9:11 PM, Dhar Sauptik (CR/RTC1.3-NA)
<sauptik.d...@us.bosch.com> wrote:
> Hi DB,
>
> That will work too. I was just suggesting that, since standardization is a
> simple operation, it could have been performed explicitly by the user.
>
> Thank you for the replies.
>
> -Sauptik.
>
> -----Original Message-----
> From: DB Tsai [mailto:dbt...@dbtsai.com]
> Sent: Tuesday, June 16, 2015 9:04 PM
> To: Dhar Sauptik (CR/RTC1.3-NA)
> Cc: Ramakrishnan Naveen (CR/RTC1.3-NA); user@spark.apache.org
> Subject: Re: FW: MLLIB (Spark) Question.
>
> Hi Dhar,
>
> For "standardization", we can disable it effectively by using a
> different regularization strength on each component. Thus, we solve the
> same problem while achieving a better rate of convergence. This is one
> of the features I will implement.
>
> Sincerely,
>
> DB Tsai
> ----------------------------------------------------------
> Blog: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D
>
>
> On Tue, Jun 16, 2015 at 8:34 PM, Dhar Sauptik (CR/RTC1.3-NA)
> <sauptik.d...@us.bosch.com> wrote:
>> Hi DB,
>>
>> Thank you for the reply. The answers make sense. I have just one more
>> point to add.
>>
>> Note that it may be better not to standardize the data implicitly. Agreed
>> that a number of algorithms benefit from such standardization, but for many
>> applications with contextual information such standardization "may" not be
>> desirable. Users can always perform the standardization themselves.
>>
>> However, that's just a suggestion. Again, thank you for the clarification.
>>
>> Thanks,
>> Sauptik.
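[Editor's note] The explicit standardization Sauptik suggests users could perform themselves is just column-wise z-scoring. A minimal plain-Python sketch of that operation (a hypothetical helper, not MLlib code):

```python
def standardize(rows):
    """Column-wise z-scoring: (x - mean) / std for each feature column.

    `rows` is a list of equal-length feature lists. Columns with zero
    variance are centered but left unscaled to avoid division by zero.
    Returns the standardized rows plus the per-column means and stds,
    which are needed later to map fitted weights back to the raw scale.
    """
    n = len(rows)
    dim = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    stds = []
    for j in range(dim):
        var = sum((r[j] - means[j]) ** 2 for r in rows) / n
        stds.append(var ** 0.5)
    out = []
    for r in rows:
        out.append([(r[j] - means[j]) / stds[j] if stds[j] > 0
                    else r[j] - means[j]
                    for j in range(dim)])
    return out, means, stds
```

After this transform each non-constant column has mean 0 and unit variance, so an isotropic L2 penalty treats all features on the same footing.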
>>
>>
>> -----Original Message-----
>> From: DB Tsai [mailto:dbt...@dbtsai.com]
>> Sent: Tuesday, June 16, 2015 2:49 PM
>> To: Dhar Sauptik (CR/RTC1.3-NA); Ramakrishnan Naveen (CR/RTC1.3-NA)
>> Cc: user@spark.apache.org
>> Subject: Re: FW: MLLIB (Spark) Question.
>>
>> +cc user@spark.apache.org
>>
>> Reply inline.
>>
>> On Tue, Jun 16, 2015 at 2:31 PM, Dhar Sauptik (CR/RTC1.3-NA)
>> <Sauptik.Dhar> wrote:
>>> Hi DB,
>>>
>>> Thank you for the reply. That explains a lot.
>>>
>>> I do, however, have a few points regarding this:
>>>
>>> 1. Just to help with the debate on not regularizing the b parameter: a
>>> standard reference argues against regularizing the b parameter. See pg. 64,
>>> para. 1: http://statweb.stanford.edu/~tibs/ElemStatLearn/
>>>
>>
>> Agreed. We were just worried that it would change behavior, but we
>> actually have a PR to change the behavior to the standard one:
>> https://github.com/apache/spark/pull/6386
>>
>>> 2. Further, is the regularization of b also applied in the SGD
>>> implementation? Currently the SGD and LBFGS implementations give different
>>> results (and neither implementation matches the IRLS algorithm). Are SGD
>>> and LBFGS implemented for different loss functions? Can you please share
>>> your thoughts on this?
>>>
>>
>> In the SGD implementation, we don't "standardize" the dataset before
>> training. As a result, columns with low standard deviation are
>> penalized more, and those with high standard deviation are penalized
>> less. Also, "standardizing" helps the rate of convergence. As a result,
>> most packages "standardize" the data implicitly, obtain the weights in
>> the "standardized" space, and transform them back to the original
>> space, so it is transparent to users.
>>
>> 1) LORWithSGD: no standardization, and penalizes the intercept.
>> 2) LORWithLBFGS: with standardization, but penalizes the intercept.
>> 3) New LOR implementation: with standardization, without penalizing the
>> intercept.
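[Editor's note] The back-transformation DB describes (fit in standardized space, report weights on the original scale) follows from expanding the standardized score. A minimal sketch under that assumption (hypothetical helper names, not Spark's API):

```python
def to_original_space(w_std, b_std, means, stds):
    """Map weights fitted on standardized features back to the raw scale.

    The standardized model scores
        z = sum_j w_std[j] * (x[j] - mu[j]) / s[j] + b_std,
    which expands to
        z = sum_j (w_std[j] / s[j]) * x[j]
            + (b_std - sum_j w_std[j] * mu[j] / s[j]),
    so the raw-scale weights and intercept are read off term by term.
    Assumes every std is nonzero (constant columns need special handling).
    """
    w_orig = [w / s for w, s in zip(w_std, stds)]
    b_orig = b_std - sum(w * m / s for w, m, s in zip(w_std, means, stds))
    return w_orig, b_orig
```

Because this is an exact algebraic identity, both parameterizations assign the same score to every point; only the optimizer's view of the problem (and hence the effect of an isotropic penalty) differs.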
>>
>> As a result, only the new implementation in Spark ML handles
>> everything correctly. We have tests verifying that the results match
>> R.
>>
>>>
>>> @Naveen: Please feel free to add to or comment on the above points as
>>> you see necessary.
>>>
>>> Thanks,
>>> Sauptik.
>>>
>>> -----Original Message-----
>>> From: DB Tsai
>>> Sent: Tuesday, June 16, 2015 2:08 PM
>>> To: Ramakrishnan Naveen (CR/RTC1.3-NA)
>>> Cc: Dhar Sauptik (CR/RTC1.3-NA)
>>> Subject: Re: FW: MLLIB (Spark) Question.
>>>
>>> Hey,
>>>
>>> In the LORWithLBFGS API you use, the intercept is regularized, while
>>> the other implementations don't regularize the intercept. That's why
>>> you see the difference.
>>>
>>> The intercept should not be regularized, so we fixed this in the new
>>> Spark ML API in Spark 1.4. Since not regularizing the intercept would
>>> change the behavior of the old API, we are still debating whether to
>>> change it in the old version.
>>>
>>> See the following code for a full running example in Spark 1.4:
>>> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/LogisticRegressionExample.scala
>>>
>>> And also check out my talk at Spark Summit:
>>> http://www.slideshare.net/dbtsai/2015-06-largescale-lasso-and-elasticnet-regularized-generalized-linear-models-at-spark-summit
>>>
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> ----------------------------------------------------------
>>> Blog: https://www.dbtsai.com
>>> PGP Key ID: 0xAF08DF8D
>>>
>>>
>>> On Mon, Jun 15, 2015 at 11:58 AM, Ramakrishnan Naveen (CR/RTC1.3-NA)
>>> <Naveen.Ramakrishnan> wrote:
>>>> Hi DB,
>>>> Hope you are doing well! One of my colleagues, Sauptik, is working with
>>>> MLlib and the logistic regression based on LBFGS, and is having trouble
>>>> reproducing the same results when compared to MATLAB. Please see below
>>>> for details.
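[Editor's note] The fix DB describes, leaving the intercept out of the L2 penalty, amounts to one line in the update rule. A minimal gradient-descent sketch under that convention (plain Python, not Spark's LBFGS-based implementation):

```python
import math

def fit_logreg_l2(rows, labels, lam, steps=5000, lr=0.1):
    """Batch gradient descent for L2-regularized logistic regression.

    The penalty lam/2 * ||w||^2 covers the feature weights only; the
    intercept b is deliberately excluded, matching the convention DB
    describes (and ESL pg. 64). Labels are 0/1; the loss is the mean
    negative log-likelihood plus the penalty.
    """
    n, dim = len(rows), len(rows[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(steps):
        gw = [0.0] * dim
        gb = 0.0
        for x, y in zip(rows, labels):
            z = sum(wj * xj for wj, xj in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))       # predicted probability
            for j in range(dim):
                gw[j] += (p - y) * x[j] / n
            gb += (p - y) / n
        for j in range(dim):
            w[j] -= lr * (gw[j] + lam * w[j])    # penalty applied to w only
        b -= lr * gb                             # no penalty on the intercept
    return w, b
```

Penalizing b as well would shrink the decision boundary toward the origin, which is why LORWithSGD and LORWithLBFGS disagree with unpenalized-intercept solvers such as IRLS or glmnet-style fits.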
>>>> I did take a look into this, but it seems there is also a discrepancy
>>>> between the logistic regression SGD and LBFGS implementations in MLlib.
>>>> We have attached all the code for your analysis; it's in PySpark though.
>>>> Let us know if you have any questions or concerns. We would very much
>>>> appreciate your help whenever you get a chance.
>>>>
>>>> Best,
>>>> Naveen.
>>>>
>>>> _____________________________________________
>>>> From: Dhar Sauptik (CR/RTC1.3-NA)
>>>> Sent: Thursday, June 11, 2015 6:03 PM
>>>> To: Ramakrishnan Naveen (CR/RTC1.3-NA)
>>>> Subject: MLLIB (Spark) Question.
>>>>
>>>>
>>>> Hi Naveen,
>>>>
>>>> I am writing this owing to some MLlib issues I found while using logistic
>>>> regression. Basically, I am trying to test the stability of the L1/L2
>>>> logistic regression using SGD and BFGS. Unfortunately, I am unable to
>>>> confirm the correctness of the algorithms. For comparison, I implemented
>>>> the L2 logistic regression algorithm (using the IRLS algorithm, pg. 121)
>>>> from the book
>>>> http://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf
>>>> .
>>>> Unfortunately, the solutions don't match.
>>>>
>>>> For example, using the publicly available data (diabetes.csv) for L2
>>>> regularized logistic regression (with lambda = 0.1) we get:
>>>>
>>>> Solutions
>>>>
>>>> MATLAB CODE (IRLS):
>>>>
>>>> w = 0.294293470805555
>>>>     0.550681766045083
>>>>     0.0396336870148899
>>>>     0.0641285712055971
>>>>     0.101238592147879
>>>>     0.261153541551578
>>>>     0.178686710290069
>>>>
>>>> b = -0.347396594061553
>>>>
>>>>
>>>> MLLIB (SGD):
>>>> (weights=[0.352873922589,0.420391294105,0.0100571908041,0.150724951988,0.238536959009,0.220329295188,0.269139932714],
>>>> intercept=-0.00749988882664631)
>>>>
>>>>
>>>> MLLIB (LBFGS):
>>>> (weights=[0.787850211605,1.964589985,-0.209348425939,0.0278848173986,0.12729017522,1.58954647312,0.692671824394],
>>>> intercept=-0.027401869113912316)
>>>>
>>>>
>>>> All the code is attached to the email.
>>>>
>>>> Apparently the solutions are quite far from the optimum (and even from
>>>> each other)! Can you please check with DB Tsai on the reasons for such
>>>> differences? Note that all additional parameters are described in the
>>>> source code.
>>>>
>>>>
>>>> Thanks,
>>>> Best regards / Mit freundlichen Grüßen,
>>>>
>>>> Sauptik Dhar, Ph.D.
>>>> CR/RTC1.3-NA
>>>>
>>>>
>>
>> Sincerely,
>>
>> DB Tsai
>> ----------------------------------------------------------
>> Blog: https://www.dbtsai.com
>> PGP Key ID: 0xAF08DF8D

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
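[Editor's note] The IRLS procedure Sauptik used as his reference (ESL, pg. 121) is Newton's method with a weighted least-squares step. A minimal plain-Python sketch with the intercept excluded from the L2 penalty; this is an illustrative reconstruction, not the MATLAB code attached to the thread, and the penalty here is applied to the summed (unnormalized) log-likelihood:

```python
import math

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def irls_logreg(rows, labels, lam, iters=25):
    """IRLS (Newton) for L2-regularized logistic regression.

    Column 0 of the design matrix is the intercept and is excluded from
    the penalty, so the result is comparable to unpenalized-intercept
    solvers. Each iteration solves (X^T W X + lam*I) step = gradient.
    """
    X = [[1.0] + list(r) for r in rows]          # prepend intercept column
    n, d = len(X), len(X[0])
    beta = [0.0] * d
    for _ in range(iters):
        p = [1.0 / (1.0 + math.exp(-sum(b * x for b, x in zip(beta, row))))
             for row in X]
        # Penalized negative Hessian: X^T W X + lam*I, W_ii = p_i(1-p_i);
        # no penalty term on the intercept entry H[0][0].
        H = [[sum(X[i][a] * p[i] * (1 - p[i]) * X[i][c] for i in range(n))
              for c in range(d)] for a in range(d)]
        for j in range(1, d):
            H[j][j] += lam
        # Gradient of the penalized log-likelihood (intercept unpenalized)
        g = [sum(X[i][a] * (labels[i] - p[i]) for i in range(n))
             for a in range(d)]
        for j in range(1, d):
            g[j] -= lam * beta[j]
        step = solve(H, g)
        beta = [b + s for b, s in zip(beta, step)]
    return beta
```

With all three conventions in play (penalized vs. unpenalized intercept, standardized vs. raw features, and differing lambda scaling), the three sets of diabetes.csv weights above can differ substantially even when each solver has converged on its own objective.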