Hi Tsai,

Thank you for pointing out the implementation details that I missed. Yes, I saw several JIRA issues about the intercept, regularization, and standardization; I just didn't realize they made such a big impact. Thanks again.

2015-10-13 4:32 GMT+08:00 DB Tsai <dbt...@dbtsai.com>:
> Hi Liu,
>
> In ML, even after extracting the data into an RDD, the versions in MLlib
> and ML are quite different. Due to legacy design, in MLlib we use Updater
> for handling regularization, and this layer of abstraction also does
> adaptive step size, which is only for SGD. In order to get it working with
> LBFGS, some hacks were done here and there, and in Updater all the
> components, including the intercept, are regularized, which is not
> desirable in many cases. Also, in the legacy design, it's hard for us to do
> in-place standardization to improve the convergence rate. As a result, at
> some point we decided to ditch those abstractions and customize them for
> each algorithm. (Even LiR and LoR use different tricks to get better
> performance in numerical optimization, so it was hard to share code at that
> time. But I can see the point that we have working code now, so it's time
> to try to refactor that code to share more.)
>
> Sincerely,
>
> DB Tsai
> ----------------------------------------------------------
> Blog: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D
>
> On Mon, Oct 12, 2015 at 1:24 AM, YiZhi Liu <javeli...@gmail.com> wrote:
>>
>> Hi Joseph,
>>
>> Thank you for clarifying the motivation for setting up a different API
>> for ml pipelines; it sounds great. But I still think we could extract
>> some common parts of the training & inference procedures for ml and
>> mllib. In ml.classification.LogisticRegression, you simply transform
>> the DataFrame into an RDD and follow the same procedures as in
>> mllib.optimization.{LBFGS,OWLQN}, right?
>>
>> My suggestion is, if I may, that the ml package should focus on the
>> public API and leave the underlying implementations, e.g. numerical
>> optimization, to the mllib package.
>>
>> Please let me know if I have misunderstood anything. Thank you!
>>
>> 2015-10-08 1:15 GMT+08:00 Joseph Bradley <jos...@databricks.com>:
>> > Hi YiZhi Liu,
>> >
>> > The spark.ml classes are part of the higher-level "Pipelines" API, which
>> > works with DataFrames. When creating this API, we decided to separate it
>> > from the old API to avoid confusion. You can read more about it here:
>> > http://spark.apache.org/docs/latest/ml-guide.html
>> >
>> > For (3): We use Breeze, but we have to modify it in order to do
>> > distributed optimization based on Spark.
>> >
>> > Joseph
>> >
>> > On Tue, Oct 6, 2015 at 11:47 PM, YiZhi Liu <javeli...@gmail.com> wrote:
>> >>
>> >> Hi everyone,
>> >>
>> >> I'm curious about the difference between
>> >> ml.classification.LogisticRegression and
>> >> mllib.classification.LogisticRegressionWithLBFGS. Both of them are
>> >> optimized using LBFGS; the only difference I see is that
>> >> LogisticRegression takes a DataFrame while LogisticRegressionWithLBFGS
>> >> takes an RDD.
>> >>
>> >> So I wonder:
>> >> 1. Why not simply add a DataFrame training interface to
>> >> LogisticRegressionWithLBFGS?
>> >> 2. What's the difference between the ml.classification and
>> >> mllib.classification packages?
>> >> 3. Why doesn't ml.classification.LogisticRegression call
>> >> mllib.optimization.LBFGS / mllib.optimization.OWLQN directly? Instead,
>> >> it uses breeze.optimize.LBFGS and re-implements most of the procedures
>> >> in mllib.optimization.{LBFGS,OWLQN}.
>> >>
>> >> Thank you.
>> >>
>> >> Best,
>> >>
>> >> --
>> >> Yizhi Liu
>> >> Senior Software Engineer / Data Mining
>> >> www.mvad.com, Shanghai, China
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> >> For additional commands, e-mail: user-h...@spark.apache.org

--
Yizhi Liu
Senior Software Engineer / Data Mining
www.mvad.com, Shanghai, China

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
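
A note on DB Tsai's point in the thread that the MLlib Updater regularizes every component, intercept included: the effect can be illustrated with a small, self-contained sketch. This is not Spark code and does not reflect how either package is implemented. It is plain gradient descent on a one-feature logistic regression with an L2 penalty. The names fit_logistic and penalize_intercept, the toy data, and the values of reg, lr, and steps are all assumptions made up for this illustration.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, reg, penalize_intercept, steps=5000, lr=0.1):
    """Minimize mean log-loss + (reg / 2) * L2 penalty by gradient descent.

    The penalty always applies to the weight w. It applies to the
    intercept b only when penalize_intercept is True (hypothetical flag,
    used here to contrast the two behaviours discussed in the thread).
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y
            grad_w += err * x
            grad_b += err
        grad_w = grad_w / n + reg * w
        grad_b = grad_b / n + (reg * b if penalize_intercept else 0.0)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data (made up): roughly centered feature, heavily imbalanced labels,
# so a good fit needs a clearly positive intercept.
xs = [-2.0, -1.5, -1.0, 0.0, 0.5, 1.0, 1.5, 2.0]
ys = [0, 1, 1, 1, 1, 1, 1, 1]

w_all, b_all = fit_logistic(xs, ys, reg=0.5, penalize_intercept=True)
w_free, b_free = fit_logistic(xs, ys, reg=0.5, penalize_intercept=False)

print("intercept penalized:   w=%.3f b=%.3f" % (w_all, b_all))
print("intercept unpenalized: w=%.3f b=%.3f" % (w_free, b_free))
```

On this toy data the penalized intercept is pulled toward zero, so it cannot fully absorb the class imbalance; leaving the intercept out of the penalty avoids that, which is one reason regularizing it is undesirable in many cases.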