Xiangrui can comment more, but I believe he and Joseph are actually working on a standardized interface and pipeline feature for the 1.2 release.
On Fri, Sep 12, 2014 at 8:20 AM, Egor Pahomov <pahomov.e...@gmail.com> wrote:

> Some architectural suggestions on this matter:
> https://github.com/apache/spark/pull/2371
>
> 2014-09-12 16:38 GMT+04:00 Egor Pahomov <pahomov.e...@gmail.com>:
>
> > Sorry, I miswrote - I meant the learners part of the framework - models
> > already exist.
> >
> > 2014-09-12 15:53 GMT+04:00 Christoph Sawade <christoph.saw...@googlemail.com>:
> >
> >> I totally agree, and we also discovered some drawbacks with the
> >> classification model implementations that are based on GLMs:
> >>
> >> - There is no distinction between predicting scores, classes, and
> >> calibrated scores (probabilities). For these models it is common to have
> >> access to all of them, and the prediction function ``predict`` should be
> >> consistent and stateless. Currently, the score is only available after
> >> removing the threshold from the model.
> >> - There is no distinction between multinomial and binomial
> >> classification. For multinomial problems, it is necessary to handle
> >> multiple weight vectors and multiple confidences.
> >> - Models are not serialisable, which makes it hard to use them in
> >> practise.
> >>
> >> I started a pull request [1] some time ago. I would be happy to continue
> >> the discussion and clarify the interfaces, too!
> >>
> >> Cheers, Christoph
> >>
> >> [1] https://github.com/apache/spark/pull/2137/
> >>
> >> 2014-09-12 11:11 GMT+02:00 Egor Pahomov <pahomov.e...@gmail.com>:
> >>
> >>> Here at Yandex, while implementing gradient boosting in Spark and
> >>> creating our ML tool for internal use, we found the following serious
> >>> problems in MLlib:
> >>>
> >>> - There is no Regression/Classification model abstraction. We were
> >>> building abstract data processing pipelines that should work with just
> >>> some regression - the exact algorithm specified outside this code.
> >>> There is no abstraction that allows me to do that. *(This is the main
> >>> reason for all further problems.)*
> >>> - There is no common practice in MLlib for testing algorithms: every
> >>> model generates its own random test data. There are no easily
> >>> extractable test cases applicable to another algorithm. There are no
> >>> benchmarks for comparing algorithms. After implementing a new algorithm
> >>> it's very hard to understand how it should be tested.
> >>> - Lack of serialization testing: MLlib algorithms don't contain tests
> >>> which check that models still work after serialization.
> >>> - During implementation of a new algorithm it's hard to understand
> >>> which API you should create and which interface to implement.
> >>>
> >>> The starting point for solving all these problems is to create a common
> >>> interface for typical algorithms/models - regression, classification,
> >>> clustering, collaborative filtering.
> >>>
> >>> All main tests should be written against these interfaces, so that when
> >>> a new algorithm is implemented, all it has to do is pass the already
> >>> written tests. That allows us to maintain manageable quality across the
> >>> whole library.
> >>>
> >>> There should be a couple of benchmarks which allow a new Spark user to
> >>> get a feeling for which algorithm to use.
> >>>
> >>> The test set against these abstractions should contain a serialization
> >>> test. In production, most of the time there is no use for a model that
> >>> can't be stored.
> >>>
> >>> As the first step of this roadmap I'd like to create a trait
> >>> RegressionModel, *add* methods to current algorithms to implement this
> >>> trait, and create some tests against it. I'm planning to do this next
> >>> week.
> >>>
> >>> The purpose of this letter is to collect any objections to this
> >>> approach at an early stage: please give any feedback. The second reason
> >>> is to put a lock on this activity so we don't do the same thing twice:
> >>> I'll create a pull request by the end of next week, and any parallel
> >>> development can start from there.
> >>>
> >>> --
> >>>
> >>> *Sincerely yours
> >>> Egor Pakhomov
> >>> Scala Developer, Yandex*
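For illustration, here is a rough, self-contained sketch of the kind of interfaces and interface-level tests discussed above. The names (PredictionModel, predictRaw, predictProbabilities, SerializationCheck, DummyLogisticModel) are purely illustrative and are not the actual MLlib API; plain Array[Double] stands in for MLlib's Vector/RDD types so the example runs on its own.

// Hypothetical sketch only - not MLlib code. Illustrates the three ideas from
// the thread: a common model abstraction, separate raw/calibrated/class
// predictions, and a serialization test written once against the interface.
import java.io._

// Common base: every model can produce a prediction for a single example.
trait PredictionModel extends Serializable {
  def predict(features: Array[Double]): Double
}

// Regression models simply predict a real value.
trait RegressionModel extends PredictionModel

// Classification models expose raw scores, calibrated probabilities, and the
// final class label separately, instead of hiding scores behind a threshold.
trait ClassificationModel extends PredictionModel {
  def numClasses: Int
  def predictRaw(features: Array[Double]): Array[Double]           // one score per class
  def predictProbabilities(features: Array[Double]): Array[Double] // calibrated scores
  override def predict(features: Array[Double]): Double =
    predictProbabilities(features).zipWithIndex.maxBy(_._1)._2.toDouble
}

// Trivial binomial model, only to make the generic check below runnable.
class DummyLogisticModel(weights: Array[Double]) extends ClassificationModel {
  val numClasses = 2
  def predictRaw(features: Array[Double]): Array[Double] = {
    val margin = weights.zip(features).map { case (w, x) => w * x }.sum
    Array(-margin, margin)
  }
  def predictProbabilities(features: Array[Double]): Array[Double] = {
    val p = 1.0 / (1.0 + math.exp(-predictRaw(features)(1)))
    Array(1.0 - p, p)
  }
}

// Serialization round trip written once against the base trait: serialize the
// model, read it back, and verify that predictions are unchanged.
object SerializationCheck {
  def roundTrip[M <: PredictionModel](model: M): M = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(model)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    in.readObject().asInstanceOf[M]
  }

  def main(args: Array[String]): Unit = {
    val model = new DummyLogisticModel(Array(0.5, -1.0, 2.0))
    val restored = roundTrip(model)
    val x = Array(1.0, 2.0, 3.0)
    assert(model.predict(x) == restored.predict(x), "prediction changed after round trip")
    println(s"prediction before/after round trip: ${model.predict(x)} / ${restored.predict(x)}")
  }
}

The point of writing the round trip once against the base trait is the "manageable quality" argument above: any new model that implements the interface gets the serialization test (and any other interface-level tests) for free, without model-specific test code.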