I totally agree, and we also discovered some drawbacks in the GLM-based classification model implementations:
- There is no distinction between predicting scores, classes, and
  calibrated scores (probabilities). It is common to need access to all
  three for these models, and the prediction function ``predict`` should
  be consistent and stateless. Currently, the score is only available
  after removing the threshold from the model. (A rough sketch of what
  such an interface could look like is at the end of this mail.)
- There is no distinction between binomial and multinomial
  classification. For multinomial problems, it is necessary to handle
  multiple weight vectors and multiple confidences.
- Models are not serializable, which makes them hard to use in practice.

I started a pull request [1] some time ago. I would be happy to continue
the discussion and clarify the interfaces, too!

Cheers,
Christoph

[1] https://github.com/apache/spark/pull/2137/

2014-09-12 11:11 GMT+02:00 Egor Pahomov <pahomov.e...@gmail.com>:

> Here at Yandex, while implementing gradient boosting in Spark and
> building our internal ML tool, we ran into the following serious
> problems in MLlib:
>
> - There is no Regression/Classification model abstraction. We were
>   building abstract data-processing pipelines that should work with any
>   regression model, with the exact algorithm specified outside the
>   pipeline code. There is no abstraction that allows this. *(This is
>   the root cause of all the problems below.)*
> - There is no common practice in MLlib for testing algorithms: every
>   model generates its own random test data. There are no easily
>   extractable test cases applicable to other algorithms, and no
>   benchmarks for comparing algorithms. After implementing a new
>   algorithm, it is very hard to know how it should be tested.
> - Lack of serialization testing: MLlib algorithms have no tests
>   verifying that a model still works after serialization.
> - When implementing a new algorithm, it is hard to know which API to
>   create and which interface to implement.
>
> Solving all these problems must start with creating a common interface
> for the typical algorithms/models: regression, classification,
> clustering, and collaborative filtering.
>
> All main tests should be written against these interfaces, so that a
> newly implemented algorithm only has to pass the already-written tests.
> That would let us maintain manageable quality across the whole library.
>
> There should be a couple of benchmarks that give a new Spark user a
> feeling for which algorithm to use.
>
> The test suite for these abstractions should include a serialization
> test. In production there is usually no use for a model that cannot be
> stored.
>
> As the first step of this roadmap, I'd like to create a trait
> RegressionModel, *add* methods to the current algorithms to implement
> this trait, and create some tests against it. I am planning to do this
> next week.
>
> The purpose of this letter is to collect any objections to this
> approach at an early stage: please give any feedback. The second reason
> is to claim this work so we don't do the same thing twice: I'll create
> a pull request by the end of next week, and we can coordinate any
> parallel development from there.
>
> --
> *Sincerely yours,*
> *Egor Pakhomov*
> *Scala Developer, Yandex*
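P.S. To make the interface discussion more concrete, here is a rough
sketch of the kind of prediction API I have in mind. The trait and
method names are hypothetical, not existing MLlib API; the PR [1]
explores a similar direction:

```scala
import org.apache.spark.mllib.linalg.Vector

// Hypothetical interface (not current MLlib API): it separates raw
// scores, calibrated scores (probabilities), and class predictions,
// handles the multinomial case via one score per class, and keeps
// predict stateless (no mutable threshold stored on the model).
trait ProbabilisticClassificationModel extends Serializable {

  /** Raw, uncalibrated score per class. */
  def predictRaw(features: Vector): Array[Double]

  /** Calibrated scores (probabilities), derived from the raw scores. */
  def predictProbabilities(features: Vector): Array[Double]

  /** Predicted class label: the index of the highest raw score. */
  def predict(features: Vector): Double =
    predictRaw(features).zipWithIndex.maxBy(_._1)._2.toDouble
}
```

The binomial case then falls out as the two-class special case, and all
three prediction views stay available at once instead of depending on
whether a threshold has been cleared.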
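Along the same lines, the serialization tests Egor mentions could be
written once against such a trait. A minimal sketch, assuming a
hypothetical RegressionModel trait as proposed in the quoted mail (the
helper name and test are illustrative, not current MLlib code):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream,
  ObjectInputStream, ObjectOutputStream}
import org.apache.spark.mllib.linalg.Vector

// Hypothetical common abstraction, mirroring the quoted proposal.
trait RegressionModel extends Serializable {
  def predict(features: Vector): Double
}

// Generic, reusable test: any RegressionModel implementation must
// produce identical predictions after a Java serialization round trip.
def testSerialization(model: RegressionModel, testData: Seq[Vector]): Unit = {
  val buf = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(buf)
  out.writeObject(model)
  out.close()
  val in = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
  val restored = in.readObject().asInstanceOf[RegressionModel]
  testData.foreach { v =>
    assert(model.predict(v) == restored.predict(v),
      s"prediction changed after serialization for $v")
  }
}
```

Writing this once against the interface would cover every current and
future implementation, instead of each algorithm shipping (or skipping)
its own serialization test.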