I totally agree, and we also discovered some drawbacks in the
implementation of the classification models that are based on GLMs:

- There is no distinction between predicting scores, classes, and
calibrated scores (probabilities). For these models it is common to need
access to all three, and the prediction function ``predict`` should be
consistent and stateless. Currently, the score is only available after
removing the threshold from the model. (A rough sketch of an interface
covering the points in this list follows below.)
- There is no distinction between multinomial and binomial classification.
For multinomial problems, it is necessary to handle multiple weight vectors
and multiple confidences.
- Models are not serializable, which makes it hard to use them in
practice.
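
To make this concrete, here is a minimal sketch of what such an
interface could look like. All names are illustrative; this is not the
current MLlib API:

    import org.apache.spark.mllib.linalg.Vector

    // Illustrative only: one stateless model exposes raw scores, hard
    // class labels, and calibrated probabilities side by side, and the
    // array-valued methods cover the multinomial case as well as the
    // binomial one.
    trait ProbabilisticClassificationModel extends Serializable {
      // Number of classes: 2 for binomial, > 2 for multinomial.
      def numClasses: Int

      // Raw, uncalibrated score per class (e.g. the GLM margin).
      def predictScores(features: Vector): Array[Double]

      // Calibrated per-class probabilities that sum to 1.
      def predictProbabilities(features: Vector): Array[Double]

      // Hard class label, derived from the scores without any mutable
      // threshold state on the model.
      def predictClass(features: Vector): Int =
        predictScores(features).zipWithIndex.maxBy(_._1)._2
    }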

I started a pull request [1] some time ago. I would be happy to continue
the discussion and clarify the interfaces, too!

Cheers, Christoph

[1] https://github.com/apache/spark/pull/2137/

2014-09-12 11:11 GMT+02:00 Egor Pahomov <pahomov.e...@gmail.com>:

> Here at Yandex, while implementing gradient boosting in Spark and
> building our ML tool for internal use, we found the following serious
> problems in MLlib:
>
>
>    - There is no Regression/Classification model abstraction. We were
>    building abstract data processing pipelines that should work with any
>    regression, with the exact algorithm specified outside that code. There
>    is no abstraction that allows this. (This is the main reason for all
>    the further problems; a sketch of such an abstraction follows below.)
>    - There is no common practice in MLlib for testing algorithms: every
>    model generates its own random test data. There are no easily
>    extractable test cases applicable to other algorithms, and there are
>    no benchmarks for comparing algorithms. After implementing a new
>    algorithm, it is very hard to understand how it should be tested.
>    - Lack of serialization testing: MLlib algorithms don't contain tests
>    that verify a model still works after serialization.
>    - While implementing a new algorithm, it is hard to understand what
>    API you should create and which interface to implement.
>
> The starting point for solving all these problems is to create common
> interfaces for the typical algorithms/models: regression,
> classification, clustering, and collaborative filtering.
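>
> Something like the following could be a starting point (a sketch only;
> the names are illustrative, not the actual MLlib API):
>
>     import org.apache.spark.mllib.linalg.Vector
>     import org.apache.spark.rdd.RDD
>
>     // Illustrative sketch: an algorithm-independent contract that
>     // pipelines and shared tests can be written against.
>     trait RegressionModel extends Serializable {
>       // Predict a real-valued target for a single feature vector.
>       def predict(features: Vector): Double
>
>       // Bulk prediction derived from the pointwise method.
>       def predict(data: RDD[Vector]): RDD[Double] =
>         data.map(v => predict(v))
>     }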
>
> All the main tests should be written against these interfaces, so that
> when a new algorithm is implemented, all it has to do is pass the
> already-written tests. That would give us manageable quality across the
> whole library; see the check sketched below for an example.
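>
> For example, a reusable check against the sketched trait above might
> look like this (illustrative only):
>
>     import org.apache.spark.mllib.linalg.Vectors
>
>     // Written once against the interface; every new implementation
>     // only has to plug itself in.
>     def checkRegressionModel(model: RegressionModel): Unit = {
>       val x = Vectors.dense(1.0, 2.0)
>       val y = model.predict(x)
>       assert(!y.isNaN, "prediction should be a real number")
>       assert(y == model.predict(x), "predict should be stateless")
>     }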
>
> There should be a couple of benchmarks that let a new Spark user get a
> feeling for which algorithm to use.
>
> The test set against these abstractions should contain a serialization
> test: in production there is rarely any use for a model that cannot be
> stored. A sketch of such a test follows.
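>
> Such a round-trip check could be as simple as this sketch, assuming
> the trait above and plain Java serialization:
>
>     import java.io._
>     import org.apache.spark.mllib.linalg.Vector
>
>     def checkSerialization(model: RegressionModel, x: Vector): Unit = {
>       val buf = new ByteArrayOutputStream()
>       val out = new ObjectOutputStream(buf)
>       out.writeObject(model)
>       out.close()
>       val in = new ObjectInputStream(
>         new ByteArrayInputStream(buf.toByteArray))
>       val restored = in.readObject().asInstanceOf[RegressionModel]
>       // The restored model must predict exactly like the original.
>       assert(model.predict(x) == restored.predict(x))
>     }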
>
> As the first step of this roadmap, I'd like to create a trait
> RegressionModel, ADD methods to the current algorithms so that they
> implement this trait, and create some tests against it. I plan to do
> this next week.
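>
> To illustrate with a hypothetical model (the real MLlib classes may
> differ), retrofitting would mostly mean declaring the trait:
>
>     import org.apache.spark.mllib.linalg.Vector
>
>     class MyLinearModel(weights: Vector, intercept: Double)
>         extends RegressionModel {
>       // Plain dot product plus intercept, written out for clarity.
>       def predict(features: Vector): Double =
>         weights.toArray.zip(features.toArray)
>           .map { case (w, v) => w * v }.sum + intercept
>     }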
>
> The purpose of this letter is to collect any objections to this
> approach at an early stage: please give any feedback. The second reason
> is to claim this activity so we don't do the same thing twice: I'll
> create a pull request by the end of next week, and any parallel
> development can start from there.
>
>
>
> --
> Sincerely yours,
> Egor Pakhomov
> Scala Developer, Yandex
>
