Xiangrui can comment more, but I believe he and Joseph are actually
working on a standardized interface and a pipeline feature for the 1.2 release.

On Fri, Sep 12, 2014 at 8:20 AM, Egor Pahomov <pahomov.e...@gmail.com>
wrote:

> Some architect suggestions on this matter -
> https://github.com/apache/spark/pull/2371
>
> 2014-09-12 16:38 GMT+04:00 Egor Pahomov <pahomov.e...@gmail.com>:
>
> > Sorry, I miswrote - I meant the learners part of the framework - the models
> > already exist.
> >
> > 2014-09-12 15:53 GMT+04:00 Christoph Sawade <
> > christoph.saw...@googlemail.com>:
> >
> >> I totally agree, and we also discovered some drawbacks in the
> >> classification model implementations that are based on GLMs:
> >>
> >> - There is no distinction between predicting scores, classes, and
> >> calibrated scores (probabilities). For these models it is common to have
> >> access to all of them, and the prediction function ``predict`` should be
> >> consistent and stateless. Currently, the score is only available after
> >> removing the threshold from the model.
> >> - There is no distinction between multinomial and binomial
> >> classification. For multinomial problems, it is necessary to handle
> >> multiple weight vectors and multiple confidences.
> >> - Models are not serialisable, which makes it hard to use them in
> >> practice.
> >>
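> >> For illustration only, a model interface along the following lines would
> >> address all three points (the names are hypothetical, not the actual API of
> >> the pull request):
> >>
> >>     import org.apache.spark.mllib.linalg.Vector
> >>
> >>     trait ProbabilisticClassificationModel extends Serializable {
> >>       /** Number of classes: 2 for binomial, more for multinomial. */
> >>       def numClasses: Int
> >>
> >>       /** Raw, uncalibrated score per class (e.g. margins). */
> >>       def predictScores(features: Vector): Array[Double]
> >>
> >>       /** Calibrated scores, i.e. class probabilities. */
> >>       def predictProbabilities(features: Vector): Array[Double]
> >>
> >>       /** Predicted class label, derived statelessly from the scores. */
> >>       def predictClass(features: Vector): Double
> >>     }
> >>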
> >> I started a pull request [1] some time ago. I would be happy to continue
> >> the discussion and clarify the interfaces, too!
> >>
> >> Cheers, Christoph
> >>
> >> [1] https://github.com/apache/spark/pull/2137/
> >>
> >> 2014-09-12 11:11 GMT+02:00 Egor Pahomov <pahomov.e...@gmail.com>:
> >>
> >>> Here in Yandex, during the implementation of gradient boosting in Spark and
> >>> the creation of our ML tool for internal use, we found the following serious
> >>> problems in MLlib:
> >>>
> >>>
> >>>    - There is no Regression/Classification model abstraction. We were
> >>>    building abstract data processing pipelines, which should work with just
> >>>    some regression, the exact algorithm being specified outside this code.
> >>>    There is no abstraction which would allow me to do that. *(This is the
> >>>    main reason for all further problems.)*
> >>>    - There is no common practice in MLlib for testing algorithms: every
> >>>    model generates its own random test data. There are no easily extractable
> >>>    test cases applicable to another algorithm. There are no benchmarks for
> >>>    comparing algorithms. After implementing a new algorithm it's very hard
> >>>    to understand how it should be tested.
> >>>    - Lack of serialization testing: MLlib algorithms don't contain tests
> >>>    which check that a model still works after serialization.
> >>>    - During the implementation of a new algorithm it's hard to understand
> >>>    what API you should create and which interface to implement.
> >>>
> >>> The starting point for solving all these problems is to create common
> >>> interfaces for the typical algorithms/models - regression, classification,
> >>> clustering, collaborative filtering.
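> >>>
> >>> Roughly, such a common interface could look like the following for
> >>> regression (a sketch only; the names and signatures are illustrative, not a
> >>> final proposal):
> >>>
> >>>     import org.apache.spark.mllib.linalg.Vector
> >>>     import org.apache.spark.rdd.RDD
> >>>
> >>>     trait RegressionModel extends Serializable {
> >>>       /** Predict the target for a single feature vector. */
> >>>       def predict(features: Vector): Double
> >>>
> >>>       /** Predict targets for an RDD of feature vectors. */
> >>>       def predict(features: RDD[Vector]): RDD[Double] =
> >>>         features.map(v => predict(v))
> >>>     }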
> >>>
> >>> All the main tests should be written against these interfaces, so that when
> >>> a new algorithm is implemented, all it has to do is pass the already written
> >>> tests. That lets us keep the quality manageable across the whole library.
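> >>>
> >>> For example, a check like the one below could be written once against the
> >>> interface and reused by every implementation (assuming the hypothetical
> >>> RegressionModel trait sketched above):
> >>>
> >>>     import org.apache.spark.mllib.regression.LabeledPoint
> >>>
> >>>     object RegressionModelChecks {
> >>>       /** Predictions on the given data must be within `tolerance` of the labels. */
> >>>       def checkFit(model: RegressionModel,
> >>>                    data: Seq[LabeledPoint],
> >>>                    tolerance: Double): Unit = {
> >>>         data.foreach { lp =>
> >>>           val prediction = model.predict(lp.features)
> >>>           assert(math.abs(prediction - lp.label) <= tolerance,
> >>>             s"prediction $prediction is too far from label ${lp.label}")
> >>>         }
> >>>       }
> >>>     }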
> >>>
> >>> There should be a couple of benchmarks which allow a new Spark user to get
> >>> a feeling for which algorithm to use.
> >>>
> >>> The test set against these abstractions should contain a serialization
> >>> test. In production, most of the time there is no use for a model which
> >>> can't be stored.
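> >>>
> >>> A generic serialization check could be as simple as a round trip through
> >>> Java serialization (again just a sketch):
> >>>
> >>>     import java.io._
> >>>
> >>>     object SerializationCheck {
> >>>       /** Round-trip a value through Java serialization and return the copy. */
> >>>       def roundTrip[T <: Serializable](value: T): T = {
> >>>         val buffer = new ByteArrayOutputStream()
> >>>         val out = new ObjectOutputStream(buffer)
> >>>         out.writeObject(value)
> >>>         out.close()
> >>>         val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
> >>>         try in.readObject().asInstanceOf[T] finally in.close()
> >>>       }
> >>>     }
> >>>
> >>> A test against the common interface could then assert that the original
> >>> model and the deserialized copy produce the same predictions.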
> >>>
> >>> As the first step of this roadmap I'd like to create a trait
> >>> RegressionModel, *ADD* methods to the current algorithms so that they
> >>> implement this trait, and create some tests against it. I'm planning to do
> >>> this next week.
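> >>>
> >>> For an existing algorithm this would mostly mean declaring the trait and
> >>> adding any missing methods, e.g. (sketch only, with a made-up class name and
> >>> assuming the hypothetical RegressionModel trait above):
> >>>
> >>>     import org.apache.spark.mllib.linalg.Vector
> >>>
> >>>     class SimpleLinearModel(val weights: Vector, val intercept: Double)
> >>>         extends RegressionModel {
> >>>       override def predict(features: Vector): Double = {
> >>>         val w = weights.toArray
> >>>         val x = features.toArray
> >>>         var dot = 0.0
> >>>         var i = 0
> >>>         while (i < w.length) {
> >>>           dot += w(i) * x(i)
> >>>           i += 1
> >>>         }
> >>>         dot + intercept
> >>>       }
> >>>     }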
> >>>
> >>> The purpose of this letter is to collect any objections to this approach at
> >>> an early stage: please give any feedback. The second reason is to put a lock
> >>> on this activity so we don't do the same thing twice: I'll create a pull
> >>> request by the end of next week, and any parallelism in development can
> >>> start from there.
> >>>
> >>>
> >>>
> >>> --
> >>>
> >>>
> >>>
> >>> *Sincerely yours, Egor Pakhomov, Scala Developer, Yandex*
> >>>
> >>
> >>
> >
> >
> > --
> >
> >
> >
> > *Sincerely yours, Egor Pakhomov, Scala Developer, Yandex*
> >
>
>
>
> --
>
>
>
> *Sincerely yours, Egor Pakhomov, Scala Developer, Yandex*
>
