We typically post design docs on JIRAs before major work starts. For instance, I'm pretty sure SPARK-1856 will have a design doc posted shortly.
On Fri, Sep 12, 2014 at 12:10 PM, Erik Erlandson <e...@redhat.com> wrote:
>
> Are interface designs being captured anywhere as documents that the
> community can follow along with as the proposals evolve?
>
> I've worked on other open source projects where design docs were published
> as "living documents" (e.g. on Google Docs or Etherpad, but the particular
> mechanism isn't crucial). FWIW, I found that to be a good way to work in a
> community environment.
>
>
> ----- Original Message -----
>> Hi Egor,
>>
>> Thanks for the feedback! We are aware of some of the issues you
>> mentioned, and there are JIRAs created for them. Specifically, I'm
>> pushing out the design on pipeline features and algorithm/model
>> parameters this week. We can move our discussion to
>> https://issues.apache.org/jira/browse/SPARK-1856 .
>>
>> It would be nice to write tests against interfaces, but that definitely
>> needs more discussion before making PRs. For example, we discussed the
>> learning interfaces in Christoph's PR
>> (https://github.com/apache/spark/pull/2137/), but it takes time to
>> reach a consensus, especially on interfaces. Hopefully all of us can
>> benefit from the discussion. The best practice is to break the proposal
>> down into small independent pieces and discuss them on the JIRA before
>> submitting PRs.
>>
>> For performance tests, there is the spark-perf package
>> (https://github.com/databricks/spark-perf), and we added performance
>> tests for MLlib in v1.1. But definitely more work needs to be done.
>>
>> The dev list may not be a good place for discussion of the design;
>> could you create JIRAs for each of the issues you pointed out, so we
>> can track the discussion there? Thanks!
>>
>> Best,
>> Xiangrui
>>
>> On Fri, Sep 12, 2014 at 10:45 AM, Reynold Xin <r...@databricks.com> wrote:
>> > Xiangrui can comment more, but I believe he and Joseph are actually
>> > working on standardizing the interfaces and the pipeline feature for
>> > the 1.2 release.
>> >
>> > On Fri, Sep 12, 2014 at 8:20 AM, Egor Pahomov <pahomov.e...@gmail.com>
>> > wrote:
>> >
>> >> Some architectural suggestions on this matter:
>> >> https://github.com/apache/spark/pull/2371
>> >>
>> >> 2014-09-12 16:38 GMT+04:00 Egor Pahomov <pahomov.e...@gmail.com>:
>> >>
>> >> > Sorry, I miswrote - I meant the learners part of the framework;
>> >> > models already exist.
>> >> >
>> >> > 2014-09-12 15:53 GMT+04:00 Christoph Sawade <
>> >> > christoph.saw...@googlemail.com>:
>> >> >
>> >> >> I totally agree, and we also discovered some drawbacks with the
>> >> >> classification model implementations that are based on GLMs:
>> >> >>
>> >> >> - There is no distinction between predicting scores, classes, and
>> >> >> calibrated scores (probabilities). For these models it is common to
>> >> >> have access to all of them, and the prediction function ``predict``
>> >> >> should be consistent and stateless. Currently, the score is only
>> >> >> available after removing the threshold from the model.
>> >> >> - There is no distinction between multinomial and binomial
>> >> >> classification. For multinomial problems, it is necessary to handle
>> >> >> multiple weight vectors and multiple confidences.
>> >> >> - Models are not serializable, which makes it hard to use them in
>> >> >> practice.
>> >> >>
>> >> >> I started a pull request [1] some time ago. I would be happy to
>> >> >> continue the discussion and clarify the interfaces, too!
>> >> >>
>> >> >> Cheers, Christoph
>> >> >>
>> >> >> [1] https://github.com/apache/spark/pull/2137/
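For concreteness, the separation Christoph describes might look something like the sketch below. The trait and method names here are hypothetical illustrations, not existing MLlib API, and the model shown is a toy multinomial scorer rather than a real GLM implementation:

// Hypothetical sketch only, not existing MLlib API.
// Raw scores, calibrated probabilities, and hard class predictions are
// exposed separately, and all methods are stateless: no threshold has
// to be removed from the model to get at the raw score.
trait ClassificationModel extends Serializable {
  def numClasses: Int

  // Raw, uncalibrated score per class (e.g. the margin w^T x for a GLM).
  def predictScores(features: Array[Double]): Array[Double]

  // Calibrated scores: class probabilities that sum to 1.
  def predictProbabilities(features: Array[Double]): Array[Double]

  // Hard prediction, derived deterministically from the scores.
  def predictClass(features: Array[Double]): Int =
    predictScores(features).zipWithIndex.maxBy(_._1)._2
}

// Toy multinomial model: one weight vector per class, which is the
// part that binomial GLM-based models currently cannot express.
class MultinomialModel(weights: Array[Array[Double]])
    extends ClassificationModel {

  override def numClasses: Int = weights.length

  override def predictScores(features: Array[Double]): Array[Double] =
    weights.map(w => w.zip(features).map { case (wi, xi) => wi * xi }.sum)

  override def predictProbabilities(features: Array[Double]): Array[Double] = {
    val scores = predictScores(features)
    val max = scores.max // subtract the max for numerical stability
    val exps = scores.map(s => math.exp(s - max))
    val total = exps.sum
    exps.map(_ / total)
  }
}

A stateless, serializable interface along these lines would let callers choose raw scores, probabilities, or hard classes without mutating the model, and the per-class weight vectors cover the multinomial case directly.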
>> >> >> 2014-09-12 11:11 GMT+02:00 Egor Pahomov <pahomov.e...@gmail.com>:
>> >> >>
>> >> >>> Here at Yandex, while implementing gradient boosting in Spark and
>> >> >>> creating our ML tool for internal use, we found the following
>> >> >>> serious problems in MLlib:
>> >> >>>
>> >> >>> - There is no Regression/Classification model abstraction. We were
>> >> >>> building abstract data processing pipelines that should work with
>> >> >>> any regression, with the exact algorithm specified outside this
>> >> >>> code. There is no abstraction that allows this. *(This is the main
>> >> >>> reason for all the further problems.)*
>> >> >>> - There is no common practice in MLlib for testing algorithms:
>> >> >>> every model generates its own random test data. There are no
>> >> >>> easily extractable test cases applicable to other algorithms, and
>> >> >>> there are no benchmarks for comparing algorithms. After
>> >> >>> implementing a new algorithm, it's very hard to understand how it
>> >> >>> should be tested.
>> >> >>> - Lack of serialization testing: MLlib algorithms don't contain
>> >> >>> tests verifying that a model works after serialization.
>> >> >>> - While implementing a new algorithm, it's hard to understand what
>> >> >>> API you should create and which interface to implement.
>> >> >>>
>> >> >>> The starting point for solving all these problems is to create
>> >> >>> common interfaces for the typical algorithms/models: regression,
>> >> >>> classification, clustering, collaborative filtering.
>> >> >>>
>> >> >>> All the main tests should be written against these interfaces, so
>> >> >>> that when a new algorithm is implemented, all it has to do is pass
>> >> >>> the already written tests. That would let us maintain manageable
>> >> >>> quality across the whole library.
>> >> >>>
>> >> >>> There should be a couple of benchmarks which give a new Spark user
>> >> >>> a feeling for which algorithm to use.
>> >> >>>
>> >> >>> The test set against these abstractions should contain a
>> >> >>> serialization test. In production, there is rarely a use for a
>> >> >>> model which can't be stored.
>> >> >>>
>> >> >>> As the first step of this roadmap, I'd like to create a trait
>> >> >>> RegressionModel, *ADD* methods to the current algorithms to
>> >> >>> implement this trait, and create some tests against it. I'm
>> >> >>> planning to do this next week.
>> >> >>>
>> >> >>> The purpose of this letter is to collect any objections to this
>> >> >>> approach at an early stage: please give any feedback. The second
>> >> >>> reason is to put a lock on this activity so that we don't do the
>> >> >>> same thing twice: I'll create a pull request by the end of next
>> >> >>> week, and any parallel development can start from there.
>> >> >>>
>> >> >>> --
>> >> >>> Sincerely yours,
>> >> >>> Egor Pakhomov
>> >> >>> Scala Developer, Yandex
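As a strawman for that first step, the trait and a reusable serialization check could be as small as the following sketch. The names are placeholders, not an agreed-upon API; the eventual pull request would define the real one:

import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Placeholder names for illustration, not an agreed-upon MLlib API.
trait RegressionModel extends Serializable {
  def predict(features: Array[Double]): Double
}

// Shared test helper written once against the interface; any
// implementation of RegressionModel can reuse it.
object RegressionModelChecks {

  // Serializes the model to bytes and reads it back.
  def roundTrip(model: RegressionModel): RegressionModel = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(model)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[RegressionModel] finally in.close()
  }

  // Verifies that predictions are unchanged after a round trip.
  def checkSerialization(model: RegressionModel,
                         points: Seq[Array[Double]]): Boolean = {
    val restored = roundTrip(model)
    points.forall(p => model.predict(p) == restored.predict(p))
  }
}

Writing the round-trip check once against the interface would give every new implementation serialization coverage for free, which addresses the testing gaps listed above.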
>> >> >
>> >> > --
>> >> > Sincerely yours,
>> >> > Egor Pakhomov
>> >> > Scala Developer, Yandex
>> >>
>> >> --
>> >> Sincerely yours,
>> >> Egor Pakhomov
>> >> Scala Developer, Yandex