We typically post design docs on JIRAs before major work starts. For instance, I'm pretty sure SPARK-1856 will have a design doc posted shortly.
On Fri, Sep 12, 2014 at 12:10 PM, Erik Erlandson <e...@redhat.com> wrote:
>
> Are interface designs being captured anywhere as documents that the
> community can follow along with as the proposals evolve?
>
> I've worked on other open source projects where design docs were published
> as "living documents" (e.g. on Google Docs or Etherpad, but the particular
> mechanism isn't crucial). FWIW, I found that to be a good way to work in a
> community environment.
>
>
> ----- Original Message -----
>> Hi Egor,
>>
>> Thanks for the feedback! We are aware of some of the issues you
>> mentioned, and there are JIRAs created for them. Specifically, I'm
>> pushing out the design on pipeline features and algorithm/model
>> parameters this week. We can move our discussion to
>> https://issues.apache.org/jira/browse/SPARK-1856 .
>>
>> It would be nice to write tests against interfaces, but that definitely
>> needs more discussion before making PRs. For example, we discussed the
>> learning interfaces in Christoph's PR
>> (https://github.com/apache/spark/pull/2137/), but it takes time to
>> reach a consensus, especially on interfaces. Hopefully all of us can
>> benefit from the discussion. The best practice is to break the proposal
>> down into small independent pieces and discuss them on the JIRA before
>> submitting PRs.
>>
>> For performance tests, there is the spark-perf package
>> (https://github.com/databricks/spark-perf), and we added performance
>> tests for MLlib in v1.1. But definitely more work needs to be done.
>>
>> The dev list may not be a good place for discussion of the design;
>> could you create JIRAs for each of the issues you pointed out, so we
>> can track the discussion there? Thanks!
>>
>> Best,
>> Xiangrui
>>
>> On Fri, Sep 12, 2014 at 10:45 AM, Reynold Xin <r...@databricks.com> wrote:
>> > Xiangrui can comment more, but I believe he and Joseph are actually
>> > working on standardizing the interfaces and the pipeline feature for
>> > the 1.2 release.
>> >
>> > On Fri, Sep 12, 2014 at 8:20 AM, Egor Pahomov <pahomov.e...@gmail.com>
>> > wrote:
>> >
>> >> Some architectural suggestions on this matter:
>> >> https://github.com/apache/spark/pull/2371
>> >>
>> >> 2014-09-12 16:38 GMT+04:00 Egor Pahomov <pahomov.e...@gmail.com>:
>> >>
>> >> > Sorry, I miswrote - I meant the learners part of the framework;
>> >> > models already exist.
>> >> >
>> >> > 2014-09-12 15:53 GMT+04:00 Christoph Sawade <
>> >> > christoph.saw...@googlemail.com>:
>> >> >
>> >> >> I totally agree, and we also discovered some drawbacks with the
>> >> >> classification model implementations that are based on GLMs:
>> >> >>
>> >> >> - There is no distinction between predicting scores, classes, and
>> >> >> calibrated scores (probabilities). For these models it is common to
>> >> >> have access to all of them, and the prediction function ``predict``
>> >> >> should be consistent and stateless. Currently, the score is only
>> >> >> available after removing the threshold from the model.
>> >> >> - There is no distinction between multinomial and binomial
>> >> >> classification. For multinomial problems, it is necessary to handle
>> >> >> multiple weight vectors and multiple confidences.
>> >> >> - Models are not serializable, which makes it hard to use them in
>> >> >> practice.
>> >> >>
>> >> >> I started a pull request [1] some time ago. I would be happy to
>> >> >> continue the discussion and clarify the interfaces, too!
>> >> >>
>> >> >> Cheers, Christoph
>> >> >>
>> >> >> [1] https://github.com/apache/spark/pull/2137/
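For concreteness, the separation Christoph describes might look something like the sketch below. The trait and method names here are hypothetical illustrations, not existing MLlib API, and the model shown is a toy multinomial scorer rather than a real GLM implementation:

// Hypothetical sketch only, not existing MLlib API.
// Raw scores, calibrated probabilities, and hard class predictions are
// exposed separately, and all methods are stateless: no threshold has
// to be removed from the model to get at the raw score.
trait ClassificationModel extends Serializable {
  def numClasses: Int

  // Raw, uncalibrated score per class (e.g. the margin w^T x for a GLM).
  def predictScores(features: Array[Double]): Array[Double]

  // Calibrated scores: class probabilities that sum to 1.
  def predictProbabilities(features: Array[Double]): Array[Double]

  // Hard prediction, derived deterministically from the scores.
  def predictClass(features: Array[Double]): Int =
    predictScores(features).zipWithIndex.maxBy(_._1)._2
}

// Toy multinomial model: one weight vector per class, which is the
// part that binomial GLM-based models currently cannot express.
class MultinomialModel(weights: Array[Array[Double]])
    extends ClassificationModel {

  override def numClasses: Int = weights.length

  override def predictScores(features: Array[Double]): Array[Double] =
    weights.map(w => w.zip(features).map { case (wi, xi) => wi * xi }.sum)

  override def predictProbabilities(features: Array[Double]): Array[Double] = {
    val scores = predictScores(features)
    val max = scores.max // subtract the max for numerical stability
    val exps = scores.map(s => math.exp(s - max))
    val total = exps.sum
    exps.map(_ / total)
  }
}

A stateless, serializable interface along these lines would let callers choose raw scores, probabilities, or hard classes without mutating the model, and the per-class weight vectors cover the multinomial case directly.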
>> >> >> 2014-09-12 11:11 GMT+02:00 Egor Pahomov <pahomov.e...@gmail.com>:
>> >> >>
>> >> >>> Here at Yandex, while implementing gradient boosting in Spark and
>> >> >>> creating our ML tool for internal use, we found the following
>> >> >>> serious problems in MLlib:
>> >> >>>
>> >> >>> - There is no Regression/Classification model abstraction. We were
>> >> >>> building abstract data processing pipelines that should work with
>> >> >>> any regression, with the exact algorithm specified outside this
>> >> >>> code. There is no abstraction that allows this. *(This is the main
>> >> >>> reason for all the further problems.)*
>> >> >>> - There is no common practice in MLlib for testing algorithms:
>> >> >>> every model generates its own random test data. There are no
>> >> >>> easily extractable test cases applicable to other algorithms, and
>> >> >>> there are no benchmarks for comparing algorithms. After
>> >> >>> implementing a new algorithm, it's very hard to understand how it
>> >> >>> should be tested.
>> >> >>> - Lack of serialization testing: MLlib algorithms don't contain
>> >> >>> tests verifying that a model works after serialization.
>> >> >>> - While implementing a new algorithm, it's hard to understand what
>> >> >>> API you should create and which interface to implement.
>> >> >>>
>> >> >>> The starting point for solving all these problems is to create
>> >> >>> common interfaces for the typical algorithms/models: regression,
>> >> >>> classification, clustering, collaborative filtering.
>> >> >>>
>> >> >>> All the main tests should be written against these interfaces, so
>> >> >>> that when a new algorithm is implemented, all it has to do is pass
>> >> >>> the already written tests. That would let us maintain manageable
>> >> >>> quality across the whole library.
>> >> >>>
>> >> >>> There should be a couple of benchmarks which give a new Spark user
>> >> >>> a feeling for which algorithm to use.
>> >> >>>
>> >> >>> The test set against these abstractions should contain a
>> >> >>> serialization test. In production, there is rarely a use for a
>> >> >>> model which can't be stored.
>> >> >>>
>> >> >>> As the first step of this roadmap, I'd like to create a trait
>> >> >>> RegressionModel, *ADD* methods to the current algorithms to
>> >> >>> implement this trait, and create some tests against it. I'm
>> >> >>> planning to do this next week.
>> >> >>>
>> >> >>> The purpose of this letter is to collect any objections to this
>> >> >>> approach at an early stage: please give any feedback. The second
>> >> >>> reason is to put a lock on this activity so that we don't do the
>> >> >>> same thing twice: I'll create a pull request by the end of next
>> >> >>> week, and any parallel development can start from there.
>> >> >>>
>> >> >>> --
>> >> >>> Sincerely yours,
>> >> >>> Egor Pakhomov
>> >> >>> Scala Developer, Yandex
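As a strawman for that first step, the trait and a reusable serialization check could be as small as the following sketch. The names are placeholders, not an agreed-upon API; the eventual pull request would define the real one:

import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Placeholder names for illustration, not an agreed-upon MLlib API.
trait RegressionModel extends Serializable {
  def predict(features: Array[Double]): Double
}

// Shared test helper written once against the interface; any
// implementation of RegressionModel can reuse it.
object RegressionModelChecks {

  // Serializes the model to bytes and reads it back.
  def roundTrip(model: RegressionModel): RegressionModel = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(model)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[RegressionModel] finally in.close()
  }

  // Verifies that predictions are unchanged after a round trip.
  def checkSerialization(model: RegressionModel,
                         points: Seq[Array[Double]]): Boolean = {
    val restored = roundTrip(model)
    points.forall(p => model.predict(p) == restored.predict(p))
  }
}

Writing the round-trip check once against the interface would give every new implementation serialization coverage for free, which addresses the testing gaps listed above.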
>> >> >
>> >> > --
>> >> > Sincerely yours,
>> >> > Egor Pakhomov
>> >> > Scala Developer, Yandex
>> >>
>> >> --
>> >> Sincerely yours,
>> >> Egor Pakhomov
>> >> Scala Developer, Yandex