I thought this might be a good thing to add to the wiki's "How to contribute" page <https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark>, as it's not tied to a release.

On Mon, Apr 21, 2014 at 6:09 PM, Xiangrui Meng <men...@gmail.com> wrote:

> The markdown files are under spark/docs. You can submit a PR for
> changes. -Xiangrui
>
> On Mon, Apr 21, 2014 at 6:01 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>
> > How do I get permissions to edit the wiki?
> >
> > On Mon, Apr 21, 2014 at 3:19 PM, Xiangrui Meng <men...@gmail.com> wrote:
> >
> >> Cannot agree more with your words. Could you add one section about
> >> "how and what to contribute" to MLlib's guide? -Xiangrui
> >>
> >> On Mon, Apr 21, 2014 at 1:41 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
> >> > I'd say a section in the "how to contribute" page would be a good
> >> > place to put this.
> >> >
> >> > In general I'd say that the criteria for inclusion of an algorithm is
> >> > it should be high quality, widely known, used and accepted (citations
> >> > and concrete use cases as examples of this), scalable and
> >> > parallelizable, well documented and with reasonable expectation of
> >> > dev support
> >> >
> >> > Sent from my iPhone
> >> >
> >> >> On 21 Apr 2014, at 19:59, Sandy Ryza <sandy.r...@cloudera.com> wrote:
> >> >>
> >> >> If it's not done already, would it make sense to codify this philosophy
> >> >> somewhere? I imagine this won't be the first time this discussion comes
> >> >> up, and it would be nice to have a doc to point to. I'd be happy to take a
> >> >> stab at this.
> >> >>
> >> >>> On Mon, Apr 21, 2014 at 10:54 AM, Xiangrui Meng <men...@gmail.com> wrote:
> >> >>>
> >> >>> +1 on Sean's comment. MLlib covers the basic algorithms but we
> >> >>> definitely need to spend more time on how to make the design scalable.
> >> >>> For example, think about current "ProblemWithAlgorithm" naming scheme.
> >> >>> That being said, new algorithms are welcomed. I wish they are
> >> >>> well-established and well-understood by users. They shouldn't be
> >> >>> research algorithms tuned to work well with a particular dataset but
> >> >>> not tested widely. You see the change log from Mahout:
> >> >>>
> >> >>> ===
> >> >>> The following algorithms that were marked deprecated in 0.8 have been
> >> >>> removed in 0.9:
> >> >>>
> >> >>> From Clustering:
> >> >>> Switched LDA implementation from using Gibbs Sampling to Collapsed
> >> >>> Variational Bayes (CVB)
> >> >>> Meanshift
> >> >>> MinHash - removed due to poor performance, lack of support and lack of
> >> >>> usage
> >> >>>
> >> >>> From Classification (both are sequential implementations)
> >> >>> Winnow - lack of actual usage and support
> >> >>> Perceptron - lack of actual usage and support
> >> >>>
> >> >>> Collaborative Filtering
> >> >>> SlopeOne implementations in
> >> >>> org.apache.mahout.cf.taste.hadoop.slopeone and
> >> >>> org.apache.mahout.cf.taste.impl.recommender.slopeone
> >> >>> Distributed pseudo recommender in
> >> >>> org.apache.mahout.cf.taste.hadoop.pseudo
> >> >>> TreeClusteringRecommender in
> >> >>> org.apache.mahout.cf.taste.impl.recommender
> >> >>>
> >> >>> Mahout Math
> >> >>> Hadoop entropy stuff in org.apache.mahout.math.stats.entropy
> >> >>> ===
> >> >>>
> >> >>> In MLlib, we should include the algorithms users know how to use and
> >> >>> we can provide support rather than letting algorithms come and go.
> >> >>>
> >> >>> My $0.02,
> >> >>> Xiangrui
> >> >>>
> >> >>>> On Mon, Apr 21, 2014 at 10:23 AM, Sean Owen <so...@cloudera.com> wrote:
> >> >>>>> On Mon, Apr 21, 2014 at 6:03 PM, Paul Brown <p...@mult.ifario.us> wrote:
> >> >>>>> - MLlib as Mahout.next would be a unfortunate. There are some gems in
> >> >>>>> Mahout, but there are also lots of rocks. Setting a minimal bar of
> >> >>>>> working, correctly implemented, and documented requires a surprising
> >> >>>>> amount of work.
> >> >>>>
> >> >>>> As someone with first-hand knowledge, this is correct. To Sang's
> >> >>>> question, I can't see value in 'porting' Mahout since it is based on a
> >> >>>> quite different paradigm. About the only part that translates is the
> >> >>>> algorithm concept itself.
> >> >>>>
> >> >>>> This is also the cautionary tale. The contents of the project have
> >> >>>> ended up being a number of "drive-by" contributions of implementations
> >> >>>> that, while individually perhaps brilliant (perhaps), didn't
> >> >>>> necessarily match any other implementation in structure, input/output,
> >> >>>> libraries used. The implementations were often a touch academic. The
> >> >>>> result was hard to document, maintain, evolve or use.
> >> >>>>
> >> >>>> Far more of the structure of the MLlib implementations are consistent
> >> >>>> by virtue of being built around Spark core already. That's great.
> >> >>>>
> >> >>>> One can't wait to completely build the foundation before building any
> >> >>>> implementations. To me, the existing implementations are almost
> >> >>>> exactly the basics I would choose. They cover the bases and will
> >> >>>> exercise the abstractions and structure. So that's also great IMHO.