Re: Any plans for new clustering algorithms?

Sandy Ryza Mon, 21 Apr 2014 18:16:09 -0700

I thought this might be a good thing to add to the wiki's "How to contribute"
page <https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark>,
as it's not tied to a release.



On Mon, Apr 21, 2014 at 6:09 PM, Xiangrui Meng <men...@gmail.com> wrote:

> The markdown files are under spark/docs. You can submit a PR for
> changes. -Xiangrui
>
> On Mon, Apr 21, 2014 at 6:01 PM, Sandy Ryza <sandy.r...@cloudera.com>
> wrote:
> > How do I get permissions to edit the wiki?
> >
> >
> > On Mon, Apr 21, 2014 at 3:19 PM, Xiangrui Meng <men...@gmail.com> wrote:
> >
> >> Cannot agree more with your words. Could you add one section about
> >> "how and what to contribute" to MLlib's guide? -Xiangrui
> >>
> >> On Mon, Apr 21, 2014 at 1:41 PM, Nick Pentreath
> >> <nick.pentre...@gmail.com> wrote:
> >> > I'd say a section in the "how to contribute" page would be a good
> >> > place to put this.
> >> >
> >> > In general I'd say that the criteria for including an algorithm are
> >> > that it should be high quality, widely known, used and accepted
> >> > (with citations and concrete use cases as evidence of this), scalable
> >> > and parallelizable, well documented, and with a reasonable expectation
> >> > of dev support.
> >> >
> >> > Sent from my iPhone
> >> >
> >> >> On 21 Apr 2014, at 19:59, Sandy Ryza <sandy.r...@cloudera.com> wrote:
> >> >>
> >> >> If it's not done already, would it make sense to codify this
> >> >> philosophy somewhere?  I imagine this won't be the first time this
> >> >> discussion comes up, and it would be nice to have a doc to point to.
> >> >> I'd be happy to take a stab at this.
> >> >>
> >> >>
> >> >>> On Mon, Apr 21, 2014 at 10:54 AM, Xiangrui Meng <men...@gmail.com> wrote:
> >> >>>
> >> >>> +1 on Sean's comment. MLlib covers the basic algorithms but we
> >> >>> definitely need to spend more time on how to make the design
> >> >>> scalable. For example, think about the current
> >> >>> "ProblemWithAlgorithm" naming scheme. That being said, new
> >> >>> algorithms are welcome. I hope they are well-established and
> >> >>> well-understood by users. They shouldn't be research algorithms
> >> >>> tuned to work well with a particular dataset but not tested widely.
> >> >>> See the change log from Mahout:
> >> >>>
> >> >>> ===
> >> >>> The following algorithms that were marked deprecated in 0.8 have
> >> >>> been removed in 0.9:
> >> >>>
> >> >>> From Clustering:
> >> >>>  Switched LDA implementation from using Gibbs Sampling to Collapsed
> >> >>> Variational Bayes (CVB)
> >> >>> Meanshift
> >> >>> MinHash - removed due to poor performance, lack of support and lack
> >> >>> of usage
> >> >>>
> >> >>> From Classification (both are sequential implementations)
> >> >>> Winnow - lack of actual usage and support
> >> >>> Perceptron - lack of actual usage and support
> >> >>>
> >> >>> Collaborative Filtering
> >> >>>    SlopeOne implementations in
> >> >>> org.apache.mahout.cf.taste.hadoop.slopeone and
> >> >>> org.apache.mahout.cf.taste.impl.recommender.slopeone
> >> >>>    Distributed pseudo recommender in
> >> >>> org.apache.mahout.cf.taste.hadoop.pseudo
> >> >>>    TreeClusteringRecommender in
> >> >>> org.apache.mahout.cf.taste.impl.recommender
> >> >>>
> >> >>> Mahout Math
> >> >>>    Hadoop entropy stuff in org.apache.mahout.math.stats.entropy
> >> >>> ===
> >> >>>
> >> >>> In MLlib, we should include the algorithms that users know how to
> >> >>> use and that we can support, rather than letting algorithms come
> >> >>> and go.
> >> >>>
> >> >>> My $0.02,
> >> >>> Xiangrui
> >> >>>
> >> >>>> On Mon, Apr 21, 2014 at 10:23 AM, Sean Owen <so...@cloudera.com> wrote:
> >> >>>>> On Mon, Apr 21, 2014 at 6:03 PM, Paul Brown <p...@mult.ifario.us> wrote:
> >> >>>>> - MLlib as Mahout.next would be unfortunate.  There are some gems
> >> >>>>> in Mahout, but there are also lots of rocks.  Setting a minimal
> >> >>>>> bar of working, correctly implemented, and documented requires a
> >> >>>>> surprising amount of work.
> >> >>>>
> >> >>>> As someone with first-hand knowledge, this is correct. To Sang's
> >> >>>> question, I can't see value in 'porting' Mahout since it is based
> >> >>>> on a quite different paradigm. About the only part that translates
> >> >>>> is the algorithm concept itself.
> >> >>>>
> >> >>>> This is also the cautionary tale. The contents of the project have
> >> >>>> ended up being a number of "drive-by" contributions of
> >> >>>> implementations that, while individually perhaps brilliant
> >> >>>> (perhaps), didn't necessarily match any other implementation in
> >> >>>> structure, input/output, libraries used. The implementations were
> >> >>>> often a touch academic. The result was hard to document, maintain,
> >> >>>> evolve or use.
> >> >>>>
> >> >>>> Far more of the structure of the MLlib implementations is
> >> >>>> consistent by virtue of being built around Spark core already.
> >> >>>> That's great.
> >> >>>>
> >> >>>> One can't wait to completely build the foundation before building
> >> >>>> any implementations. To me, the existing implementations are almost
> >> >>>> exactly the basics I would choose. They cover the bases and will
> >> >>>> exercise the abstractions and structure. So that's also great IMHO.
> >> >>>
> >>
>
