I went ahead and created JIRAs.

JIRA for Hierarchical Clustering:
https://issues.apache.org/jira/browse/SPARK-2429

JIRA for Standardized Clustering APIs:
https://issues.apache.org/jira/browse/SPARK-2430

Before submitting a PR for the standardized API, I want to implement a
few clustering algorithms myself to get a good "feel" for how to
structure such an API. If others with more experience want to dive into
designing the API, though, that would let us get moving more quickly.

On Wed, Jul 9, 2014 at 8:39 AM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
> Cool, seems like a good initiative. Adding a couple of extra
> high-quality clustering implementations will be great.
>
> I'd say it would make the most sense to submit a PR for the
> standardized API first, agree on that with everyone, and then build on
> it for the specific implementations.
> --
> Sent from Mailbox
>
> On Wed, Jul 9, 2014 at 2:15 PM, RJ Nowling <rnowl...@gmail.com> wrote:
>
>> Thanks everyone for the input.
>>
>> So it seems what people want is:
>>
>> * Implement MiniBatch KMeans and Hierarchical KMeans
>>   (divide-and-conquer approach; see the DecisionTree implementation
>>   for reference)
>> * Restructure the three k-means clustering implementations to prevent
>>   code duplication and conform to a consistent API where possible
>>
>> If this is correct, I'll start work on that. How would it be best to
>> structure it? Should I submit separate JIRAs / PRs for the refactoring
>> of the current code, MiniBatch KMeans, and Hierarchical KMeans, or
>> keep my current JIRA and PR for MiniBatch KMeans open and submit a
>> second JIRA and PR for Hierarchical KMeans that builds on top of it?
>>
>> Thanks!
>>
>> On Tue, Jul 8, 2014 at 5:44 PM, Hector Yee <hector....@gmail.com> wrote:
>>> Yeah, if one were to replace the objective function in DecisionTree
>>> with minimizing the variance of the leaf nodes, it would be a
>>> hierarchical clusterer.
>>>
>>> On Tue, Jul 8, 2014 at 2:12 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:
>>>
>>>> If you're thinking along these lines, have a look at the
>>>> DecisionTree implementation in MLlib. It uses the same idea and is
>>>> optimized to prevent multiple passes over the data by computing
>>>> several splits at each level of tree building. The tradeoff is
>>>> increased model state and computation per pass over the data, but
>>>> fewer total passes and hopefully lower communication overhead than,
>>>> say, shuffling around data that belongs to one cluster or another.
>>>> Something like that could work here as well.
>>>>
>>>> I'm not super-familiar with hierarchical k-means, though, so perhaps
>>>> there's a more efficient way to implement it.
>>>>
>>>> On Tue, Jul 8, 2014 at 2:06 PM, Hector Yee <hector....@gmail.com> wrote:
>>>>
>>>> > No, I was thinking more top-down:
>>>> >
>>>> > Assuming a distributed k-means system already exists, recursively
>>>> > apply the k-means algorithm to the data already partitioned by the
>>>> > previous level of k-means.
>>>> >
>>>> > I haven't been much of a fan of bottom-up approaches like HAC,
>>>> > mainly because they assume there is already an item-to-item
>>>> > distance metric. That makes it hard to cluster new content. The
>>>> > distances between sibling clusters are also hard to compute if you
>>>> > have thrown away the similarity matrix: do you count paths to the
>>>> > same parent node when computing distances between items in two
>>>> > adjacent nodes, for example? It is also a bit harder to distribute
>>>> > the computation for bottom-up approaches, as you already have to
>>>> > find the nearest neighbor of an item to begin the process.
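
To make the top-down scheme Hector describes concrete, here is a minimal
sketch that reuses MLlib's existing KMeans as the per-level clusterer.
The function name topDownCluster and its branching/depth parameters are
illustrative only, not an existing or proposed MLlib API:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // Recursively apply k-means, re-partitioning the data by the
    // previous level's cluster assignments (hypothetical sketch).
    def topDownCluster(
        data: RDD[Vector],
        branching: Int,   // k used at each level
        depth: Int): Seq[RDD[Vector]] = {
      if (depth == 0 || data.count() < branching) {
        // Too small or deep enough: this subset is a leaf cluster.
        Seq(data)
      } else {
        val model = KMeans.train(data, branching, 20 /* maxIterations */)
        // Split the data by assigned cluster and recurse on each part.
        (0 until branching).flatMap { c =>
          val part = data.filter(p => model.predict(p) == c).cache()
          topDownCluster(part, branching, depth - 1)
        }
      }
    }

Note this naive version makes one filtering pass per child at every
level; Evan's point about the DecisionTree implementation is precisely
that those per-level passes can be fused into one.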
>>>> > On Tue, Jul 8, 2014 at 1:59 PM, RJ Nowling <rnowl...@gmail.com> wrote:
>>>> >
>>>> > > The scikit-learn implementation may be of interest:
>>>> > >
>>>> > > http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward
>>>> > >
>>>> > > It's a bottom-up approach. The pair of clusters to merge is
>>>> > > chosen to minimize variance.
>>>> > >
>>>> > > Their code is under a BSD license, so it can be used as a
>>>> > > template.
>>>> > >
>>>> > > Is something like that what you were thinking, Hector?
>>>> > >
>>>> > > On Tue, Jul 8, 2014 at 4:50 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>>> > > > Sure. The more interesting problem here is choosing k at each
>>>> > > > level. Kernel methods seem to be the most promising.
>>>> > > >
>>>> > > > On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee <hector....@gmail.com> wrote:
>>>> > > >
>>>> > > >> No idea, never looked it up. I've always just implemented it
>>>> > > >> as running k-means again on each cluster.
>>>> > > >>
>>>> > > >> FWIW, standard k-means with Euclidean distance also has
>>>> > > >> problems with some dimensionality-reduction methods. Swapping
>>>> > > >> out the distance metric for negative dot product or cosine may
>>>> > > >> help.
>>>> > > >>
>>>> > > >> Another useful kind of clustering would be hierarchical SVD.
>>>> > > >> The reason I like hierarchical clustering is that it makes for
>>>> > > >> faster inference, especially over billions of users.
>>>> > > >>
>>>> > > >> On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>>> > > >>
>>>> > > >> > Hector, could you share the references for hierarchical
>>>> > > >> > k-means? Thanks.
>>>> > > >> >
>>>> > > >> > On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee <hector....@gmail.com> wrote:
>>>> > > >> >
>>>> > > >> > > I would say for big-data applications the most useful
>>>> > > >> > > would be hierarchical k-means with backtracking and the
>>>> > > >> > > ability to support the k nearest centroids.
>>>> > > >> > >
>>>> > > >> > > On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling <rnowl...@gmail.com> wrote:
>>>> > > >> > >
>>>> > > >> > > > Hi all,
>>>> > > >> > > >
>>>> > > >> > > > MLlib currently has one clustering algorithm
>>>> > > >> > > > implementation, KMeans. It would benefit from having
>>>> > > >> > > > implementations of other clustering algorithms such as
>>>> > > >> > > > MiniBatch KMeans, Fuzzy C-Means, Hierarchical
>>>> > > >> > > > Clustering, and Affinity Propagation.
>>>> > > >> > > >
>>>> > > >> > > > I recently submitted a PR [1] for a MiniBatch KMeans
>>>> > > >> > > > implementation, and I saw an email on this list about
>>>> > > >> > > > interest in implementing Fuzzy C-Means.
>>>> > > >> > > >
>>>> > > >> > > > Based on Sean Owen's review of my MiniBatch KMeans code,
>>>> > > >> > > > it became apparent that before I implement more
>>>> > > >> > > > clustering algorithms, it would be useful to hammer out
>>>> > > >> > > > a framework to reduce code duplication and implement a
>>>> > > >> > > > consistent API.
>>>> > > >> > > >
>>>> > > >> > > > I'd like to gauge the interest and goals of the MLlib
>>>> > > >> > > > community:
>>>> > > >> > > >
>>>> > > >> > > > 1. Are you interested in having more clustering
>>>> > > >> > > > algorithms available?
>>>> > > >> > > >
>>>> > > >> > > > 2. Is the community interested in specifying a common
>>>> > > >> > > > framework?
>>>> > > >> > > >
>>>> > > >> > > > Thanks!
>>>> > > >> > > > RJ
>>>> > > >> > > >
>>>> > > >> > > > [1] - https://github.com/apache/spark/pull/1248
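
On the common-framework question, a strawman for what the shared
surface could look like, loosely modeled on the KMeans/KMeansModel
split MLlib uses today. Every name below is hypothetical, meant only to
make the discussion concrete, not to propose a final design:

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // Strawman shared interface for fitted clustering models.
    trait ClusteringModel extends Serializable {
      /** Number of clusters in the fitted model. */
      def k: Int

      /** Index of the cluster this point is assigned to. */
      def predict(point: Vector): Int

      /** Bulk assignment, shared by all models for free. */
      def predict(points: RDD[Vector]): RDD[Int] =
        points.map(p => predict(p))
    }

    // Strawman shared interface for the training side. Algorithm-specific
    // knobs (k, iteration counts, mini-batch fraction, ...) would live on
    // the concrete class as setters, the way KMeans does today.
    trait ClusteringAlgorithm[M <: ClusteringModel] extends Serializable {
      def run(data: RDD[Vector]): M
    }

Something this narrow would let MiniBatch KMeans, Hierarchical KMeans,
and the existing KMeans share evaluation and prediction code without
constraining how each algorithm is trained.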
--
em rnowl...@gmail.com
c 954.496.2314
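
A last footnote on Hector's distance-metric point up-thread: MLlib's
KMeans is hard-wired to squared Euclidean distance, so experimenting
with cosine or negative dot product means writing the assignment step
yourself. A self-contained toy version of cosine-based assignment,
where assignByCosine and the hand-rolled dot/norm helpers are made up
for illustration and are not an MLlib API:

    import org.apache.spark.mllib.linalg.Vector

    // Dot product and L2 norm written out by hand to keep the sketch
    // self-contained.
    def dot(a: Vector, b: Vector): Double =
      a.toArray.zip(b.toArray).map { case (x, y) => x * y }.sum

    def norm(a: Vector): Double = math.sqrt(dot(a, a))

    // Assign a point to the centroid with the highest cosine similarity
    // instead of the smallest Euclidean distance.
    def assignByCosine(point: Vector, centroids: Array[Vector]): Int =
      centroids.indices.maxBy { i =>
        dot(point, centroids(i)) /
          (norm(point) * norm(centroids(i)) + 1e-12)
      }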