Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-08-27 Thread RJ Nowling
Thanks, Jeremy. I'm abandoning my initial approach, and I'll work on optimizing your example (so it doesn't do the breeze-vector conversions every time KMeans is called). I need to finish a few other projects first, though, so it may be a couple of weeks. In the meantime, Yu also created a JIRA fo

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-08-27 Thread Jeremy Freeman
Hey RJ, sorry for the delay. I'd be happy to take a look at this if you can post the code! I think splitting the largest cluster in each round is fairly common, but ideally it would be an option to do it one way or the other. -- Jeremy - jeremy freeman, phd neuroscientist

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-08-27 Thread RJ Nowling
Hi Yu, A standardized API has not been implemented yet. I think it would be better to implement the other clustering algorithms and then extract a common API. Others may feel differently. :) Just a note: there was a pre-existing JIRA for hierarchical KMeans, SPARK-2429

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-08-13 Thread Yu Ishikawa
Hi all, I am also interested in specifying a common framework. I am trying to implement hierarchical k-means and a single-link-style hierarchical clustering method with LSH. https://issues.apache.org/jira/browse/SPARK-2966 If you have designed the standardized clustering algorithms API, plea

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-08-12 Thread RJ Nowling
Hi all, I wanted to follow up. I have a prototype for an optimized version of hierarchical k-means. I wanted to get some feedback on my approach. Jeremy's implementation splits the largest cluster in each round. Is it better to do it that way or to split each cluster in half? Are there an
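The two splitting policies raised in this message can be compared with a minimal in-memory sketch. It is not the prototype mentioned above or Jeremy's implementation: the helper names (`two_means`, `split_largest`, `split_all`) and the deterministic seeding are assumptions made for illustration, and a Spark version would operate on distributed data rather than Python lists.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def two_means(points, iterations=10):
    """Split points into at most two groups with a small 2-means loop."""
    # Assumed seeding: the first point and the point farthest from it.
    centers = [points[0], max(points, key=lambda p: sq_dist(p, points[0]))]
    groups = [list(points), []]
    for _ in range(iterations):
        groups = [[], []]
        for p in points:
            nearer = 0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
            groups[nearer].append(p)
        if not groups[0] or not groups[1]:
            break  # degenerate split, e.g. all points identical
        centers = [mean(g) for g in groups]
    return [g for g in groups if g]


def split_largest(points, k):
    """Each round, bisect only the largest cluster until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        largest = max(clusters, key=len)
        clusters.remove(largest)
        parts = two_means(largest)
        clusters.extend(parts)
        if len(parts) < 2:
            break  # the largest cluster cannot be split further
    return clusters


def split_all(points, rounds):
    """Each round, bisect every cluster, doubling the count when possible."""
    clusters = [list(points)]
    for _ in range(rounds):
        clusters = [part for c in clusters for part in two_means(c)]
    return clusters


points = [(0, 0), (0, 1), (10, 0), (10, 1), (10, 2), (11, 1)]
by_largest = split_largest(points, 3)
by_halving = split_all(points, 2)
```

Splitting only the largest cluster allows any target k, while splitting every cluster yields up to 2^rounds clusters; which is preferable likely depends on how balanced the data is.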

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-18 Thread Jeremy Freeman
Hi RJ, that sounds like a great idea. I'd be happy to look over what you put together. -- Jeremy

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-18 Thread RJ Nowling
Nice to meet you, Jeremy! This is great! Hierarchical clustering was next on my list -- currently trying to get my PR for MiniBatch KMeans accepted. If it's cool with you, I'll try converting your code to fit in with the existing MLlib code as you suggest. I also need to review the Decision Tree

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-17 Thread Jeremy Freeman
Hi all, Cool discussion! I agree that a more standardized API for clustering, and easy access to underlying routines, would be useful (we've also been discussing this when trying to develop streaming clustering algorithms, similar to https://github.com/apache/spark/pull/1361). For divisive, hier

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-10 Thread Nick Pentreath
Might be worth checking out scikit-learn and Mahout to get some broad ideas.
On Thu, Jul 10, 2014 at 4:25 PM, RJ Nowling wrote:
> I went ahead and created JIRAs.
> JIRA for Hierarchical Clustering:
> https://issues.apache.org/jira/browse/SPARK-2429
> JIRA for Standardized Cluste

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-10 Thread RJ Nowling
I went ahead and created JIRAs. JIRA for Hierarchical Clustering: https://issues.apache.org/jira/browse/SPARK-2429 JIRA for Standardized Clustering APIs: https://issues.apache.org/jira/browse/SPARK-2430 Before submitting a PR for the standardized API, I want to implement a few clustering algorith

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-09 Thread Nick Pentreath
Cool, seems like a good initiative. Adding a couple of extra high-quality clustering implementations will be great. I'd say it would make most sense to submit a PR for the standardised API first, agree on that with everyone, and then build on it for the specific implementations.
On We

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-09 Thread RJ Nowling
Thanks everyone for the input. So it seems what people want is:
* Implement MiniBatch KMeans and Hierarchical KMeans (divide-and-conquer approach, look at the DecisionTree implementation as a reference)
* Restructure 3 KMeans clustering algorithm implementations to prevent code duplication and confor
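For readers unfamiliar with the MiniBatch KMeans item in the list above, here is a minimal sketch of the usual mini-batch update, where each center moves toward an assigned point with a per-center learning rate of 1/count. It is not the code from the pull request discussed in this thread; the function name, parameters, and seeding are assumptions chosen for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def minibatch_kmeans(points, k, batch_size, iterations, seed=0):
    """Return k centers fitted on random mini-batches of the data."""
    rng = random.Random(seed)
    # Assumed seeding: k distinct positions sampled from the data.
    centers = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k  # number of points each center has absorbed so far
    for _ in range(iterations):
        batch = [rng.choice(points) for _ in range(batch_size)]
        # Assign the whole batch against the current centers first.
        nearest = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in batch
        ]
        # Then nudge each center toward its points; the step shrinks as 1/count.
        for p, j in zip(batch, nearest):
            counts[j] += 1
            rate = 1.0 / counts[j]
            centers[j] = [(1 - rate) * c + rate * x for c, x in zip(centers[j], p)]
    return centers


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centers = minibatch_kmeans(points, k=2, batch_size=4, iterations=50)
```

Because every update is a convex combination of the old center and a data point, the centers stay inside the bounding box of the data; only a small batch is touched per iteration, which is the appeal for large datasets.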

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Hector Yee
Yeah, if one were to replace the objective function in the decision tree with minimizing the variance of the leaf nodes, it would be a hierarchical clusterer.
On Tue, Jul 8, 2014 at 2:12 PM, Evan R. Sparks wrote:
> If you're thinking along these lines, have a look at the DecisionTree
> implementation
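The idea of a decision tree whose split objective is leaf variance can be sketched as a single split search. This is an illustration, not MLlib's DecisionTree code: `best_variance_split` is a hypothetical helper, and it scores a split by the within-leaf sum of squared deviations (variance weighted by leaf size), which is an assumption about how "variance of the leaf nodes" would be measured.

```python
def sse(points):
    """Sum of squared deviations of the points from their mean."""
    if not points:
        return 0.0
    center = [sum(col) / len(points) for col in zip(*points)]
    return sum((x - c) ** 2 for p in points for x, c in zip(p, center))


def best_variance_split(points):
    """Return (cost, feature, threshold) of the axis-aligned split that
    minimizes the total within-leaf sum of squares, or None if no split exists."""
    best = None
    for feature in range(len(points[0])):
        values = sorted({p[feature] for p in points})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [p for p in points if p[feature] <= threshold]
            right = [p for p in points if p[feature] > threshold]
            cost = sse(left) + sse(right)
            if best is None or cost < best[0]:
                best = (cost, feature, threshold)
    return best


split = best_variance_split([(0, 0), (1, 0), (10, 0), (11, 0)])
```

Applying the same search recursively to each side would give a tree whose leaves are clusters; no labels are needed because the cost depends only on the features.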

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Evan R. Sparks
If you're thinking along these lines, have a look at the DecisionTree implementation in MLlib. It uses the same idea and is optimized to prevent multiple passes over the data by computing several splits at each level of tree building. The tradeoff is increased model state and computation per pass o

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Hector Yee
No, I was thinking more top-down: assuming a distributed k-means system already exists, recursively apply the k-means algorithm to data already partitioned by the previous level of k-means. I haven't been much of a fan of bottom-up approaches like HAC, mainly because they assume there is already a dis
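The top-down scheme described here, running k-means again on each partition produced by the previous level, can be sketched in memory as follows. The `kmeans` and `build_tree` helpers, the dictionary-based tree, and the first-k seeding are assumptions for illustration; in a distributed setting each level would presumably be its own job over the partitioned data.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def kmeans(points, k, iterations=10):
    """Plain Lloyd iterations; returns the non-empty groups of points."""
    centers = [tuple(p) for p in points[:k]]  # assumed seeding: first k points
    groups = [list(points)]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            groups[j].append(p)
        # Keep the old center when a group ends up empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return [g for g in groups if g]


def build_tree(points, k, max_depth):
    """Recursively cluster each partition produced by the level above."""
    node = {"center": mean(points), "size": len(points), "children": []}
    if max_depth > 0 and len(points) >= k:
        groups = kmeans(points, k)
        if len(groups) > 1:  # stop when the data cannot be separated
            node["children"] = [build_tree(g, k, max_depth - 1) for g in groups]
    return node


def count_leaves(node):
    """Number of leaf clusters in the tree."""
    if not node["children"]:
        return 1
    return sum(count_leaves(child) for child in node["children"])


tree = build_tree([(0,), (1,), (10,), (11,)], k=2, max_depth=2)
```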

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Hector Yee
K doesn't matter much; I've tried anything from 2^10 to 10^3 and the performance doesn't change much as measured by precision @ K (see Table 1 of http://machinelearning.wustl.edu/mlpapers/papers/weston13). Although 10^3 k-means did outperform 2^10 hierarchical SVD slightly in terms of the metrics, 2^10
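Note that "K" is used in two senses in this message: the number of clusters, and the cutoff in precision @ K. The metric itself is simple to state; the sketch below uses one common convention (divide by k even when fewer than k items are returned), which is an assumption and not necessarily the exact definition used in the linked paper.

```python
def precision_at_k(ranked_items, relevant_items, k):
    """Fraction of the top-k ranked items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant_items)
    return hits / k


# Two of the top three results ("a" and "c") are relevant.
score = precision_at_k(["a", "b", "c", "d"], {"a", "c", "z"}, k=3)
```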

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread RJ Nowling
The scikit-learn implementation may be of interest: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward It's a bottom-up approach. The pair of clusters to merge is chosen to minimize the increase in within-cluster variance. Their code is under a BSD license so it can be used as
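The merge rule described above can be sketched without scikit-learn. For two clusters A and B, the increase in total within-cluster sum of squares caused by merging them is |A||B| / (|A| + |B|) times the squared distance between their means, and Ward's method merges the pair with the smallest increase. The code below is a naive illustration with hypothetical helper names, not scikit-learn's implementation, and it recomputes all pair costs at every step.

```python
from itertools import combinations


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist(mean(a), mean(b))


def ward_agglomerate(points, n_clusters):
    """Greedy bottom-up merging until n_clusters clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > max(n_clusters, 1):
        # Pick the pair whose merge increases the variance the least.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: ward_cost(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters


result = ward_agglomerate([(0,), (1,), (10,), (11,), (12,)], n_clusters=2)
```

The all-pairs search is what makes bottom-up methods expensive on large datasets, which is relevant to the top-down versus bottom-up discussion elsewhere in this thread.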

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Dmitriy Lyubimov
Sure. The more interesting problem here is choosing k at each level. Kernel methods seem to be most promising.
On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee wrote:
> No idea, never looked it up. Always just implemented it as doing k-means
> again on each cluster.
>
> FWIW standard k-means with euclide

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Hector Yee
No idea, never looked it up. Always just implemented it as doing k-means again on each cluster. FWIW, standard k-means with Euclidean distance has problems too with some dimensionality reduction methods. Swapping out the distance metric for negative dot product or cosine may help. Other more useful clust
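Swapping the distance metric mostly affects the assignment step, so it can be illustrated by making the metric a parameter. The function names below are hypothetical, and treating a zero vector as maximally distant under cosine is an assumption; the example point is chosen so that the metrics disagree about the nearest centroid.

```python
import math


def dot(a, b):
    """Dot product of two vectors."""
    return sum(x * y for x, y in zip(a, b))


def squared_euclidean(a, b):
    """Squared Euclidean distance, the usual k-means objective."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def negative_dot(a, b):
    """Larger dot product means closer, so negate it to get a 'distance'."""
    return -dot(a, b)


def cosine_distance(a, b):
    """One minus cosine similarity; ignores vector length."""
    norms = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    if norms == 0:
        return 1.0  # assumption: a zero vector is treated as maximally distant
    return 1.0 - dot(a, b) / norms


def nearest_centroid(point, centroids, distance):
    """Index of the centroid closest to point under the given distance."""
    return min(range(len(centroids)), key=lambda i: distance(point, centroids[i]))


centroids = [(1, 0), (8, 8)]
point = (10, 1)
by_euclidean = nearest_centroid(point, centroids, squared_euclidean)
by_cosine = nearest_centroid(point, centroids, cosine_distance)
by_neg_dot = nearest_centroid(point, centroids, negative_dot)
```

Here the point is assigned to the second centroid under Euclidean distance and negative dot product but to the first under cosine, since cosine compares direction only. With a non-Euclidean metric the arithmetic mean is no longer guaranteed to be the best center, so the update step may need to change as well (spherical k-means, for example, normalizes the centroids).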

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Dmitriy Lyubimov
Hector, could you share the references for hierarchical k-means? Thanks.
On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee wrote:
> I would say for big data applications the most useful would be hierarchical
> k-means with backtracking and the ability to support k nearest centroids.
>
>
> On Tue, Jul

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Sandy Ryza
Having a common framework for clustering makes sense to me. While we should be careful about what algorithms we include, having solid implementations of minibatch clustering and hierarchical clustering seems like a worthwhile goal, and we should reuse as much code and APIs as reasonable. On Tue,

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread RJ Nowling
Thanks, Hector! Your feedback is useful.
On Tuesday, July 8, 2014, Hector Yee wrote:
> I would say for big data applications the most useful would be hierarchical
> k-means with backtracking and the ability to support k nearest centroids.
>
>
> On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling
> wrot

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Hector Yee
I would say for big data applications the most useful would be hierarchical k-means with backtracking and the ability to support k nearest centroids.
On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling wrote:
> Hi all,
>
> MLlib currently has one clustering algorithm implementation, KMeans.
> It would