Hey RJ,

Sorry for the delay. I'd be happy to take a look at this if you can post the code!
I think splitting the largest cluster in each round is fairly common, but ideally it would be an option to do it one way or the other.

-- Jeremy

---------------------
jeremy freeman, phd
neuroscientist
@thefreemanlab

On Aug 12, 2014, at 2:20 PM, RJ Nowling <rnowl...@gmail.com> wrote:

> Hi all,
>
> I wanted to follow up.
>
> I have a prototype for an optimized version of hierarchical k-means. I
> wanted to get some feedback on my approach.
>
> Jeremy's implementation splits the largest cluster in each round. Is it
> better to do it that way or to split each cluster in half?
>
> Are there any open-source examples that are being widely used in
> production?
>
> Thanks!
>
> On Fri, Jul 18, 2014 at 8:05 AM, RJ Nowling <rnowl...@gmail.com> wrote:
>
>> Nice to meet you, Jeremy!
>>
>> This is great! Hierarchical clustering was next on my list --
>> currently trying to get my PR for MiniBatch KMeans accepted.
>>
>> If it's cool with you, I'll try converting your code to fit in with
>> the existing MLlib code as you suggest. I also need to review the
>> Decision Tree code (as suggested above) to see how much of that can be
>> reused.
>>
>> Maybe I can ask you to do a code review for me when I'm done?
>>
>> On Thu, Jul 17, 2014 at 8:31 PM, Jeremy Freeman
>> <freeman.jer...@gmail.com> wrote:
>>> Hi all,
>>>
>>> Cool discussion! I agree that a more standardized API for clustering,
>>> and easy access to underlying routines, would be useful (we've also
>>> been discussing this when trying to develop streaming clustering
>>> algorithms, similar to https://github.com/apache/spark/pull/1361)
>>>
>>> For divisive, hierarchical clustering I implemented something a while
>>> back; here's a gist:
>>>
>>> https://gist.github.com/freeman-lab/5947e7c53b368fe90371
>>>
>>> It does bisecting k-means clustering (with k=2), with a recursive
>>> class for keeping track of the tree.
>>> I also found this much better than agglomerative methods (for the
>>> reasons Hector points out).
>>>
>>> This needs to be cleaned up, and can surely be optimized (esp. by
>>> replacing the core KMeans step with existing MLlib code), but I can
>>> say I was running it successfully on quite large data sets.
>>>
>>> RJ, depending on where you are in your progress, I'd be happy to help
>>> work on this piece and/or have you use this as a jumping-off point,
>>> if useful.
>>>
>>> -- Jeremy
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/Contributing-to-MLlib-Proposal-for-Clustering-Algorithms-tp7212p7398.html
>>> Sent from the Apache Spark Developers List mailing list archive at
>>> Nabble.com.
>>
>> --
>> em rnowl...@gmail.com
>> c 954.496.2314
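For anyone following the thread, here's a rough sketch of the "split the largest cluster each round" strategy Jeremy describes. This is plain Python with no dependencies, not the Scala/MLlib code from the gist, and the function names (`kmeans2`, `bisecting_kmeans`) are made up for illustration:

```python
import random


def kmeans2(points, iters=20, seed=0):
    """Plain k-means with k=2: partition points into two sub-clusters."""
    rng = random.Random(seed)
    centers = rng.sample(points, 2)
    groups = [[], []]
    for _ in range(iters):
        groups = [[], []]
        # Assign each point to its nearest center (squared Euclidean).
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            groups[d.index(min(d))].append(p)
        # Recompute each center as the mean of its assigned points.
        for i, g in enumerate(groups):
            if g:
                centers[i] = tuple(sum(col) / len(g) for col in zip(*g))
    return [g for g in groups if g]


def bisecting_kmeans(points, k):
    """Repeatedly bisect the largest remaining cluster until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        largest = max(clusters, key=len)
        if len(largest) < 2:
            break  # nothing left to split
        clusters.remove(largest)
        halves = kmeans2(largest)
        if len(halves) < 2:  # all points coincided; splitting failed
            clusters.append(largest)
            break
        clusters.extend(halves)
    return clusters
```

The alternative RJ asks about (splitting every cluster in half at each level) would replace the `max(...)` selection with a loop over all current clusters, roughly doubling the cluster count per round instead of adding one.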