Hmmm, you may have to dumb things down for me here. I don't have much of a background in ML and I'm just piecing things together and learning as I go, so I don't really understand what you mean by "Coherence against an external standard? Or internal consistency/homogeneity?" or "One thought along these lines is to add L_1 regularization to the k-means algorithm." Is L_1 regularization the same as Manhattan distance?
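From the bit of reading I've done since (so please correct me if I'm off base): Manhattan distance is the L_1 *distance* between two points, whereas L_1 regularization adds a penalty on the size of the centroids themselves, which pushes most centroid coordinates to exactly zero. A toy numpy sketch of my current understanding (the function names and the lam parameter are my own invention, nothing from Mahout):

import numpy as np

def manhattan_distance(x, y):
    # L1 *distance*: sum of absolute coordinate differences between two points.
    return np.sum(np.abs(x - y))

def l1_regularized_centroid(points, lam):
    # L1 *regularization*: minimize  sum_i ||x_i - c||^2 + lam * ||c||_1
    # over the centroid c. The solution is the plain mean shrunk toward
    # zero by soft-thresholding, so small term weights drop to exactly 0.
    mean = points.mean(axis=0)
    threshold = lam / (2.0 * len(points))
    return np.sign(mean) * np.maximum(np.abs(mean) - threshold, 0.0)

If that's right, I can see why it would help interpretability: a centroid with only a handful of non-zero term weights is a lot easier to label than one with thousands.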
That aside, I'm outputting a file with the top terms and the text of 20 random documents that ended up in that cluster and eyeballing that. Not very high-tech or efficient, but it was the only way I knew to make a relevance judgment on a cluster topic. For example, if the majority of the samples are sport-related and 82.6% of the vector distances in my cluster are quite similar, I'm happy to call that cluster sport.

On 26 Feb 2013, at 22:00, Ted Dunning wrote:

> Chris,
>
> How are you doing your manual judgement step? Coherence against an
> external standard? Or internal consistency/homogeneity?
>
> Except for unusual situations it is to be expected that most clusterings
> are not particularly stable (i.e. will not reproduce the same clusters
> from run to run). As such, it is also unlikely that they will reproduce
> externally defined clusters any more than they will reproduce their own
> results.
>
> Likewise, there is no guarantee that the results will be easily
> interpretable. One thought along these lines is to add L_1 regularization
> to the k-means algorithm. Another is to look into what the Carrot project
> has done where, according to the developers, they have put some effort
> into making clusters that are easily summarizable. This might be similar
> in effect to the regularization step I just mentioned.
>
> On Tue, Feb 26, 2013 at 7:02 AM, Chris Harrington <[email protected]> wrote:
>
>> Well, what I'm trying to do is create clusters of topically similar
>> content via kmeans.
>>
>> Since I'm basing validity on topics there's a manual judgement step,
>> and that manual step is taking a prohibitive amount of time to check
>> many clustering runs, hence the desire for some stats to indicate
>> roughly how good the clusters are.
>>
>> So I want some stats from which, at a glance, I'll be able to tell
>> which clusters "should" be good and manually check them, instead of
>> having to check each and every one.
>>
>> I was thinking that a file with
>>
>> 1. the number of clusters,
>> 2. the avg distance of all points to every other point,
>> 3. the avg distance of the points furthest from the center to all
>> other points (the furthest 25% of all points within a cluster),
>> 4. the avg distance of the points closest to the center to all other
>> points (the closest 25% of all points within a cluster)
>>
>> would allow me to quickly see if I should even bother manually
>> checking the clustering output, the logic being that if 4, 3 and 2 are
>> similar in value then it's probably a decent cluster and I can manually
>> check it. Also, a comparison of 3 vs 2 would indicate whether the
>> cluster contains a number of distant outliers, and 4 vs 2 should show
>> roughly how dense a cluster is.
>>
>> This makes sense, right? Or am I barking up the wrong tree?
>>
>> On 25 Feb 2013, at 20:15, Ted Dunning wrote:
>>
>>> The best way to evaluate a cluster really depends on what your
>>> purpose is.
>>>
>>> My own purpose is typically to use the clustering as a description
>>> of the probability distribution of data.
>>>
>>> For that purpose, the best evaluation is distance to centroids for
>>> held-out data. The use of held-out data is critical here since
>>> otherwise you could just put a single cluster at every data point and
>>> get zero distance for the original data. For held-out data, of course,
>>> the story would be different.
>>>
>>> This view of things is very good from the standpoint of machine
>>> learning and data compression, but might be less useful for certain
>>> purposes that have to do with explanation of data in human-readable
>>> form. My experience is that it is common for a clustering algorithm
>>> to be very good as a probability distribution description but quite
>>> bad for human inspection.
>>>
>>> My own tendency would be to adapt the outline you gave to work on
>>> held-out data instead of the original training data.
>>>
>>> On Mon, Feb 25, 2013 at 4:27 AM, Chris Harrington <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I want to find all the vectors within a cluster and then find the
>>>> distance between them and every other vector within the cluster, in
>>>> hopes this will give me a good idea of how similar each vector within
>>>> a cluster is, as well as identify outlier vectors.
>>>>
>>>> So there are 2 things I want to ask.
>>>>
>>>> 1. Is this a sensible approach to evaluating the cluster quality?
>>>>
>>>> 2. Is the clusteredPoints/parts-m-00000 file the correct file to get
>>>> this info from?
>>>>
>>>> Thanks,
>>>> Chris
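PS: to make the stats idea (numbers 2-4 above) concrete for myself, I wrote a rough numpy sketch of how I'd compute them for a single cluster. It's only a sketch: it assumes I already have the cluster's vectors as a dense (n, d) array, which I know isn't Mahout's actual output format.

import numpy as np

def cluster_stats(points):
    # points: (n, d) array of all vectors assigned to one cluster.
    n = len(points)
    center = points.mean(axis=0)

    # Full pairwise distance matrix; fine for eyeballing one cluster,
    # but O(n^2) memory, so not for huge clusters.
    diffs = points[:, None, :] - points[None, :, :]
    pairwise = np.sqrt((diffs ** 2).sum(axis=-1))

    # Stat 2: avg distance of all points to every other point
    # (average over the n*(n-1) off-diagonal entries).
    avg_all = pairwise.sum() / (n * (n - 1))

    # Rank points by distance from the center and take the 25% tails.
    to_center = np.linalg.norm(points - center, axis=1)
    order = np.argsort(to_center)
    quarter = max(1, n // 4)
    closest, furthest = order[:quarter], order[-quarter:]

    # Stats 3 and 4: avg distance from the furthest/closest 25% of
    # points to all other points.
    avg_furthest = pairwise[furthest].sum() / (len(furthest) * (n - 1))
    avg_closest = pairwise[closest].sum() / (len(closest) * (n - 1))
    return avg_all, avg_furthest, avg_closest

The reading would be as I described: if all three numbers are close, the cluster is probably tight and worth a manual look; if avg_furthest is much larger than avg_all, the cluster probably contains distant outliers.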

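And following Ted's held-out suggestion, I suppose the matching overall score for a run would be something like the below, where centroids come from the training run and heldout is data that run never saw. Again, this is just my sketch of what I think he means, not anything from Mahout:

import numpy as np

def heldout_score(centroids, heldout):
    # Average distance from each held-out point to its nearest centroid.
    # Lower is better. Scoring held-out data (rather than training data)
    # stops "one cluster per training point" from scoring a perfect zero.
    dists = np.linalg.norm(heldout[:, None, :] - centroids[None, :, :], axis=-1)
    return dists.min(axis=1).mean()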