Hello,
I have some questions about large-scale clustering. I would like to arrive at a methodology for determining an appropriate value of K for K-means clustering (at least for my scenario, if not in general). More details follow below (apologies for the verbosity, but I wanted to provide as much context as I could). By 'large' I mean:
* large in the number of points to cluster
* large in the dimensionality of the vector/feature-space
* large in the number of clusters

It would be doubly great if someone here has had to perform clustering in a similar setting (similar in terms of the number of datapoints, the nature/type of data being clustered, the number of clusters being formed, etc.) and is willing to share their war story.

:: The context ::

I'm trying to cluster ~18 million sparse term-frequency vectors with Mahout 0.9. These vectors were not derived from documents in a text corpus; they were generated from each data point being associated with a finite number of discrete symbols. The overall symbol set is quite large, hence the sparsity of the vector representation. I assume I don't need idf-weighting, because none of these symbols can occur more than once per datapoint/vector. Additionally, each vector has a fixed number of non-zero dimensions. I'm using the seq2sparse program to generate these tf vectors from the raw symbol-based inputs.

For my case, the notion of similarity/distance is based on the overlap between the symbol sets of the two vectors under consideration. Hence, I'm using the Tanimoto distance measure (and therefore the SequentialAccessSparseVector implementation). Also, my Hadoop cluster has 6 worker nodes (in case the Hadoop infrastructure should influence the methodology for K-selection).

:: My questions ::

My main question here is:
* How do I go about coming up with a value of K, given the sizes and nature of my dataset and the feature-space?

I'm aware that this Wikipedia page lists a number of approaches for K-selection, along with the motivation behind each: en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set. But before I go down the path of implementing one of those approaches, I'd love to hear from users here who have had to do something similar:
* How did you select K for your large-scale clustering scenario?
* Was that K so large (w.r.t. the number of available Hadoop worker nodes) that the native MapReduce implementation of Mahout K-means could not be used as-is due to running-time issues? If so, how did you come up with an alternative/augmenting approach to MapReduce-based K-means?

Further, I have an idea of my own for coming up with a ball-park value of K (and the K initial centroids). It would be great to get feedback on whether it's worth pursuing (or whether it's better to go with one of the approaches on the Wikipedia page above, or with an altogether different approach that you used). The idea is this >>

Since the notion of distance/similarity for my data is based on the overlap of identically-sized symbol sets, I can compute co-occurrence frequencies of symbol n-sets (how many times a particular set of n symbols appears across all vectors in the dataset). I then identify the symbol n-sets with the highest co-occurrence frequencies (based on a cutoff/threshold on the percentile of these frequencies). This restricted set of symbol n-sets then gives me an estimate of what K _could_ be, and also some indication of how to choose those K initial centroids (I'm thinking I will just pick K datapoints/vectors at random such that they contain these high-frequency symbol n-sets).
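To make the counting step concrete, here is a rough, in-memory sketch of what I have in mind (plain Java; the symbol IDs, n = 2, and the 75th-percentile cutoff are all just placeholders for illustration, and the real version would obviously have to be a pass over the seq2sparse output rather than something that fits in memory):

import java.util.*;
import java.util.stream.Collectors;

public class NSetKEstimate {
    public static void main(String[] args) {
        // Toy datapoints: each one is the set of symbol IDs that have a
        // non-zero tf entry in that datapoint's vector (made-up IDs).
        List<Set<Integer>> datapoints = Arrays.asList(
                new HashSet<Integer>(Arrays.asList(1, 2, 3, 4)),
                new HashSet<Integer>(Arrays.asList(1, 2, 5, 6)),
                new HashSet<Integer>(Arrays.asList(1, 2, 3, 7)),
                new HashSet<Integer>(Arrays.asList(4, 5, 6, 7)));

        double percentileCutoff = 0.75; // keep n-sets at/above this frequency percentile

        // Count how often each symbol n-set (pairs here, i.e. n = 2)
        // co-occurs within a datapoint, across all datapoints.
        Map<Set<Integer>, Integer> freq = new HashMap<>();
        for (Set<Integer> dp : datapoints) {
            List<Integer> symbols = new ArrayList<>(dp);
            for (int i = 0; i < symbols.size(); i++) {
                for (int j = i + 1; j < symbols.size(); j++) {
                    Set<Integer> pair = new HashSet<>(
                            Arrays.asList(symbols.get(i), symbols.get(j)));
                    freq.merge(pair, 1, Integer::sum);
                }
            }
        }

        // Frequency value at the chosen percentile.
        List<Integer> counts = freq.values().stream().sorted()
                .collect(Collectors.toList());
        int cutoff = counts.get((int) Math.floor(percentileCutoff * (counts.size() - 1)));

        // The surviving high-frequency n-sets: their count is the ball-park K,
        // and datapoints containing them are candidates for initial centroids.
        List<Set<Integer>> topNSets = freq.entrySet().stream()
                .filter(e -> e.getValue() >= cutoff)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());

        System.out.println("Ball-park K = " + topNSets.size());
        System.out.println("High-frequency n-sets: " + topNSets);
    }
}

(In MapReduce terms I imagine this as a word-count-style job: emit every n-set per datapoint in the map phase, sum the counts in the reduce phase, then apply the percentile cutoff.)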
Is this overkill/nonsensical for my purpose?

As a final note, I do realize that using K-means with sparse vectors may not be a good idea, given that the centroid vectors themselves are going to be a lot denser than the datapoint vectors. However, I'm tempted to test it at my dataset's size, since I'm getting decent clustering quality on much smaller subsets of the overall corpus. (Yes, the K-selection issue exists there as well, but those subsets have been small enough to experiment with a range of K values and reach convergence in a relatively small number of quick-running iterations. Attempting the same with the overall dataset is a challenge, because each iteration takes a really long time to run, even with K much smaller than 18 million; hence I'm trying to avoid experimenting with a range of K values and hoping to use a more principled/tried-and-tested approach.)

Thanks very much in advance!
- Harsh
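P.S. In case my description of the distance is unclear: each vector has the same fixed number of non-zero entries, say m, and (assuming each non-zero tf ends up as 1) my understanding is that the Tanimoto distance between two vectors sharing c symbols reduces to 1 - c/(2m - c); e.g. with m = 10 and c = 4 shared symbols, the distance would be 1 - 4/16 = 0.75. Please correct me if I've got that wrong.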
