Hello,
I have some questions about large-scale clustering. I would like to arrive at a methodology for determining an appropriate value of K for K-means clustering (at least for my scenario, if not in general). More details follow below (apologies for the verbosity, but I wanted to provide as much context as I could). By 'large' I mean:
* large in the number of points to cluster
* large in the dimensionality of the vector/feature-space
* large in the number of clusters

It would be doubly great if someone here has had to perform clustering in a similar setting (similar in terms of the number of datapoints, the nature/type of data being clustered, the number of clusters being formed, etc.) and is willing to share their war story.

:: The context ::

I'm trying to cluster ~18 million sparse term-frequency vectors with Mahout 0.9. These vectors were not derived from documents in a text corpus; they were generated from each data point being associated with a finite number of discrete symbols. The overall symbol set is quite large, hence the sparsity of the vector representation. I assume I don't need idf-weighting, because none of these symbols can occur more than once per datapoint/vector. Additionally, each vector has a fixed number of non-zero dimensions. I'm using the seq2sparse program to generate these tf vectors from the raw symbol-based inputs.

For my case, the notion of similarity/distance is based on the overlap between the symbol sets of the two vectors under consideration. Hence, I'm using the Tanimoto distance measure (and therefore the SequentialAccessSparseVector implementation). Also, my Hadoop cluster has 6 worker nodes (in case the Hadoop infrastructure should influence the methodology for K-selection).

:: My questions ::

My main question here is:
* How do I go about coming up with a value of K, given the sizes and nature of my dataset and the feature-space?

I'm aware that this Wikipedia page lists a number of approaches for K-selection, along with the motivation behind each: en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set. But before I go down the path of implementing one of those approaches, I'd love to hear from users here who have had to do something similar:
* How did you select K for your large-scale clustering scenario?
* Was that K so large (w.r.t. the number of available Hadoop worker nodes) that the native MapReduce implementation of Mahout K-means could not be used as-is due to running-time issues? If so, how did you come up with an alternative/augmenting approach to MapReduce-based K-means?

Further, I have an idea of my own for coming up with a ball-park value of K (and the K initial centroids). It would be great to get feedback on whether it's worth pursuing (or whether it's better to go with one of the approaches on the Wikipedia page above, or with an altogether different approach that you used). The idea is this >>

Since the notion of distance/similarity for my data is based on the overlap of identically-sized symbol sets, I can compute co-occurrence frequencies of symbol n-sets (how many times a particular set of n symbols appears across all vectors in the dataset). I then identify the symbol n-sets with the highest co-occurrence frequencies (based on a cutoff/threshold on the percentile of these frequencies). This restricted set of symbol n-sets then gives me an estimate of what K _could_ be, and also some indication of how to choose those K initial centroids (I'm thinking I will just pick K datapoints/vectors at random such that they contain these high-frequency symbol n-sets).
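To make the counting step concrete, here is a rough, in-memory sketch of what I have in mind (plain Java; the symbol IDs, n = 2, and the 75th-percentile cutoff are all just placeholders for illustration, and the real version would obviously have to be a pass over the seq2sparse output rather than something that fits in memory):

import java.util.*;
import java.util.stream.Collectors;

public class NSetKEstimate {
    public static void main(String[] args) {
        // Toy datapoints: each one is the set of symbol IDs that have a
        // non-zero tf entry in that datapoint's vector (made-up IDs).
        List<Set<Integer>> datapoints = Arrays.asList(
                new HashSet<Integer>(Arrays.asList(1, 2, 3, 4)),
                new HashSet<Integer>(Arrays.asList(1, 2, 5, 6)),
                new HashSet<Integer>(Arrays.asList(1, 2, 3, 7)),
                new HashSet<Integer>(Arrays.asList(4, 5, 6, 7)));

        double percentileCutoff = 0.75; // keep n-sets at/above this frequency percentile

        // Count how often each symbol n-set (pairs here, i.e. n = 2)
        // co-occurs within a datapoint, across all datapoints.
        Map<Set<Integer>, Integer> freq = new HashMap<>();
        for (Set<Integer> dp : datapoints) {
            List<Integer> symbols = new ArrayList<>(dp);
            for (int i = 0; i < symbols.size(); i++) {
                for (int j = i + 1; j < symbols.size(); j++) {
                    Set<Integer> pair = new HashSet<>(
                            Arrays.asList(symbols.get(i), symbols.get(j)));
                    freq.merge(pair, 1, Integer::sum);
                }
            }
        }

        // Frequency value at the chosen percentile.
        List<Integer> counts = freq.values().stream().sorted()
                .collect(Collectors.toList());
        int cutoff = counts.get((int) Math.floor(percentileCutoff * (counts.size() - 1)));

        // The surviving high-frequency n-sets: their count is the ball-park K,
        // and datapoints containing them are candidates for initial centroids.
        List<Set<Integer>> topNSets = freq.entrySet().stream()
                .filter(e -> e.getValue() >= cutoff)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());

        System.out.println("Ball-park K = " + topNSets.size());
        System.out.println("High-frequency n-sets: " + topNSets);
    }
}

(In MapReduce terms I imagine this as a word-count-style job: emit every n-set per datapoint in the map phase, sum the counts in the reduce phase, then apply the percentile cutoff.)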
Is this overkill/nonsensical for my purpose?

As a final note, I do realize that using K-means with sparse vectors may not be a good idea, given that the centroid vectors themselves are going to be a lot denser than the datapoint vectors. However, I'm tempted to test it at my dataset's size, since I'm getting decent clustering quality on much smaller subsets of the overall corpus. (Yes, the K-selection issue exists there as well, but those subsets have been small enough to experiment with a range of K values and reach convergence in a relatively small number of quick-running iterations. Attempting the same with the overall dataset is a challenge, because each iteration takes a really long time to run, even with K much smaller than 18 million; hence I'm trying to avoid experimenting with a range of K values and hoping to use a more principled/tried-and-tested approach.)

Thanks very much in advance!
- Harsh
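P.S. In case my description of the distance is unclear: each vector has the same fixed number of non-zero entries, say m, and (assuming each non-zero tf ends up as 1) my understanding is that the Tanimoto distance between two vectors sharing c symbols reduces to 1 - c/(2m - c); e.g. with m = 10 and c = 4 shared symbols, the distance would be 1 - 4/16 = 0.75. Please correct me if I've got that wrong.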
