If you run kmeans with the "-cl" option (or set the runClustering option to
true if you're calling the driver from Java code), you'll get a sequence
file in the directory $KMEANS_OUT/clusteredPoints. Each record has an
IntWritable key identifying the cluster and a WeightedVectorWritable value
holding a pdf weight (always 1.0 in kmeans) and your original vector. If you
want to recover the name/id of the original input, you need to use
NamedVector as input to kmeans; the name is preserved in the vector.
Here's one abbreviated line of output. My vector with name "0" was
classified into cluster 4398:
Key: 4398: Value: 1.0: 0 = [0.007, 0.002, -0.016, -0.003,...]
Clusterdump might include this information as well; I can't remember. Either
way, you'd still need to run kmeans with the -cl option.
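To get a "cluster_id = {names}" listing out of a text dump of
clusteredPoints, something like the following Python sketch should work. It
assumes every dumped line has the same shape as the sample line above; the
regex, the function name group_by_cluster, and the two extra sample lines
are made up for illustration and are not part of Mahout.

```python
import re
from collections import defaultdict

# Assumed line shape, taken from the single sample line in this message:
#   Key: <cluster id>: Value: <weight>: <name> = [<vector>]
LINE_RE = re.compile(r"^Key: (\d+): Value: ([^:]+): (.*?) = \[")


def group_by_cluster(lines):
    """Map each cluster id to the list of vector names assigned to it."""
    clusters = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a named-vector record
        cluster_id, _weight, name = match.groups()
        clusters[int(cluster_id)].append(name)
    return dict(clusters)


sample = [
    "Key: 4398: Value: 1.0: 0 = [0.007, 0.002, -0.016, -0.003,...]",
    "Key: 4398: Value: 1.0: user89dc = [0.1, 0.2]",  # hypothetical line
    "Key: 12: Value: 1.0: user19c = [0.3, 0.4]",     # hypothetical line
]

for cluster_id, names in sorted(group_by_cluster(sample).items()):
    print(f"cluster_{cluster_id} = {names}")
```

Lines that don't match the pattern (for example, vectors that were not
NamedVectors and so carry no name) are skipped rather than guessed at.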
On Tue, Mar 5, 2013 at 1:33 PM, Colum Foley <[email protected]> wrote:
> Hi,
>
> I have a simple enough question: having run K-Means clustering
> (generated the clustered points, and clusters-x, clusters-x-final
> directories), how do you identify which items were clustered together?
> Apologies if this is trivial but I could not see an obvious answer in
> the documentation.
>
> Clusterdump seems to be the tool to use, but when I have run it I only
> see cluster ids, centroid values, radius etc., and it is not obvious to
> me how to resolve individual item names. I am looking for something of
> the following form:
>
> cluster_id = (keys)*
>
> for example:
>
> cluster_1 = {"user104x","user89dc","user22da".}
> cluster_2 = {"user19c","user11c",....}
>
>
> Thanks,
> Colum
>