Hi Matt,

Thanks for the reply. I am using the Elephant-Bird package to generate my vectors, so I am not sure whether I can tell it to use NamedVectors. I have asked about this in a separate thread.

In the absence of NamedVectors, is there another way you know of to resolve the names of the items? Currently, when I print out the contents of the clusteredPoints file, I see the following output:

1.0: [3887:3.000, 9441:1.000] is in 1205002
1.0: [6773:1.000] is in 1205002
1.0: [8987:2.000] is in 1205002
1.0: [2956:1.000] is in 1205002

Thanks again,
Colum

On Tue, Mar 5, 2013 at 8:57 PM, Matt Molek <[email protected]> wrote:

> If you run kmeans with the "-cl" option (or set the runClustering option to
> true if you're calling the driver from Java code), you'll get a sequence
> file in the directory $KMEANS_OUT/clusteredPoints with an IntWritable key
> identifying the cluster, and a WeightedVectorWritable with a pdf weight
> (always 1.0 in kmeans) and your original vector. If you want to recover the
> name/id/whatever of the original input, you need to use NamedVector as
> input to kmeans. The name will be preserved in the vector.
>
> Here's one abbreviated line of output. My vector with name "0" was
> classified into cluster 4398:
> Key: 4398: Value: 1.0: 0 = [0.007, 0.002, -0.016, -0.003,...]
>
> Clusterdump might include this information as well. I can't remember. You'd
> still need to run kmeans with the -cl option.
>
>
> On Tue, Mar 5, 2013 at 1:33 PM, Colum Foley <[email protected]> wrote:
>
>> Hi,
>>
>> I have a simple enough question: having run K-Means clustering
>> (generated the clustered points, and clusters-x, clusters-x-final
>> directories), how do you identify which items were clustered together?
>> Apologies if this is trivial but I could not see an obvious answer in
>> the documentation.
>>
>> Clusterdump seems to be the tool to use, but when I run it I only
>> see cluster ids, centroid values, radius, etc., and it is not obvious to
>> me how to resolve individual item names. I am looking for something of
>> the following form:
>>
>> cluster_id = (keys)*
>>
>> for example:
>>
>> cluster_1 = {"user104x","user89dc","user22da".}
>> cluster_2 = {"user19c","user11c",....}
>>
>>
>> Thanks,
>> Colum
>>
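
For reference, here is a rough sketch of how the printed clusteredPoints lines quoted at the top of this message could be grouped into the "cluster_id = (keys)*" form. It only parses text that has already been dumped in the "<weight>: <vector> is in <cluster id>" layout shown there; it does not read the sequence file itself, and it does not recover item names, since the vectors here are unnamed. The function name group_by_cluster and the regular expression are my own, not part of Mahout.

```python
import re
from collections import defaultdict

# Assumed line layout, copied from the dump in this thread:
#   1.0: [3887:3.000, 9441:1.000] is in 1205002
# i.e. "<weight>: <vector text> is in <cluster id>".
LINE_RE = re.compile(
    r"^\s*(?P<weight>[\d.]+):\s*(?P<vector>.*\])\s+is in\s+(?P<cluster>\d+)\s*$"
)


def group_by_cluster(lines):
    """Group the vector text of each dumped point under its cluster id."""
    clusters = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            # Skip anything that does not look like a clustered point.
            continue
        clusters[match.group("cluster")].append(match.group("vector"))
    return dict(clusters)


# The four lines printed from the clusteredPoints file in this thread.
sample = [
    "1.0: [3887:3.000, 9441:1.000] is in 1205002",
    "1.0: [6773:1.000] is in 1205002",
    "1.0: [8987:2.000] is in 1205002",
    "1.0: [2956:1.000] is in 1205002",
]

for cluster_id, vectors in group_by_cluster(sample).items():
    print(f"cluster_{cluster_id} = {vectors}")
```

Without NamedVector input, each entry is just the vector text, so the original item would still have to be matched back to its vector by some other means.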
