I'm still a bit of a beginner with Mahout, but as far as I know,
NamedVectors are the only way to preserve that sort of data through the
clustering algorithm. They're easy to construct if you need to do it
yourself. Just pass your current vector object and whatever name you want
into the constructor, NamedVector(Vector delegate, String name).
From your original email, you're looking to get something along the lines
of: cluster_1 = {"user104x","user89dc","user22da"...}
Is that name/id data coming out of Elephant-Bird somewhere along with the
vectors? As the key of a seqfile, with the vector as the value maybe? I
haven't used Elephant-Bird before. If that's the case, it would be a simple
map-only Hadoop job to convert all your vectors to NamedVectors. The map
method would be something along the lines of:
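// Wrap each incoming vector in a NamedVector, using the record key as its name.
public void map(Text key, VectorWritable value, Context context)
    throws IOException, InterruptedException {
  value.set(new NamedVector(value.get(), key.toString()));
  context.write(key, value);
}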
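
Once the names survive clustering, getting to the cluster_1 = {...} form is
just a grouping step. Here's a rough sketch of that step in plain Java,
assuming you've dumped clusteredPoints to text and the lines look like the
"Key: 4398: Value: 1.0: 0 = [...]" output I pasted earlier. ClusterGrouper
and groupByCluster are names I made up for illustration, not Mahout classes,
and the sample lines in main are invented. If your dump format differs, the
regex will need adjusting.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Mahout: groups vector names by cluster id
// from text lines shaped like "Key: 4398: Value: 1.0: 0 = [0.007, ...]".
final class ClusterGrouper {

    // Assumed line shape: Key: <clusterId>: Value: <weight>: <name> = [<vector>]
    private static final Pattern LINE =
        Pattern.compile("^Key: (\\d+): Value: [^:]+: (.+?) = \\[");

    static Map<String, List<String>> groupByCluster(List<String> lines) {
        // TreeMap keeps cluster ids in a stable (lexicographic) order.
        Map<String, List<String>> clusters = new TreeMap<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line.trim());
            if (!m.find()) {
                continue; // skip anything without a "<name> = [" part, e.g. unnamed vectors
            }
            clusters.computeIfAbsent(m.group(1), k -> new ArrayList<>()).add(m.group(2));
        }
        return clusters;
    }

    public static void main(String[] args) {
        // Invented sample lines, only to show the shape of the result.
        List<String> sample = Arrays.asList(
            "Key: 4398: Value: 1.0: user104x = [0.007, 0.002]",
            "Key: 4398: Value: 1.0: user89dc = [0.010, 0.001]",
            "Key: 17: Value: 1.0: user19c = [0.300, 0.200]");
        groupByCluster(sample).forEach(
            (id, names) -> System.out.println("cluster_" + id + " = " + names));
    }
}
```

Lines for vectors that were never wrapped in a NamedVector have no
"name = " part, so they are skipped rather than guessed at.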
On Wed, Mar 6, 2013 at 5:02 AM, Colum Foley <[email protected]> wrote:
> Hi Matt,
>
> Thanks for the reply. I am using the Elephant-Bird package in order to
> generate my vectors, so I am not sure if I can specify to use
> NamedVectors. I have asked this in a separate thread.
>
> In the absence of NamedVectors, is there another way I can resolve the
> name of the items that you know of?
>
> Currently when I print out the contents of the clusteredPoints file I
> see the following output:
>
> 1.0: [3887:3.000, 9441:1.000] is in 1205002
> 1.0: [6773:1.000] is in 1205002
> 1.0: [8987:2.000] is in 1205002
> 1.0: [2956:1.000] is in 1205002
>
>
> Thanks again,
> Colum
>
> On Tue, Mar 5, 2013 at 8:57 PM, Matt Molek <[email protected]> wrote:
> > If you run kmeans with the "-cl" option (or set the runClustering option to
> > true if you're calling the driver from Java code), you'll get a sequence
> > file in the directory $KMEANS_OUT/clusteredPoints with an IntWritable key
> > identifying the cluster, and a WeightedVectorWritable with a pdf weight
> > (always 1.0 in kmeans) and your original vector. If you want to recover the
> > name/id/whatever of the original input, you need to use NamedVector as
> > input to kmeans. The name will be preserved in the vector.
> >
> > Here's one abbreviated line of output. My vector with name "0" was
> > classified into cluster 4398:
> > Key: 4398: Value: 1.0: 0 = [0.007, 0.002, -0.016, -0.003,...]
> >
> > Clusterdump might include this information as well. I can't remember. You'd
> > still need to run kmeans with the -cl option.
> >
> >
> > On Tue, Mar 5, 2013 at 1:33 PM, Colum Foley <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> I have a simple enough question: having run K-Means clustering
> >> (generated the clustered points, and clusters-x, clusters-x-final
> >> directories), how do you identify which items were clustered together?
> >> Apologies if this is trivial but I could not see an obvious answer in
> >> the documentation.
> >>
> >> Clusterdump seems to be the tool to use, but when I have run it I only
> >> see cluster ids, centroid values, radius, etc., and it is not obvious to
> >> me how to resolve individual item names. I am looking for something of
> >> the following form:
> >>
> >> cluster_id = (keys)*
> >>
> >> for example:
> >>
> >> cluster_1 = {"user104x","user89dc","user22da"...}
> >> cluster_2 = {"user19c","user11c",....}
> >>
> >>
> >> Thanks,
> >> Colum
> >>
>