Could you perhaps load the results back into Pig and join the unnamed data with a relation which contains the names?
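
To make the join idea concrete, here is a rough sketch in Python rather than Pig, since the same logic is easy to check on a small text dump first. It assumes the dump lines look like the ones quoted below, and that you have a separate table of names with their sparse vectors. That table, the function names, and the item names in the usage note are hypothetical. Because the clustered output carries no ID, the sketch joins on the vector contents themselves, so two items with identical vectors cannot be told apart and both names are returned.

```python
import re
from collections import defaultdict

# Matches dump lines such as "1.0: [713:1.000, 1022:1.000] is in 1205002".
LINE_RE = re.compile(r"^\s*([\d.]+):\s*\[(.*?)\]\s+is in\s+(\d+)\s*$")


def vector_key(pairs):
    """Canonical, hashable form of a sparse vector: sorted (index, value) pairs."""
    return tuple(sorted((int(i), float(v)) for i, v in pairs))


def parse_dump_line(line):
    """Return (vector_key, cluster_id) for one dump line, or None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    body = m.group(2).strip()
    pairs = [p.split(":") for p in body.split(",")] if body else []
    return vector_key(pairs), int(m.group(3))


def join_names(dump_lines, named_vectors):
    """Join clustered points to names on vector content.

    named_vectors is a hypothetical input: dict of name -> {index: value}.
    Returns a dict of cluster_id -> sorted list of matching names.
    """
    # Index the named side by vector content (the join key).
    names_by_key = defaultdict(list)
    for name, vec in named_vectors.items():
        names_by_key[vector_key(vec.items())].append(name)

    clusters = defaultdict(set)
    for line in dump_lines:
        parsed = parse_dump_line(line)
        if parsed is None:
            continue  # skip lines that are not clustered points
        key, cluster_id = parsed
        clusters[cluster_id].update(names_by_key.get(key, []))
    return {cid: sorted(names) for cid, names in clusters.items()}
```

For example, calling `join_names(dump_lines, {"item_a": {9684: 1.0}})` on the dump quoted below would list `item_a` under cluster 1205002. In Pig the equivalent would be a JOIN between the clustered points and the named relation on the same vector key.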

On Wed, Mar 6, 2013 at 1:58 AM, Colum Foley <[email protected]> wrote:
> Hi,
>
> I am using the Elephant-Bird package to write feature vectors to
> sequence files for use in clustering with Mahout. The problem I have
> is that when I run K-Means clustering, I cannot decipher which item
> has been assigned to which cluster. I asked a question on this forum
> yesterday and the suggestion was to use NamedVectors to store the name
> of the vector so that I can read this back in when reviewing the
> results.
>
> When I print out the contents of the clusteredPoints/part-m-00000
> file, I see the following:
>
> 1.0: [9684:1.000] is in 1205002
> 1.0: [713:1.000, 1022:1.000, 5514:1.000, 9098:1.000] is in 1205002
> 1.0: [5414:1.000] is in 1205002
> 1.0: [5158:2.000] is in 1205002
> 1.0: [424:1.000] is in 1205002
> 1.0: [7460:3.000] is in 1205002
>
> But I need a way to resolve the vector names.
>
> My question is: can Elephant-Bird store vectors in NamedVector format,
> where the key is used as the name?
>
> If not can anyone suggest another way that I can resolve the names of
> items that have been assigned to clusters in the absence of
> NamedVectors?
>
>
> Many thanks in advance,
> Colum
