Have a look at Mahout (a Lucene sister project), which can create SparseVectors from Lucene term vectors, where each entry holds the term id and the "weight" of the term. It should be trivial to replicate what Mahout does to produce LibSVM, ARFF, or whatever other format you need.
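
For illustration only, here is a minimal sketch of the id-mapping step in plain Java. It is not Mahout's implementation, and TermIdMapper is a hypothetical helper that exists in neither Lucene nor Mahout. It assumes you already have the parallel arrays that TermFreqVector.getTerms() and TermFreqVector.getTermFrequencies() return in Lucene 2.9, and it uses an ordinary java.util.Map for the term-to-id dictionary, which is the approach the question hoped to avoid. The leading class label that a full LIBSVM line carries is left out, and ids are written in ascending order because, as far as I know, LIBSVM expects that.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper (not part of Lucene or Mahout): maps terms to integer ids. */
public class TermIdMapper {
    // term text -> integer id, assigned in order of first appearance, starting at 1
    private final Map<String, Integer> ids = new HashMap<String, Integer>();

    /** Returns the id of a term, assigning the next free id if the term is new. */
    public int idFor(String term) {
        Integer id = ids.get(term);
        if (id == null) {
            id = ids.size() + 1;
            ids.put(term, id);
        }
        return id;
    }

    /** Formats parallel term/frequency arrays as "id:freq id:freq ...", ids ascending. */
    public String toLibSvmLine(String[] terms, int[] freqs) {
        if (terms.length != freqs.length) {
            throw new IllegalArgumentException("terms and freqs must have the same length");
        }
        // TreeMap keeps the entries sorted by id
        TreeMap<Integer, Integer> sorted = new TreeMap<Integer, Integer>();
        for (int i = 0; i < terms.length; i++) {
            sorted.put(idFor(terms[i]), freqs[i]);
        }
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Integer, Integer> e : sorted.entrySet()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(e.getKey()).append(':').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        TermIdMapper mapper = new TermIdMapper();
        // The example from the question: {TITLE: one/3, two/4}
        System.out.println(mapper.toLibSvmLine(new String[] {"one", "two"}, new int[] {3, 4}));
    }
}
```

The same mapper instance has to be reused for every document so that a term keeps the same id across the whole index; since the index is constant, the dictionary only needs to be built once.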

On Jan 18, 2010, at 9:07 AM, Solt, Illés wrote:

> Hi,
>
> I am looking for a way to represent term frequency data in a vector space,
> thus using unique integer identifiers instead of strings. This would allow
> feeding tools like LIBSVM from a Lucene index.
>
> A small example: TermFreqVector.toString() produces "{TITLE: one/3, two/4}".
> What I am looking for is "1:3 2:4", where 1 and 2 are arbitrary identifiers,
> sortedness is not an issue.
>
> The task can obviously be solved using some Java Map, but it should be less
> efficient than using native Lucene methods.
>
> I am using 2.9.1, my index can be considered constant.
>
> Thanks,
> Illes Solt
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene:
http://www.lucidimagination.com/search