On Jun 29, 2009, at 3:10 PM, Amir Hossein Jadidinejad wrote:

Hi,
It's my first experiment with Lucene. Please help me.
I'm going to index a set of documents and create a feature vector for each of them. Each vector contains all the terms in the document, weighted by TF-IDF. After that, I want to compute the cosine similarity between all documents and produce a doc-doc similarity matrix. My document set is large, so it's important to have a scalable implementation.


See Mahout (http://lucene.apache.org/mahout). The utils module has a class called LuceneIterable that the o.a.mahout.utils.vectors.Driver program can use to convert a Lucene index into a Mahout Vector representation, which can then be used to create a doc-doc similarity matrix. Mahout uses Hadoop, so you can go as big as you want.

See http://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text
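
To make the computation in the question concrete, here is a minimal in-memory sketch in plain Java. It is not the Mahout or Lucene API: the class name TfIdfCosine and all of its methods are hypothetical, the documents are assumed to be already tokenized, and the weighting is assumed to be raw term count times ln(N / df), which is only one of several common TF-IDF variants. It compares every pair of documents, so it is only suitable for small collections; for a large index the Hadoop-based route above is the one that scales.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Lucene or Mahout.
final class TfIdfCosine {

    // Builds one sparse TF-IDF vector (term -> weight) per tokenized document.
    // Assumed weighting: tf = raw count, idf = ln(N / df).
    static List<Map<String, Double>> tfIdf(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        List<Map<String, Integer>> counts = new ArrayList<>();
        for (List<String> doc : docs) {
            Map<String, Integer> tf = new HashMap<>();
            for (String term : doc) {
                tf.merge(term, 1, Integer::sum);
            }
            // Each distinct term in this document adds one to its document frequency.
            for (String term : tf.keySet()) {
                df.merge(term, 1, Integer::sum);
            }
            counts.add(tf);
        }
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (Map<String, Integer> tf : counts) {
            Map<String, Double> vec = new HashMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double idf = Math.log((double) docs.size() / df.get(e.getKey()));
                vec.put(e.getKey(), e.getValue() * idf);
            }
            vectors.add(vec);
        }
        return vectors;
    }

    // Euclidean length of a sparse vector.
    static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity of two sparse vectors; defined as 0 if either has zero length.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
        }
        double na = norm(a);
        double nb = norm(b);
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
    }

    // Fills the symmetric doc-doc matrix by computing each pair once.
    static double[][] similarityMatrix(List<Map<String, Double>> vectors) {
        int n = vectors.size();
        double[][] sim = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {
                double s = cosine(vectors.get(i), vectors.get(j));
                sim[i][j] = s;
                sim[j][i] = s;
            }
        }
        return sim;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("apache", "lucene", "index"),
                Arrays.asList("apache", "lucene", "search"),
                Arrays.asList("hadoop", "cluster"));
        for (double[] row : similarityMatrix(tfIdf(docs))) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

In this toy input the first two documents share two terms and so get a similarity strictly between 0 and 1, while the third shares no terms with either and gets 0. The pairwise loop is quadratic in the number of documents, which is the part a distributed implementation has to replace.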

-Grant

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search


