Re: A simple Vector Space Model and TFIDF usage

Grant Ingersoll Tue, 30 Jun 2009 09:13:55 -0700


On Jun 29, 2009, at 3:10 PM, Amir Hossein Jadidinejad wrote:

Hi,
It's my first experiment with Lucene. Please help me.
I'm going to index a set of documents and create a feature vector for each of them. This vector contains all terms that belong to the document, weighted using TFIDF. After that I want to compute the cosine similarity between all documents and produce a doc-doc similarity matrix. My document set is large and it's important to have a scalable implementation.

See Mahout (http://lucene.apache.org/mahout). In the utils module is a class called LuceneIterable that the o.a.mahout.utils.vectors.Driver program can use to convert a Lucene index into a Mahout Vector representation, which can then be used to create a d-d similarity matrix. It uses Hadoop, so you can go as big as you want.


See http://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text
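
For intuition only, the computation described in the question (TF-IDF weighted term vectors, then pairwise cosine similarity) can be sketched on a tiny in-memory corpus. This is plain Python, not the Lucene or Mahout API. The function names are made up for illustration, and the weighting tf * log(N / df) is an assumed textbook variant rather than the formula either library uses.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse {term: weight} vector per document.

    docs is a list of token lists. The weight is tf * log(N / df),
    an assumed textbook variant, not Lucene's or Mahout's formula.
    """
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(vectors):
    """Dense doc-doc matrix: O(n^2) comparisons, fine only for small n."""
    return [[cosine(a, b) for b in vectors] for a in vectors]


if __name__ == "__main__":
    docs = [
        ["lucene", "index", "search"],
        ["lucene", "search", "search"],
        ["hadoop", "cluster"],
    ]
    for row in similarity_matrix(tfidf_vectors(docs)):
        print(["%.3f" % x for x in row])
```

The all-pairs loop grows quadratically with the number of documents, which is why a distributed approach such as the Hadoop-based one above matters for a large document set.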

-Grant

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:

http://www.lucidimagination.com/search


