Couple of thoughts inline...

On Jul 22, 2010, at 10:44 PM, Xaida wrote:
> Hi all!
>
> hmmm, i need to get how important is the word in entire document collection
> that is indexed in the lucene index. I need to extract some "representable
> words", let's say concepts that are common and can be representable to whole
> collection. Or collection "keywords". I did the fulltext indexing and the
> only field i am using are text contents, because titles of the documents are
> mostly not representable (numbers, codes etc.)
>
> So, if i calculate tfidf, it gives me importance of single term with respect
> to single document.

TF gives you the importance of a term within a single document. IDF gives you the inverse of its importance across the collection: the more documents a term appears in, the lower its IDF.

> But if that word is repeating in the documents, how can
> i calculate its total importance within index?

Lucene can also normalize by document length, which is often part of these calculations too. The underlying counts can be retrieved from TermDocs, TermEnum, etc.

As a related item, you may be interested in important phrases, which can often be more helpful than single words. Check out https://cwiki.apache.org/confluence/display/MAHOUT/Collocations for one way of doing that.

-Grant

--------------------------
Grant Ingersoll
http://www.lucidimagination.com
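
As a rough illustration of the collection-wide statistics discussed above, here is a minimal sketch in plain Java. It deliberately does not use Lucene: the class CollectionTermStats, its methods, and the toy documents are all hypothetical, and the documents are assumed to be tokenized already. In a Lucene index of this era the equivalent counts would come from TermEnum.docFreq() and TermDocs.freq(). Summing length-normalized term frequencies is only one possible heuristic for "total importance", not an established Lucene score.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: collection-level term statistics
// over documents that are already tokenized into lists of terms.
class CollectionTermStats {

    // Number of documents that contain each term at least once.
    static Map<String, Integer> docFreq(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // The set removes duplicates so each document counts once per term.
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // Total number of occurrences of each term across all documents.
    static Map<String, Integer> collectionFreq(List<List<String>> docs) {
        Map<String, Integer> cf = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : doc) {
                cf.merge(term, 1, Integer::sum);
            }
        }
        return cf;
    }

    // Assumed heuristic: sum over documents of (term count / document length),
    // so that long documents do not dominate the collection-wide score.
    static Map<String, Double> lengthNormalizedTf(List<List<String>> docs) {
        Map<String, Double> score = new HashMap<>();
        for (List<String> doc : docs) {
            if (doc.isEmpty()) {
                continue;
            }
            double weight = 1.0 / doc.size();
            for (String term : doc) {
                score.merge(term, weight, Double::sum);
            }
        }
        return score;
    }

    // Toy corpus used only for illustration.
    static List<List<String>> sampleDocs() {
        return Arrays.asList(
            Arrays.asList("lucene", "index", "search"),
            Arrays.asList("lucene", "index", "index", "term"),
            Arrays.asList("lucene", "query"));
    }

    public static void main(String[] args) {
        List<List<String>> docs = sampleDocs();
        Map<String, Integer> df = docFreq(docs);
        Map<String, Integer> cf = collectionFreq(docs);
        Map<String, Double> score = lengthNormalizedTf(docs);
        for (String term : df.keySet()) {
            System.out.println(term + " df=" + df.get(term)
                + " cf=" + cf.get(term) + " score=" + score.get(term));
        }
    }
}
```

A term with high document frequency and a high summed score is common across the collection. Whether that makes it a good "keyword" depends on stop-word filtering, since words like "the" would rank highest under this heuristic.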