Re: How to get Term Weights (document term matrix)?

Soeren Pekrul Sat, 04 Nov 2006 01:53:42 -0800

Chris Hostetter wrote:

You really, *REALLY* don't want to be doing this using the "Hits" class
like in your example ...
   1) this will re-execute your search behind the scenes many many times
   2) the scores returned by "Hits" are pseudo-normalized ... they will be
      meaningless for any sort of comparison.
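
To illustrate point 2, here is a minimal plain-Java sketch of the kind of normalization involved. It assumes that Hits rescales a result list by its top score whenever that top score is greater than 1.0; the class and method names are hypothetical and nothing here calls Lucene itself.

```java
// Hypothetical stand-in for the per-query rescaling described above.
final class HitsNormalizationSketch {

    // Rescale raw scores by the top score, but only when the top score exceeds 1.0.
    static float[] normalize(float[] rawScores) {
        float max = 0f;
        for (float s : rawScores) {
            max = Math.max(max, s);
        }
        float scoreNorm = max > 1.0f ? 1.0f / max : 1.0f;

        float[] normalized = new float[rawScores.length];
        for (int i = 0; i < rawScores.length; i++) {
            normalized[i] = rawScores[i] * scoreNorm;
        }
        return normalized;
    }
}
```

Under that assumption, raw scores of {4.0, 2.0} for one term and {2.0, 1.0} for another come out as the same normalized list, so the values cannot be compared across terms.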


Thank you very much Hoss.

if your concern is making sure that the score you get back matches the
score you would get from executing a search even if you change the
Similarity, you could just make sure you use the lengthNorm and tf
functions from the Similarity class just like TermScorer does

That sounds very good. I can get the term frequency and the document frequency from the IndexReader. The number of tokens in a field (numTokens) for the Similarity.lengthNorm function I can get from the term vector (TermFreqVector), or I can use IndexReader.norms(String field) instead.
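
As a rough sketch of what such a computation would look like, the following plain-Java class mirrors the formulas I believe DefaultSimilarity uses: tf = sqrt(freq), idf = ln(numDocs / (docFreq + 1)) + 1, and lengthNorm = 1 / sqrt(numTokens). The class and method names are hypothetical, nothing here calls Lucene, and boosts, queryNorm, coord, and the precision lost when norms are stored in the index are all ignored.

```java
// Hypothetical standalone mirror of the default tf / idf / lengthNorm formulas.
final class TermWeightSketch {

    // Term frequency factor: square root of the raw frequency in the field.
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Inverse document frequency: ln(numDocs / (docFreq + 1)) + 1.
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Length normalization: 1 / sqrt(number of tokens in the field).
    static float lengthNorm(int numTokens) {
        return (float) (1.0 / Math.sqrt(numTokens));
    }

    // Weight of one term in one field of one document (no boosts, no queryNorm).
    static float weight(int freq, int docFreq, int numDocs, int numTokens) {
        return tf(freq) * idf(docFreq, numDocs) * lengthNorm(numTokens);
    }
}
```

With a custom Similarity, the three helper methods would be replaced by calls to that Similarity, which is the point of Hoss's suggestion.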

The TermQuery in my previous example was a simplification. The documents in my collection have several fields, such as title, abstract, and keywords. The term weights in my document term matrix should cover all fields of a document for a word (token), so in reality I used a BooleanQuery that combines the possible TermQueries for a word. Of course, I could also sum the field weights of a term.
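
Summing the field weights could be sketched like this. It is only an illustration in plain Java: the class name, the method names, and the choice of a nested map are my own assumptions, and the per-field weights are expected to be computed elsewhere.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sparse document term matrix: term -> (document id -> summed weight).
final class DocumentTermMatrixSketch {

    private final Map<String, Map<Integer, Float>> rows = new HashMap<>();

    // Add the weight a term has in one field of a document to that document's cell.
    void add(String term, int docId, float fieldWeight) {
        rows.computeIfAbsent(term, t -> new HashMap<>())
            .merge(docId, fieldWeight, Float::sum);
    }

    // Summed weight over all fields added so far; 0 if the term never occurred.
    float get(String term, int docId) {
        Map<Integer, Float> row = rows.get(term);
        return row == null ? 0f : row.getOrDefault(docId, 0f);
    }
}
```

Calling add once per field (title, abstract, keywords) for the same term and document leaves the sum of the three field weights in a single cell.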

... or you
could keep executing a TermQuery for each term like you are now, but using
a HitCollector so you get the raw score)

take a look at the Searcher.search methods that take in a HitCollector.

That seems to be the easiest way for my BooleanQuery. I will start with this and change my current implementation.
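
The shape of such a collector might look roughly as follows. To keep the sketch self-contained, RawScoreCollector is a hypothetical stand-in that only copies the collect(int doc, float score) callback of Lucene's HitCollector; MatrixRowCollector and its methods are likewise my own names.

```java
// Hypothetical stand-in for the callback shape of Lucene's HitCollector.
abstract class RawScoreCollector {
    abstract void collect(int doc, float score);
}

// Stores the raw score of every matching document in one matrix row,
// indexed by document id. Documents that never match keep the weight 0.
final class MatrixRowCollector extends RawScoreCollector {

    private final float[] row;

    MatrixRowCollector(int maxDoc) {
        this.row = new float[maxDoc];
    }

    @Override
    void collect(int doc, float score) {
        row[doc] = score;
    }

    float[] row() {
        return row;
    }
}
```

In the real implementation the class would extend HitCollector and be passed to Searcher.search(query, collector), once per word, so each search fills one row of the matrix.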


Have a nice weekend.

Sören

