Others can correct me if I am wrong, but I don't think a "pure" Rocchio
feedback loop is possible at the moment, since Lucene doesn't
support negative boosts
(http://lucene.apache.org/java/docs/queryparsersyntax.html). Having
said that, what we do, in a nutshell, is similar to what you describe:
For the positive examples, store the terms and a boost factor. The
boost factor is the frequency of the term across all the positive
examples multiplied by beta.
Then for the negative examples, decrement the boost factor by gamma
times the frequency of the term in all the negative examples. Remove
any terms that have a boost of zero or less.
In the end, you construct a new query out of the remaining terms and
their boosts, and submit that. I think it is more of an approximation of
Rocchio, but we have had good results from it. You also probably want to
limit the number of terms per document you add, at least if you are
concerned about performance.
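
To make the steps above concrete, here is a rough sketch of the boost bookkeeping. It is not our actual code and it does not touch the Lucene API: the class and method names are made up for illustration, each example document is assumed to be a plain map of term to frequency, and the result is just a query string in the term^boost syntax. As a simplification it caps the total number of terms rather than the number per document, and it does not escape query parser special characters.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Lucene.
final class RocchioQuerySketch {

    // Builds term -> boost from positive and negative example documents.
    // Each document is assumed to be a map of term -> frequency.
    static Map<String, Double> boosts(List<Map<String, Integer>> positives,
                                      List<Map<String, Integer>> negatives,
                                      double beta, double gamma) {
        Map<String, Double> boost = new TreeMap<>();

        // Positive examples: boost = beta * (frequency summed over all positives).
        for (Map<String, Integer> doc : positives) {
            for (Map.Entry<String, Integer> e : doc.entrySet()) {
                boost.merge(e.getKey(), beta * e.getValue(), Double::sum);
            }
        }

        // Negative examples: subtract gamma * frequency for terms already stored.
        // Terms seen only in negatives would end up below zero, so they are skipped.
        for (Map<String, Integer> doc : negatives) {
            for (Map.Entry<String, Integer> e : doc.entrySet()) {
                boost.computeIfPresent(e.getKey(), (term, b) -> b - gamma * e.getValue());
            }
        }

        // Negative boosts are not available, so drop anything at zero or below.
        boost.values().removeIf(b -> b <= 0.0);
        return boost;
    }

    // Keeps the maxTerms highest boosts and renders them as "term^boost ...".
    static String toQuery(Map<String, Double> boost, int maxTerms) {
        return boost.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(maxTerms)
                .map(e -> String.format(Locale.ROOT, "%s^%.2f", e.getKey(), e.getValue()))
                .collect(Collectors.joining(" "));
    }
}
```

The string from toQuery could then be handed to the query parser, or the same map could be used to build the query programmatically; which of those is better depends on your setup.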
-Grant
Stefan Gusenbauer wrote:
I have some thoughts about Lucene and relevance feedback. I want to
implement a variation of the Rocchio formula, and that is where I have a
problem. The formula is like this:
Query(new) = alpha * Query(old) + beta * Sum(Relevant Documents) -
gamma * Sum(Non Relevant Documents)
The relevant documents in this formula should be in a vector
representation, and this is the problem: if I work with TermFreqVectors,
the vectors are not equally long and contain different terms. My
current solution is to take the TermFreqVectors, reduce them to the
least common multiple, and then perform the computation.
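
For illustration, one way to get equally long vectors is to index every vector against the union of all terms and treat a missing term as frequency zero. The sketch below assumes this and is not something Lucene provides: the class and method names are hypothetical, and plain term-to-frequency maps stand in for the term and frequency pairs read from a TermFreqVector.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper, not part of Lucene.
final class AlignedRocchioSketch {

    // Sorted union of all terms; a term's position is its vector dimension.
    static List<String> vocabulary(List<Map<String, Integer>> vectors) {
        TreeSet<String> terms = new TreeSet<>();
        for (Map<String, Integer> v : vectors) {
            terms.addAll(v.keySet());
        }
        return new ArrayList<>(terms);
    }

    // Expands a sparse term -> frequency map to a dense array over the vocabulary.
    static double[] toDense(Map<String, Integer> v, List<String> vocab) {
        double[] out = new double[vocab.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = v.getOrDefault(vocab.get(i), 0); // absent term counts as 0
        }
        return out;
    }

    // Query(new) = alpha * Query(old) + beta * Sum(rel) - gamma * Sum(nonRel).
    // All arrays are assumed to share the same vocabulary and length.
    static double[] rocchio(double[] query, List<double[]> rel, List<double[]> nonRel,
                            double alpha, double beta, double gamma) {
        double[] out = new double[query.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = alpha * query[i];
            for (double[] r : rel) {
                out[i] += beta * r[i];
            }
            for (double[] n : nonRel) {
                out[i] -= gamma * n[i];
            }
        }
        return out;
    }
}
```

With this layout the result can contain negative weights, so those entries would still need to be handled before turning the vector back into a query.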
So my questions are:
Is this the only way to do it? (I hope not.)
Is there an add-on for Lucene that provides a real vector representation?
Does anyone have experience with this issue?
Thanks
Stefan
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
--
-------------------------------------------------------------------
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
School of Information Studies
337 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org
Voice: 315-443-5484
Fax: 315-443-6886