FYI: The mailing list handler strips attachments.
At any rate, sounds like an interesting project. I don't know how
easy it will be for you to implement 7 variants of VSM in Lucene given
the nature of the APIs, but if you do, it might be handy to see your
changes as a patch. :-) Also not quite sure what all those variants
will help with when it comes to your broader goal, but that isn't for
me to decide :-) Seems like your goal is to find the traceability
stuff, not see if you can figure out how to change Lucene's
similarity! To that end, my two cents would be to focus on creating
the right kinds of queries, analyzers, etc.
-Grant
On Feb 28, 2008, at 3:55 PM, Dharmalingam wrote:
Thanks for your tips. My overall goal is to quickly implement 7 variants
of the vector space model using Lucene. You can find these variants in
the uploaded file.
I am doing all of this for a much broader goal: I am trying to recover
traceability links from requirements to source code files. I treat every
requirement as a query. In this problem, I would like to compare this
collection of algorithms for their relevance.
Grant Ingersoll-6 wrote:
On Feb 28, 2008, at 9:00 AM, Dharmalingam wrote:
Thanks for the reply. Sorry if my explanation was not clear. Yes, you
are correct, the model is based on Salton's VSM. However, the
calculation of the term weight and the doc norm is, in my opinion,
different from Lucene. If you look at the table given in
http://www.miislita.com/term-vector/term-vector-3.html, they calculate
the document norm based on the weight wi=tfi*idfi. I looked at the
methods of the Similarity and DefaultSimilarity classes. I place one
below:
public float lengthNorm(String fieldName, int numTerms) {
  return (float)(1.0 / Math.sqrt(numTerms));
}
You can see that this lengthNorm for a doc is quite different from the
norm calculation on that website.
The lengthNorm method is different from the IDF calculation. In the
Similarity class, that is handled by the idf() method. Length norm is
an attempt to address one of the limitations listed further down in
that paper:
"Long Documents: Very long documents make similarity measures
difficult (vectors with small dot products and high dimensionality)"
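To make the difference concrete, here is a minimal standalone sketch that puts the two normalizations side by side. It does not use Lucene's classes: the class name, the method names, and the tf/idf values are made up for illustration. Only the 1/sqrt(numTerms) formula comes from the DefaultSimilarity code quoted above, and the Euclidean norm over wi=tfi*idfi is my reading of the website's table.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Standalone illustration; these names are hypothetical, not part of Lucene. */
class NormComparison {

    /** Same formula as the quoted lengthNorm: depends only on the term count. */
    static float luceneLengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    /** Classic VSM document norm: Euclidean length of the tf*idf weight vector. */
    static double tfIdfNorm(Map<String, Integer> termFreqs, Map<String, Double> idf) {
        double sumOfSquares = 0.0;
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            double w = e.getValue() * idf.get(e.getKey()); // w_i = tf_i * idf_i
            sumOfSquares += w * w;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        // Made-up document: "gold silver silver truck" (4 terms in total).
        Map<String, Integer> tf = new LinkedHashMap<>();
        tf.put("gold", 1);
        tf.put("silver", 2);
        tf.put("truck", 1);

        // Made-up idf values, chosen only to keep the arithmetic easy to follow.
        Map<String, Double> idf = new LinkedHashMap<>();
        idf.put("gold", 0.5);
        idf.put("silver", 1.0);
        idf.put("truck", 0.5);

        System.out.println("1/sqrt(numTerms) = " + luceneLengthNorm(4));
        System.out.println("tf*idf norm      = " + tfIdfNorm(tf, idf));
    }
}
```

The first value ignores which terms occur and how rare they are; the second changes whenever the idf values change. That is why the two cannot be compared directly.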
Similarly, the queryNorm method of the DefaultSimilarity class is:
/** Implemented as <code>1/sqrt(sumOfSquaredWeights)</code>. */
public float queryNorm(float sumOfSquaredWeights) {
  return (float)(1.0 / Math.sqrt(sumOfSquaredWeights));
}
This is again different from the website model.
Query norm is an attempt to allow for comparison of scores across
queries, but I don't think one should do that anyway.
I also have difficulties with the tf method of DefaultSimilarity:
/** Implemented as <code>sqrt(freq)</code>. */
public float tf(float freq) {
  return (float)Math.sqrt(freq);
}
These are all callback methods from within the Scorer classes that
each Query uses. Have a look at TermScorer for how these things get
called.
Try this as an example:
Set up a really simple index with 1 or 2 docs, each with a few words.
Set up a simple Similarity class where you override all of these
methods to return 1 (or some simple default),
and then index your documents and do a few queries.
Then, have a look at Searcher.explain() to see why a document scores
the way it does. Then, you can work to modify from there.
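As a rough illustration of that experiment, the sketch below mimics how the callbacks combine into a score, without depending on Lucene at all. SimilaritySketch and FlatSimilarity are hypothetical stand-ins, not Lucene classes, and score() is a simplified version of the classic formula that ignores boosts and the way norms are stored in the index. The default method bodies mirror the DefaultSimilarity formulas as I understand them. With every factor forced to 1, the score collapses to the number of matching query terms, which gives a known baseline before changing one factor at a time.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for the Similarity callbacks; NOT Lucene's class. */
class SimilaritySketch {
    float tf(float freq) { return (float) Math.sqrt(freq); }
    float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }
    float lengthNorm(int numTerms) { return (float) (1.0 / Math.sqrt(numTerms)); }
    float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }
    float coord(int overlap, int maxOverlap) { return overlap / (float) maxOverlap; }

    /** Number of documents in the toy index that contain the term. */
    static int docFreq(String term, List<List<String>> index) {
        int count = 0;
        for (List<String> doc : index) {
            if (doc.contains(term)) count++;
        }
        return count;
    }

    /** Simplified score: coord * sum over matching terms of queryWeight * fieldWeight. */
    float score(List<String> query, List<String> doc, List<List<String>> index) {
        int numDocs = index.size();
        float sumOfSquaredWeights = 0f;
        for (String term : query) {
            float termIdf = idf(docFreq(term, index), numDocs);
            sumOfSquaredWeights += termIdf * termIdf;
        }
        float qNorm = queryNorm(sumOfSquaredWeights);

        float sum = 0f;
        int overlap = 0;
        for (String term : query) {
            int freq = Collections.frequency(doc, term);
            if (freq == 0) continue; // term does not occur in this doc
            overlap++;
            float termIdf = idf(docFreq(term, index), numDocs);
            sum += (termIdf * qNorm) * (tf(freq) * termIdf * lengthNorm(doc.size()));
        }
        return coord(overlap, query.size()) * sum;
    }
}

/** Every factor returns 1, as suggested above. */
class FlatSimilarity extends SimilaritySketch {
    @Override float tf(float freq) { return 1f; }
    @Override float idf(int docFreq, int numDocs) { return 1f; }
    @Override float lengthNorm(int numTerms) { return 1f; }
    @Override float queryNorm(float sumOfSquaredWeights) { return 1f; }
    @Override float coord(int overlap, int maxOverlap) { return 1f; }
}

class FlatSimilarityDemo {
    public static void main(String[] args) {
        List<List<String>> index = Arrays.asList(
                Arrays.asList("gold", "silver", "truck"),
                Arrays.asList("gold", "gold", "fire"));
        List<String> query = Arrays.asList("gold", "truck");
        SimilaritySketch flat = new FlatSimilarity();
        for (List<String> doc : index) {
            System.out.println(doc + " -> " + flat.score(query, doc, index));
        }
    }
}
```

In a real Lucene setup the equivalent would be a subclass of DefaultSimilarity installed on both the IndexWriter and the Searcher, with Searcher.explain() showing the same factors for each hit.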
Here's the bigger question: what is your ultimate goal here? Are you
just trying to understand Lucene at an academic/programming level or
do you have something you are trying to achieve in terms of
relevance?
-Grant
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://www.nabble.com/file/p15745822/ieee-sw-rank.pdf ieee-sw-rank.pdf
--
View this message in context:
http://www.nabble.com/Vector-Space-Model%3A-New-Similarity-Implementation-Issues-tp15696719p15745822.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
--------------------------
Grant Ingersoll
http://www.lucenebootcamp.com
Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ