Re: Vector Space Model: New Similarity Implementation Issues

Grant Ingersoll Thu, 28 Feb 2008 14:09:36 -0800

FYI: The mailing list handler strips attachments.

At any rate, sounds like an interesting project. I don't know how easy it will be for you to implement 7 variants of VSM in Lucene given the nature of the APIs, but if you do, it might be handy to see your changes as a patch. :-) Also not quite sure what all those variants will help with when it comes to your broader goal, but that isn't for me to decide :-) Seems like your goal is to find the traceability stuff, not see if you can figure out how to change Lucene's similarity! To that end, my two cents would be to focus on creating the right kinds of queries, analyzers, etc.



-Grant

On Feb 28, 2008, at 3:55 PM, Dharmalingam wrote:

Thanks for your tips. My overall goal is to quickly implement 7 variants of
the vector space model using Lucene. You can find these variants in the
uploaded file.

I am doing all of this for a much broader goal: I am trying to recover
traceability links from requirements to source code files. I treat every
requirement as a query. For this problem, I would like to compare this
collection of algorithms for their relevance.




Grant Ingersoll-6 wrote:



On Feb 28, 2008, at 9:00 AM, Dharmalingam wrote:


Thanks for the reply. Sorry if my explanation was not clear. Yes, you are
correct, the model is based on Salton's VSM. However, the calculation of the
term weight and the doc norm is, in my opinion, different from Lucene. If
you look at the table given in
http://www.miislita.com/term-vector/term-vector-3.html, they calculate the
document norm based on the weight wi=tfi*idfi. I looked at the interfaces of
the Similarity and DefaultSimilarity classes. Here is lengthNorm:

public float lengthNorm(String fieldName, int numTerms) {
  return (float)(1.0 / Math.sqrt(numTerms));
}

You can see that this lengthNorm for a doc is quite different from the
norm calculation on that website.


The lengthNorm method is different from the IDF calculation.  In the
Similarity class, that is handled by the idf() method. Length norm is
an attempt to address one of the limitations listed further down in
that paper:
"Long Documents: Very long documents make similarity measures
difficult (vectors with small dot products and high dimensionality)"
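
To make the difference concrete, here is a minimal sketch in plain Java (no Lucene dependency) that computes both quantities for one toy document. The class name, the method names, and the sample tf/idf numbers are made up for illustration. Only the 1/sqrt(numTerms) formula comes from the DefaultSimilarity code quoted above, and the weight w_i = tf_i * idf_i follows the description of the website's table.

```java
public class NormComparison {

    // Same formula as the lengthNorm quoted above: it depends only on
    // how many terms the field contains, not on their weights.
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Classic VSM document norm: Euclidean length of the weight vector,
    // with w_i = tf_i * idf_i as described for the website's table.
    static double tfIdfNorm(double[] tf, double[] idf) {
        double sumOfSquares = 0.0;
        for (int i = 0; i < tf.length; i++) {
            double w = tf[i] * idf[i];
            sumOfSquares += w * w;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        // Hypothetical document with two distinct terms and four tokens.
        double[] tf = {3.0, 1.0};
        double[] idf = {1.0, 2.0};
        int numTerms = 4;

        System.out.println("lengthNorm = " + lengthNorm(numTerms));
        System.out.println("tf*idf norm = " + tfIdfNorm(tf, idf));
    }
}
```

The first value is a multiplier that shrinks as the field gets longer. The second is the vector length that a cosine-style model would divide by, so the two are not interchangeable.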



Similarly, the queryNorm method of the DefaultSimilarity class is:

/** Implemented as <code>1/sqrt(sumOfSquaredWeights)</code>. */
public float queryNorm(float sumOfSquaredWeights) {
  return (float)(1.0 / Math.sqrt(sumOfSquaredWeights));
}

This is again different from the website model.


Query norm is an attempt to allow for comparison of scores across
queries, but I don't think one should do that anyway.
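
One consequence can be checked in a few lines: the factor is the same for every document matched by a given query, so it rescales the scores without changing their order within that query. The sketch below is plain Java with no Lucene dependency. The class name, helper names, and raw scores are hypothetical; only the 1/sqrt(sumOfSquaredWeights) formula is taken from the code quoted above.

```java
public class QueryNormDemo {

    // Same formula as the queryNorm quoted above.
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    // Multiplies every score by the same per-query factor.
    static float[] scale(float[] scores, float factor) {
        float[] out = new float[scores.length];
        for (int i = 0; i < scores.length; i++) {
            out[i] = scores[i] * factor;
        }
        return out;
    }

    // Index of the highest score (first one wins on ties).
    static int argMax(float[] scores) {
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical unnormalized scores of three documents for one query.
        float[] raw = {0.8f, 2.4f, 1.1f};
        float[] normalized = scale(raw, queryNorm(13.0f));

        System.out.println("top doc before: " + argMax(raw));
        System.out.println("top doc after:  " + argMax(normalized));
    }
}
```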



I also have difficulties with the tf method of DefaultSimilarity:
/** Implemented as <code>sqrt(freq)</code>. */
public float tf(float freq) {
  return (float)Math.sqrt(freq);
}


These are all callback methods from within the Scorer classes that
each Query uses.  Have a look at TermScorer for how these things get
called.


Try this as an example:

Set up a really simple index with 1 or 2 docs, each with a few words.
Set up a simple Similarity class where you override all of these
methods to return 1 (or some simple default),
then index your documents and do a few queries.

Then, have a look at Searcher.explain() to see why a document scores
the way it does.  Then, you can work to modify from there.
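
The idea behind that experiment can be previewed without Lucene. The sketch below is a plain-Java stand-in, not Lucene code: the Weights interface, the score method, and the toy documents are all hypothetical, and the scoring formula (sum of tf * idf * lengthNorm over matching query terms) is a deliberate simplification rather than what the Scorer classes actually compute. It only shows why starting from callbacks that all return 1 gives an easy baseline: the score collapses to the number of query terms found in the document.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class FlatSimilarityDemo {

    // Hypothetical stand-in for the callbacks discussed in this thread.
    interface Weights {
        float tf(float freq);
        float idf(int docFreq, int numDocs);
        float lengthNorm(int numTerms);
    }

    // Every callback returns 1, as in the suggested experiment.
    static final Weights FLAT = new Weights() {
        public float tf(float freq) { return 1f; }
        public float idf(int docFreq, int numDocs) { return 1f; }
        public float lengthNorm(int numTerms) { return 1f; }
    };

    // tf and lengthNorm use the formulas quoted earlier in the thread;
    // idf is left at 1 here because its formula is not quoted there.
    static final Weights SQRT = new Weights() {
        public float tf(float freq) { return (float) Math.sqrt(freq); }
        public float idf(int docFreq, int numDocs) { return 1f; }
        public float lengthNorm(int numTerms) {
            return (float) (1.0 / Math.sqrt(numTerms));
        }
    };

    // Simplified toy score: sum of tf * idf * lengthNorm over the
    // query terms that occur in the document.
    static float score(List<String> query, List<String> doc,
                       List<List<String>> allDocs, Weights w) {
        float sum = 0f;
        for (String term : query) {
            int freq = Collections.frequency(doc, term);
            if (freq == 0) {
                continue;
            }
            int docFreq = 0;
            for (List<String> d : allDocs) {
                if (d.contains(term)) {
                    docFreq++;
                }
            }
            sum += w.tf(freq) * w.idf(docFreq, allDocs.size())
                    * w.lengthNorm(doc.size());
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> doc1 = Arrays.asList("lucene", "vector", "vector");
        List<String> doc2 = Arrays.asList("lucene", "index");
        List<List<String>> docs = Arrays.asList(doc1, doc2);
        List<String> query = Arrays.asList("lucene", "vector");

        System.out.println("flat doc1: " + score(query, doc1, docs, FLAT));
        System.out.println("flat doc2: " + score(query, doc2, docs, FLAT));
        System.out.println("sqrt doc1: " + score(query, doc1, docs, SQRT));
        System.out.println("sqrt doc2: " + score(query, doc2, docs, SQRT));
    }
}
```

Swapping one callback at a time from the flat version to a real formula shows which factor moves which score, which is the same thing the explain output would show inside Lucene.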

Here's the bigger question: what is your ultimate goal here? Are you
just trying to understand Lucene at an academic/programming level or
do you have something you are trying to achieve in terms of relevance?


-Grant

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

http://www.nabble.com/file/p15745822/ieee-sw-rank.pdf ieee-sw-rank.pdf




--------------------------
Grant Ingersoll
http://www.lucenebootcamp.com
Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ