Re: Vector Space Model: New Similarity Implementation Issues

Dharmalingam Thu, 28 Feb 2008 14:19:27 -0800

You can find those variants of the vector space model in this interesting
article:
http://ieeexplore.ieee.org/iel1/52/12658/00582976.pdf?tp=&isnumber=&arnumber=582976


You have now confirmed for me that, given the current nature of the
Similarity APIs, it will not be easy to realize these variants quickly.

Actually, I implemented the earlier web-site model as a separate Java
program, which uses Lucene classes but does not inherit from the Similarity
class. It appears that inheriting from the Similarity class will not solve
my problem of realizing these variants.
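
For readers following the thread, here is a minimal stand-alone sketch of the
web-site style calculation, where the document norm is built from the weights
wi=tfi*idfi. It does not use Lucene at all and is not the program mentioned
above. The class name, the method names, the three toy documents, and the
choice of idf = log10(N/df) are all assumptions made for illustration. The
luceneLengthNorm method only restates the 1/sqrt(numTerms) formula quoted
further down, so that the two norms can be compared side by side.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-alone sketch; all names here are illustrative only.
class TfIdfCosine {

    // Raw term frequencies of one lower-cased, whitespace-tokenized text.
    static Map<String, Integer> termFreqs(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String tok : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!tok.isEmpty()) tf.merge(tok, 1, Integer::sum);
        }
        return tf;
    }

    // Number of documents in the collection that contain each term.
    static Map<String, Integer> docFreqs(String[] docs) {
        Map<String, Integer> df = new HashMap<>();
        for (String doc : docs) {
            for (String term : termFreqs(doc).keySet()) df.merge(term, 1, Integer::sum);
        }
        return df;
    }

    // w_i = tf_i * idf_i, assuming idf_i = log10(N / df_i).
    static Map<String, Double> weights(String text, Map<String, Integer> df, int numDocs) {
        Map<String, Double> w = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreqs(text).entrySet()) {
            Integer d = df.get(e.getKey());
            if (d == null) continue; // term not in the collection: weight 0
            w.put(e.getKey(), e.getValue() * Math.log10((double) numDocs / d));
        }
        return w;
    }

    // Euclidean norm of a weight vector: sqrt(sum of w_i^2).
    static double norm(Map<String, Double> w) {
        double sum = 0.0;
        for (double v : w.values()) sum += v * v;
        return Math.sqrt(sum);
    }

    // Cosine similarity: dot(q, d) / (|q| * |d|), or 0 if either norm is 0.
    static double cosine(Map<String, Double> q, Map<String, Double> d) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : q.entrySet()) {
            dot += e.getValue() * d.getOrDefault(e.getKey(), 0.0);
        }
        double denom = norm(q) * norm(d);
        return denom == 0.0 ? 0.0 : dot / denom;
    }

    // The 1/sqrt(numTerms) formula from the lengthNorm code quoted below.
    static double luceneLengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    public static void main(String[] args) {
        String[] docs = {
            "Shipment of gold damaged in a fire",
            "Delivery of silver arrived in a silver truck",
            "Shipment of gold arrived in a truck"
        };
        Map<String, Integer> df = docFreqs(docs);
        Map<String, Double> query = weights("gold silver truck", df, docs.length);
        for (String doc : docs) {
            Map<String, Double> w = weights(doc, df, docs.length);
            int numTerms = doc.trim().split("\\s+").length;
            System.out.printf(Locale.ROOT, "cosine=%.4f  tfidfNorm=%.4f  lengthNorm=%.4f  %s%n",
                    cosine(query, w), norm(w), luceneLengthNorm(numTerms), doc);
        }
    }
}
```

The tf-idf norm depends on which terms a document contains and how rare they
are, whereas the length norm depends only on the number of terms, which is
the difference under discussion in this thread.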


Grant Ingersoll-6 wrote:
> 
> FYI: The mailing list handler strips attachments.
> 
> At any rate, sounds like an interesting project.  I don't know how  
> easy it will be for you to implement 7 variants of VSM in Lucene given  
> the nature of the APIs, but if you do, it might be handy to see your  
> changes as a patch.  :-)  Also not quite sure what all those variants  
> will help with when it comes to your broader goal, but that isn't for  
> me to decide :-)  Seems like your goal is to find the traceability  
> stuff, not see if you can figure out how to change Lucene's  
> similarity!  To that end, my two cents would be to focus on creating  
> the right kinds of queries, analyzers, etc.
> 
> 
> -Grant
> 
> On Feb 28, 2008, at 3:55 PM, Dharmalingam wrote:
> 
>>
>> Thanks for your tips. My overall goal is to quickly implement 7
>> variants of the vector space model using Lucene. You can find these
>> variants in the uploaded file.
>>
>> I am doing all of this for a much broader goal: I am trying to
>> recover traceability links from requirements to source code files. I
>> treat every requirement as a query. For this problem, I would like to
>> compare this collection of algorithms for their relevance.
>>
>>
>>
>>
>> Grant Ingersoll-6 wrote:
>>>
>>>
>>> On Feb 28, 2008, at 9:00 AM, Dharmalingam wrote:
>>>
>>>>
>>>> Thanks for the reply. Sorry if my explanation was not clear. Yes, you
>>>> are correct that the model is based on Salton's VSM. However, the
>>>> calculation of the term weight and the doc norm is, in my opinion,
>>>> different from Lucene's. If you look at the table given in
>>>> http://www.miislita.com/term-vector/term-vector-3.html, they
>>>> calculate the document norm based on the weight wi=tfi*idfi. I looked
>>>> at the interfaces of the Similarity and DefaultSimilarity classes. I
>>>> place the lengthNorm method below:
>>>>
>>>> public float lengthNorm(String fieldName, int numTerms) {
>>>>   return (float)(1.0 / Math.sqrt(numTerms));
>>>> }
>>>>
>>>> You can see that this lengthNorm for a doc is quite different from
>>>> the website's norm calculation.
>>>
>>> The lengthNorm method is different from the IDF calculation.  In the
>>> Similarity class, that is handled by the idf() method.  Length norm  
>>> is
>>> an attempt to address one of the limitations listed further down in
>>> that paper:
>>> "Long Documents: Very long documents make similarity measures
>>> difficult (vectors with small dot products and high dimensionality)"
>>>
>>>
>>>
>>>>
>>>>
>>>> Similarly, the queryNorm method of the DefaultSimilarity class is:
>>>>
>>>> /** Implemented as <code>1/sqrt(sumOfSquaredWeights)</code>. */
>>>> public float queryNorm(float sumOfSquaredWeights) {
>>>>   return (float)(1.0 / Math.sqrt(sumOfSquaredWeights));
>>>> }
>>>>
>>>> This is again different from the website model.
>>>
>>> Query norm is an attempt to allow for comparison of scores across
>>> queries, but I don't think one should do that anyway.
>>>
>>>
>>>>
>>>>
>>>> I also have difficulties with the tf method of DefaultSimilarity:
>>>> /** Implemented as <code>sqrt(freq)</code>. */
>>>> public float tf(float freq) {
>>>>   return (float)Math.sqrt(freq);
>>>> }
>>>>
>>>
>>> These are all callback methods from within the Scorer classes that
>>> each Query uses.  Have a look at TermScorer for how these things get
>>> called.
>>>
>>>
>>> Try this as an example:
>>>
>>> Set up a really simple index with 1 or 2 docs, each with a few words.
>>> Set up a simple Similarity class where you override all of these
>>> methods to return 1 (or some simple default),
>>> and then index your documents and do a few queries.
>>>
>>> Then, have a look at Searcher.explain() to see why a document scores
>>> the way it does.  Then, you can work to modify from there.
>>>
>>> Here's the bigger question:  what is your ultimate goal here?  Are  
>>> you
>>> just trying to understand Lucene at an academic/programming level or
>>> do you have something you are trying to achieve in terms of  
>>> relevance?
>>>
>>> -Grant
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>>> For additional commands, e-mail: [EMAIL PROTECTED]
>>>
>>>
>>>
>> http://www.nabble.com/file/p15745822/ieee-sw-rank.pdf ieee-sw-rank.pdf
>> -- 
>> View this message in context:
>> http://www.nabble.com/Vector-Space-Model%3A-New-Similarity-Implementation-Issues-tp15696719p15745822.html
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>>
>>
> 
> --------------------------
> Grant Ingersoll
> http://www.lucenebootcamp.com
> Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam
> 
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
> 
> 
> 
> 
> 
> 
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Vector-Space-Model%3A-New-Similarity-Implementation-Issues-tp15696719p15747395.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

