You can find those variants of the vector space model in this interesting article: http://ieeexplore.ieee.org/iel1/52/12658/00582976.pdf?tp=&isnumber=&arnumber=582976
You have confirmed what I suspected: given the current nature of the Similarity APIs, it will not be easy to implement these variants quickly. In fact, I implemented the earlier website model as a separate Java program that uses Lucene classes but does not extend the Similarity class. It appears that extending the Similarity class will not solve my problem of implementing these variants.

Grant Ingersoll-6 wrote:
>
> FYI: The mailing list handler strips attachments.
>
> At any rate, sounds like an interesting project. I don't know how
> easy it will be for you to implement 7 variants of VSM in Lucene given
> the nature of the APIs, but if you do, it might be handy to see your
> changes as a patch. :-) Also not quite sure what all those variants
> will help with when it comes to your broader goal, but that isn't for
> me to decide :-) Seems like your goal is to find the traceability
> stuff, not see if you can figure out how to change Lucene's
> similarity! To that end, my two cents would be to focus on creating
> the right kinds of queries, analyzers, etc.
>
> -Grant
>
> On Feb 28, 2008, at 3:55 PM, Dharmalingam wrote:
>
>> Thanks for your tips. My overall goal is to quickly implement 7
>> variants of the vector space model using Lucene. You can find these
>> variants in the uploaded file.
>>
>> I am doing all this for a much broader goal: I am trying to recover
>> traceability links from requirements to source code files. I treat
>> every requirement as a query. In this problem, I would like to
>> compare this collection of algorithms for their relevance.
>>
>> Grant Ingersoll-6 wrote:
>>>
>>> On Feb 28, 2008, at 9:00 AM, Dharmalingam wrote:
>>>
>>>> Thanks for the reply. Sorry if my explanation is not clear. Yes, you
>>>> are correct that the model is based on Salton's VSM. However, the
>>>> calculation of the term weight and the doc norm is, in my opinion,
>>>> different from Lucene. If you look at the table given in
>>>> http://www.miislita.com/term-vector/term-vector-3.html, they
>>>> calculate the document norm based on the weight wi=tfi*idfi. I
>>>> looked at the interfaces of the Similarity and DefaultSimilarity
>>>> classes. I place it below:
>>>>
>>>> public float lengthNorm(String fieldName, int numTerms) {
>>>>   return (float)(1.0 / Math.sqrt(numTerms));
>>>> }
>>>>
>>>> You can see that this lengthNorm for a doc is quite different from
>>>> that website's norm calculation.
>>>
>>> The lengthNorm method is different from the IDF calculation. In the
>>> Similarity class, that is handled by the idf() method. Length norm is
>>> an attempt to address one of the limitations listed further down in
>>> that paper:
>>> "Long Documents: Very long documents make similarity measures
>>> difficult (vectors with small dot products and high dimensionality)"
>>>
>>>> Similarly, the queryNorm interface of the DefaultSimilarity class is:
>>>>
>>>> /** Implemented as <code>1/sqrt(sumOfSquaredWeights)</code>. */
>>>> public float queryNorm(float sumOfSquaredWeights) {
>>>>   return (float)(1.0 / Math.sqrt(sumOfSquaredWeights));
>>>> }
>>>>
>>>> This is again different from the website model.
>>>
>>> Query norm is an attempt to allow for comparison of scores across
>>> queries, but I don't think one should do that anyway.
>>>
>>>> I also have difficulties with the tf interface of DefaultSimilarity:
>>>>
>>>> /** Implemented as <code>sqrt(freq)</code>. */
>>>> public float tf(float freq) {
>>>>   return (float)Math.sqrt(freq);
>>>> }
>>>
>>> These are all callback methods from within the Scorer classes that
>>> each Query uses. Have a look at TermScorer for how these things get
>>> called.
>>>
>>> Try this as an example:
>>>
>>> Set up a really simple index with 1 or 2 docs, each with a few words.
>>> Set up a simple Similarity class where you override all of these
>>> methods to return 1 (or some simple default), and then index your
>>> documents and do a few queries.
>>>
>>> Then, have a look at Searcher.explain() to see why a document scores
>>> the way it does. Then, you can work to modify from there.
>>>
>>> Here's the bigger question: what is your ultimate goal here? Are you
>>> just trying to understand Lucene at an academic/programming level or
>>> do you have something you are trying to achieve in terms of
>>> relevance?
>>>
>>> -Grant
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>>> For additional commands, e-mail: [EMAIL PROTECTED]
>>>
>> http://www.nabble.com/file/p15745822/ieee-sw-rank.pdf ieee-sw-rank.pdf
>> --
>> View this message in context:
>> http://www.nabble.com/Vector-Space-Model%3A-New-Similarity-Implementation-Issues-tp15696719p15745822.html
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucenebootcamp.com
> Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>

--
View this message in context: http://www.nabble.com/Vector-Space-Model%3A-New-Similarity-Implementation-Issues-tp15696719p15747395.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
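For readers following the thread, the website-style weighting discussed above (term weight wi = tfi*idfi, document norm taken as the Euclidean length of the weight vector, score as the cosine between query and document) can be sketched as a small standalone Java program with no Lucene dependency. This is only an illustration, not the program mentioned in the message: the class name, method names, and toy documents are hypothetical, and the form idf = log10(N/df) is an assumption about the model.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of tf*idf weighting with cosine normalization. */
public class TfIdfCosineSketch {

    /** Raw term frequency of each token in one tokenized text. */
    public static Map<String, Integer> termFreqs(List<String> tokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : tokens) {
            tf.merge(t, 1, Integer::sum);
        }
        return tf;
    }

    /** Assumed idf form: idf_i = log10(N / df_i) over the given collection. */
    public static Map<String, Double> idfTable(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // Count each term once per document.
            for (String t : new HashSet<>(doc)) {
                df.merge(t, 1, Integer::sum);
            }
        }
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            idf.put(e.getKey(), Math.log10((double) docs.size() / e.getValue()));
        }
        return idf;
    }

    /** Weight vector w_i = tf_i * idf_i; terms unknown to the collection are skipped. */
    public static Map<String, Double> weights(List<String> tokens, Map<String, Double> idf) {
        Map<String, Double> w = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreqs(tokens).entrySet()) {
            Double termIdf = idf.get(e.getKey());
            if (termIdf != null) {
                w.put(e.getKey(), e.getValue() * termIdf);
            }
        }
        return w;
    }

    /** Euclidean norm sqrt(sum of w_i^2): depends on the weights, not only on the term count. */
    public static double norm(Map<String, Double> w) {
        double sumSq = 0.0;
        for (double v : w.values()) {
            sumSq += v * v;
        }
        return Math.sqrt(sumSq);
    }

    /** Cosine similarity: dot product divided by the product of the two norms. */
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double denom = norm(a) * norm(b);
        return denom == 0.0 ? 0.0 : dot / denom;
    }

    public static void main(String[] args) {
        // Toy "source files" and a toy "requirement" used as the query (made-up data).
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("parse", "request", "header"),
                Arrays.asList("render", "page", "header"),
                Arrays.asList("parse", "config", "file"));
        Map<String, Double> idf = idfTable(docs);
        Map<String, Double> query = weights(Arrays.asList("parse", "request"), idf);
        for (int i = 0; i < docs.size(); i++) {
            double score = cosine(query, weights(docs.get(i), idf));
            System.out.printf("doc %d: %.4f%n", i, score);
        }
    }
}
```

The point of contrast with the quoted DefaultSimilarity code is the norm() method: it is computed from the tf*idf weights themselves, whereas lengthNorm as quoted above uses only 1/sqrt(numTerms).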