Hi Wouter,

My thought would be to go for plan (b), though I have not tested it. It would simply produce the sum of the frequencies of the different terms. (I'm referring to a real multi-term query, not a phrase like the "the man" example you mentioned, which should work.) The problem I see is that you lose the ability to use boosts; I assume this is fine by you.
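To make the target concrete, here is a minimal plain-Java sketch of the score that plan (b) is aiming for. It does not use Lucene at all: the class RawFrequencyScore and its methods termFreq, phraseFreq and score are hypothetical names invented for illustration, and the whitespace-split document stands in for whatever an analyzer would actually produce.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Lucene.
public class RawFrequencyScore {

    // Number of times a single term occurs in the tokenized document.
    static int termFreq(List<String> docTokens, String term) {
        int count = 0;
        for (String token : docTokens) {
            if (token.equals(term)) {
                count++;
            }
        }
        return count;
    }

    // Number of start positions at which the phrase occurs as consecutive tokens.
    static int phraseFreq(List<String> docTokens, List<String> phrase) {
        if (phrase.isEmpty()) {
            return 0;
        }
        int count = 0;
        for (int start = 0; start + phrase.size() <= docTokens.size(); start++) {
            if (docTokens.subList(start, start + phrase.size()).equals(phrase)) {
                count++;
            }
        }
        return count;
    }

    // Sum of raw term frequencies over the query terms: no idf, coord, norm or sqrt.
    static int score(List<String> docTokens, List<String> queryTerms) {
        int sum = 0;
        for (String term : queryTerms) {
            sum += termFreq(docTokens, term);
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("the man saw the dog and the man ran".split(" "));
        // Multi-term query: "man" occurs twice, "dog" once.
        System.out.println(score(doc, Arrays.asList("man", "dog")));      // prints 3
        // Phrase query: "the man" is counted as a unit, not as two terms.
        System.out.println(phraseFreq(doc, Arrays.asList("the", "man"))); // prints 2
    }
}
```

The sketch keeps the multi-term sum and the phrase count separate on purpose, since a phrase is counted as one unit rather than as the sum of its terms.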
I don't see a problem here (referring to "doesn't feel right"): you simply want a different scoring, "just give me the damn frequency", right? In that situation you should disable all the idf, coord, norm and sqrt manipulations that Lucene does in order to produce "smarter" scores. Those take into account and balance other properties: of the query (different terms and their IDFs), of the document (lengthNorm), of the index (IDFs), and of the behavior of frequencies (tf implemented as sqrt). The framework makes these smarter adjustments possible; that does not mean you need them in your case.

Ziv

-----Original Message-----
From: W.H. van Atteveldt [mailto:[EMAIL PROTECTED]
Sent: Saturday, May 20, 2006 7:05 AM
To: java-user@lucene.apache.org
Subject: Scoring purely on term frequencies

Dear list,

I am interested in using Lucene for analyzing documents based on the occurrence of certain keywords. As such, I am not interested in the 'top' or 'best' documents, but I do want to know exactly how many words in the query matched. Thus, instead of the complicated formula used by default, I really just want to use

Score(q,d) = Sum_{t in q} freq(t,d)

[Of course, if the query is "the man", I do not want to count 'the' separately from 'man'; since 'the' is, I think, a Term (right?), the formula does not quite hold. I want to count every occurrence of the combination 'the man'.]

(a) I tried extending a SimilarityDelegator(DefaultSimilarity), making tf return freq and coord, idf and *Norm return 1.0f. This worked, but on a simple test it produced scores like 0.61 (approx.) and 0.5 where it should have returned 3 and 2.

(b) I suppose I could extend Similarity itself, but the documentation is quite sketchy on which methods are actually used, and something like coord or idf is simply meaningless in my case. I could return 1.0 like above, but somehow it doesn't feel right. That said, I haven't tried it yet :-)

(c) I could skip the Searcher and directly use the IndexReader.
With simple term queries this is trivial and works as expected, but I would like to be able to use "the man" and "the article"~3 style queries. I could go ahead and look at the positions, but it seems like someone should already have implemented this. Can anyone point me in the direction of something that gives me a frequency if I give it a query (rather than a term)?

Any help greatly appreciated!

Wouter

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]