Hi Wouter,

My thought would be to go for plan (b) (have not tested it though). This
would produce simply the sum of frequencies of the different terms (I'm
referring to a real multi-term query, not a phrase as you mentioned -
"the man" - which should work). 
The problem I see is that it you loose the ability to use boosts (I
assume this is fine by you). 

I don't see a problem here, (referring to "doesn't feel right"...) - you
simply want a different scoring - "just give me the damn frequency",
right? In that situation, you should disable all the idf, coord, norm
and sqrt manipulations that Lucene did in order to produce "smarter"
scores, which takes into account and balance other properties of the
query (different terms and their IDFs); the document (lengthNorm); the
index (IDF's); and behavior of frequencies (tf implementation as sqrt).
The frameworks makes these smarter adjustments possible, it does not
mean you need it in your case. 

Ziv

 

-----Original Message-----
From: W.H. van Atteveldt [mailto:[EMAIL PROTECTED] 
Sent: Saturday, May 20, 2006 7:05 AM
To: java-user@lucene.apache.org
Subject: Scoring purely on term frequencies

Dear list,

I am interested in using Lucene for analyzing documents based on
occurrence of certain keywords. As such, I am not interested in the
'top' or 'best' documents, but I do want to know exactly how many words
in the query matched.

Thus, instead of the complicated formula used by default, I really just
want to use Score(q,d) = Sum_{t in q} freq(q,d).

[Of course, if the query is "the man", I do not want to count 'the'
before man; since 'the' I think is a Term (right?), this does not quite
hold. I want to count every occurrence of the combination 'the man']

(a)
I tried extending a SimilarityDelegator(DefaultSimilarity) and make tf
return freq and coord,idf,*Norm return 1.0f. This worked but produced
scores like 0.61 (approx) and 0.5 where it should have returned 3 and 2
(on a simple test)

(b) 
I suppose I could extend Similarity itself but the documentation is
quite sketchy on which methods are actually used, and something like
coord or idf is simply meaningless in my case. I could return 1.0 like
above but somehow it doesn't feel right. That said, I haven't tried it
yet :-)

(c)
I could skip the Searcher and directly use the IndexReader. With simple
term queries this is trivial and works as expected, but I would like to
be able to use "the man" and "the article"~3 style queries. I could go
ahead and look at the positions, but it seems like someone should
already have implemented this before. Can anyone point me in the
direction of something that gives me a frequency if I give it a query
(rather than a term).

Any help greatly appreciated!

Wouter

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to