Re: Counting term frequency without using Explanation

Erick Erickson Wed, 07 Feb 2007 06:49:06 -0800

Before you go too far down this path, please consider what a "hit" is. It's
more complicated than you think <G>.


If all you want to do is count up the number of times each term appears in
the document, it's not too hard. You should be able to use a
TermEnum/TermDocs process to count them.

TermDocs should work: seek to a term, skip to the document number
(which you'll have to get somewhere else), and, if the docid you land on is
your target, add its freq() to your count. Repeat for each term.
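
To make the seek/skip/count steps concrete, here is a minimal sketch in plain
Java. It deliberately does not call Lucene: PostingsSketch and its methods are
hypothetical names, and the in-memory map of term -> (docId -> frequency) is
only an assumption standing in for what the index stores. The lookup order is
the point, not the API.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Toy in-memory stand-in for an index: term -> (docId -> frequency).
// PostingsSketch and its methods are hypothetical names, not Lucene classes.
final class PostingsSketch {
    private final Map<String, TreeMap<Integer, Integer>> postings = new HashMap<>();

    // Index one document whose text has already been split into tokens.
    void add(int docId, String... tokens) {
        for (String token : tokens) {
            postings.computeIfAbsent(token, k -> new TreeMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    // "Seek" to the term, "skip" to the first posting at or after docId,
    // and return its frequency only if it is exactly the target document.
    int freq(String term, int docId) {
        TreeMap<Integer, Integer> docs = postings.get(term);
        if (docs == null) {
            return 0; // term is not in the index at all
        }
        Map.Entry<Integer, Integer> entry = docs.ceilingEntry(docId);
        return (entry != null && entry.getKey() == docId) ? entry.getValue() : 0;
    }

    // Repeat the lookup for each query term, keeping the query's term order.
    Map<String, Integer> countQueryTerms(int docId, String... queryTerms) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String term : queryTerms) {
            counts.put(term, freq(term, docId));
        }
        return counts;
    }
}
```

If document 1 is added with the tokens websphere, java, websphere, java,
websphere, then countQueryTerms(1, "websphere", "java") returns
{websphere=3, java=2}. In a real index the term text has to match whatever
form the analyzer stored, which this sketch sidesteps by indexing the tokens
exactly as given.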

But it's a much more complicated story if you want to accurately reflect a
query. For instance, consider a near query, that is, terms within, say, 3
positions of each other. If you do something like the above, you'll present
"hits" that aren't real. For instance...

a b c d e f g h i j a

If you search for a and c within 3 of each other, is this one hit? Two? It
definitely isn't three, which is what you'd get if you just counted the
occurrences of the terms a and c. What about a NOT clause? How does a phrase
query get counted?
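
The gap between the two counts can be checked on the token sequence above with
a small plain-Java sketch. NearCountSketch and its methods are hypothetical
names, and "within 3" is assumed here to mean a position difference of at most
3 in either order; Lucene's own slop calculation may count differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a Lucene class: compares raw term counts with
// the number of term pairs that actually satisfy a proximity constraint.
final class NearCountSketch {
    // Positions (0-based) at which the term occurs in the token sequence.
    static List<Integer> positions(String[] tokens, String term) {
        List<Integer> found = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(term)) {
                found.add(i);
            }
        }
        return found;
    }

    // Naive count: every occurrence of every query term, ignoring proximity.
    static int rawCount(String[] tokens, String... terms) {
        int total = 0;
        for (String term : terms) {
            total += positions(tokens, term).size();
        }
        return total;
    }

    // Pairs (first, second) whose positions differ by at most maxDistance.
    static int nearPairs(String[] tokens, String first, String second, int maxDistance) {
        int pairs = 0;
        for (int p : positions(tokens, first)) {
            for (int q : positions(tokens, second)) {
                if (Math.abs(p - q) <= maxDistance) {
                    pairs++;
                }
            }
        }
        return pairs;
    }
}
```

For "a b c d e f g h i j a", rawCount(tokens, "a", "c") is 3 (two a's and one
c), while nearPairs(tokens, "a", "c", 3) is 1, because only the first a is
close enough to the c. Whether that one pair should be shown as one hit or two
is exactly the question above.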

There have been several discussions of various aspects of this issue, but
often in the context of highlighting. You'll probably get some good
information from the following threads...

Counting terms' hits from phrases
Counting hits in a document

as well as searching the archive on highlighting and/or hitcount

Best
Erick




On 2/7/07, csahat <[EMAIL PROTECTED]> wrote:


Hi all,

  I'm sorry if this question has already been answered on this list, but I
searched the list and couldn't find the answer.

   This is what I want to do :

  When the user types in a query, for example "WebSphere Java", Lucene
should show not only the score but also the term count per document, like
this:

  doc1    0.8333    websphere=3, Java=2
  doc2    0.817     websphere=2, Java=2


  I already tried TermFreqVector, but it shows all the terms in the field,
whereas I want only the terms that appear in the query. I also tried
TermDocs, but it always gave a result of 0.

  I tried the Explanation class, using its toString method, but then I have
to "clean" the information.


  Is there any "direct" way to do this in Lucene? Or perhaps someone can
give me a hint?

  Thanks in advance
