Thanks Mike - I spent a few hours tracing through the explain process last
night and could see all that and it looked like most was reachable without
having to alter core classes. The other thing I thought of: since I'm doing
this as a one-time shot as messages come in (persisting aggregate counts), I
could segregate the term queries from the phrase queries and have a more
predictable collection of scorers. But then I might as well do an individual
search for each keyword. That seems a bit off too.

The basis of this function is to have near real-time performance of keywords
from incoming messages. Then we use those numbers for targeting. I index the
messages as they come in and then we can use all the great Lucene stuff for
searching and analysis after the fact. It's just the term/phrase thing
that's been frustrating me and I refuse to parse the output of explain. Just
something about that doesn't sit right. With a hundred vendors that could
have 30 keywords each, ouch.
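
For comparison, here is a minimal sketch of the position-comparison fallback mentioned in the original post. It assumes the token positions for each phrase word have already been pulled out of a document's term vector into sorted int arrays. The class and method names (PhraseCounter, countPhrase) are hypothetical and not part of Lucene.

```java
import java.util.Arrays;

// Hypothetical helper, not a Lucene class: counts exact phrase occurrences
// in ONE document from per-word token positions.
final class PhraseCounter {

    // positions[i] = sorted token positions of the i-th phrase word in the
    // document (assumed to come from the document's term vector).
    static int countPhrase(int[][] positions) {
        if (positions.length == 0) {
            return 0;
        }
        int count = 0;
        // Try every occurrence of the first word as a phrase start.
        for (int start : positions[0]) {
            boolean match = true;
            // Word i of the phrase must sit exactly i positions after the start.
            for (int i = 1; i < positions.length; i++) {
                if (Arrays.binarySearch(positions[i], start + i) < 0) {
                    match = false;
                    break;
                }
            }
            if (match) {
                count++;
            }
        }
        return count;
    }
}
```

A single keyword is just a one-word phrase here, so the same method covers both banana and "chocolate chips". Summing the per-document result over the hits would give the aggregate count, at the cost of loading term vectors for every hit.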

Thanks again!

-David-

-----Original Message-----
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Thursday, January 26, 2012 8:44 AM
To: java-user@lucene.apache.org
Subject: Re: Query term counting, again...

You should be able to use the Scorer.visitSubScorers API?  You'd do this up
front, to recursively gather all "interesting" scorers in the Query, and
then in a custom collector, in the collect method, you can go and ask each
subScorer whether it matched the current document (call its .freq() and see
if that is > 0), I think?

This is very expert territory and not well explored... and there are certain
cases where it will fail because of how boolean scorers work... but it
should otherwise work and scale well.
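
The control flow described above might be sketched roughly as follows. This is a plain-Java model only: SubMatcher is a hypothetical stand-in for a sub-scorer gathered up front, and KeywordTally stands in for the custom collector. Neither is a Lucene type, and the boolean-scorer caveats still apply.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the "gather sub-scorers, then ask each one in
// collect()" idea. None of these names are Lucene API.
final class KeywordTally {

    // Stand-in for one gathered sub-scorer (a term or a phrase).
    interface SubMatcher {
        String label(); // e.g. banana or "chocolate chips"
        int freq();     // occurrences in the current document, 0 if no match
    }

    private final List<SubMatcher> matchers;
    private final Map<String, Integer> totals = new LinkedHashMap<>();

    KeywordTally(List<SubMatcher> matchers) {
        this.matchers = matchers;
    }

    // Called once per hit, mirroring a collector's collect(doc). The doc id
    // is unused because each matcher is assumed to be positioned on the
    // current document already.
    void collect(int doc) {
        for (SubMatcher m : matchers) {
            int f = m.freq();
            if (f > 0) {
                totals.merge(m.label(), f, Integer::sum);
            }
        }
    }

    Map<String, Integer> totals() {
        return totals;
    }
}
```

Because the tally is keyed by the sub-scorer rather than by individual terms, a phrase is counted as one unit instead of as separate words.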

Mike McCandless

http://blog.mikemccandless.com

On Wed, Jan 25, 2012 at 6:36 PM, David Olson <da...@proxemx.com> wrote:
> Hi all,
>
> After much code and forum searching, I've hit a frustrating point that 
> should be more obvious. I've trawled through a ton of postings and 
> messaging on keyword counting and it seems like all the examples cover 
> single word terms. I've got several code bits I've written that can 
> get me what I want from a single term perspective but I have queries 
> with several terms that also mix in phrases. Ultimately I'd like to 
> have output that says banana - 2 times, "chocolate chips" - 4 times, over
> a course of 1000+ documents.
>
> Right now I walk through the query terms and match against the term 
> vectors from my hits. This, of course, makes the assumption chocolate 
> and chips are separate terms. Comparing positions seems like the only way.
>
> The frustrating point is that I see the 2 query types in the clauses 
> for the query. And, more annoying is that explain() does show what I 
> need and I haven't had a lot of luck backtracking what it's doing. 
> Spans didn't seem to help either.
>
> Any advice? I'm getting real good at single-term counting :)
>
> -DO
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Query-term-counting-again-tp3689354p3689354.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
