Thanks Mike - I spent a few hours tracing through the explain process last night and could see all of that, and it looked like most of it was reachable without having to alter core classes. The other thing I thought of: since I'm doing this as a one-time shot as messages come in (persisting aggregate counts), I could segregate the term queries from the phrase queries and have a more predictable collection of scorers. But then I might as well do an individual search for each keyword, and that seems a bit off too.
The basis of this function is to have near real-time performance of keywords from incoming messages. Then we use those numbers for targeting. I index the messages as they come in and then we can use all the great Lucene stuff for searching and analysis after the fact. It's just the term/phrase thing that's been frustrating me and I refuse to parse the output of explain. Just something about that doesn't sit right. With a hundred vendors that could have 30 keywords each, ouch.

Thanks again!

-David-

-----Original Message-----
From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: Thursday, January 26, 2012 8:44 AM
To: java-user@lucene.apache.org
Subject: Re: Query term counting, again...

You should be able to use the Scorer.visitSubScorers API?

You'd do this up front, to recursively gather all "interesting" scorers in the Query, and then in a custom collector, in the collect method, you can go and ask each subScorer whether it matched the current document (call its .freq() and see if that is > 0), I think?

This is very expert territory and not well explored... and there are certain cases where it will fail because of how boolean scorers work... but it should otherwise work and scale well.

Mike McCandless

http://blog.mikemccandless.com

On Wed, Jan 25, 2012 at 6:36 PM, David Olson <da...@proxemx.com> wrote:
> Hi all,
>
> After much code and forum searching, I've hit a frustrating point that
> should be more obvious. I've trolled through a ton of postings and
> messaging on keyword counting and it seems like all the examples cover
> single word terms. I've got several code bits I've written that can
> get me what I want from a single term perspective but I have queries
> with several terms that also mix in phrases. Ultimately I'd like to
> have output that says banana - 2 times, "chocolate chips" - 4 times,
> over a course of 1000+ documents.
>
> Right now I walk through the query terms and match against the term
> vectors from my hits.
> This, of course, makes the assumption that chocolate
> and chips are separate terms. Comparing positions seems like the only way.
>
> The frustrating point is that I see the 2 query types in the clauses
> for the query. And, more annoying, explain() does show what I
> need and I haven't had a lot of luck backtracking what it's doing.
> Spans didn't seem to help either.
>
> Any advice? I'm getting real good at single term counting :)
>
> -DO
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Query-term-counting-again-tp3689354p3689354.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
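
The position comparison mentioned in the original question can be sketched in plain Java. This is only an illustration under stated assumptions: it does not call any Lucene API, the class and method names (`KeywordCounter`, `countOccurrences`, `countAll`) are hypothetical, and it assumes each document has already been analyzed into an ordered, lowercased token list, roughly what term vectors with positions would provide. A single term is treated as a one-token phrase, so terms and phrases go through the same loop.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of Lucene: counts term and phrase hits by token position. */
public class KeywordCounter {

    /** Counts how often keywordTokens appear at consecutive positions in docTokens. */
    static int countOccurrences(List<String> docTokens, List<String> keywordTokens) {
        if (keywordTokens.isEmpty()) {
            return 0;
        }
        int count = 0;
        // Try every start position that leaves room for the whole keyword.
        for (int start = 0; start + keywordTokens.size() <= docTokens.size(); start++) {
            List<String> window = docTokens.subList(start, start + keywordTokens.size());
            if (window.equals(keywordTokens)) {
                count++;
            }
        }
        return count;
    }

    /** Sums occurrences of each keyword over all documents; keywords are split on whitespace. */
    static Map<String, Integer> countAll(List<List<String>> docs, List<String> keywords) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (String keyword : keywords) {
            // Assumption: documents were lowercased at analysis time, so lowercase the keyword too.
            List<String> keywordTokens = Arrays.asList(keyword.trim().toLowerCase().split("\\s+"));
            int total = 0;
            for (List<String> doc : docs) {
                total += countOccurrences(doc, keywordTokens);
            }
            totals.put(keyword, total);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Two toy "documents", already tokenized.
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("banana", "bread", "with", "chocolate", "chips"),
            Arrays.asList("chocolate", "chips", "and", "chocolate", "banana"));
        System.out.println(countAll(docs, Arrays.asList("banana", "chocolate chips")));
    }
}
```

With the two toy documents in `main`, both "banana" and "chocolate chips" are counted twice, and the stray "chocolate" followed by "banana" is not counted as a phrase hit. In a real index the token lists would have to come from stored positions, and position gaps introduced by the analyzer (stop words, for example) are ignored here.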