Eric,

Wildcard queries are tricky business. WildcardQuery by itself without leveraging any analysis tricks is what you've got, but you may want to consider injecting rotated tokens. For example, the word cat would be indexed as "cat$", "at$c", "t$ca", and "$cat" (all in the same position, increment 0). That's half the equation. The other half is to adjust the queries so that if someone searches for c*t that it becomes a WildcardQuery (or PrefixQuery in this case) for t$c*, making the search space much smaller.

CSRQ definitely isn't what you want for wildcard queries. Another alternative is to create a custom Filter, if its reasonable to extract wildcarded clauses from a query expression, that can enumerate terms as efficiently as possible (like WildcardTermEnum does) and lights up only the documents that contain matching terms - this would eliminate the TooManyClauses headache.

There really isn't anything pre-built that does what you're after any better than the suggestions above, I don't think.

        Erik


On Apr 7, 2006, at 10:06 AM, Erick Erickson wrote:

OK, I know I'm asking you to write my code for me (or at least point me to
an example), but I'm at my wits end, so please rescue me....

This is a reprise of TooManyClauses. We have a large amount of text, and a requirement to do a wildcard query. Of course, it's waaaay too big to use Wildcard or the other "expanding" queries. They frighten me anyway.....

y'all pointed me at the ConstantScoreRangeQuery (CSRQ), but actually using
it is not making sense to me.

I just don't get how, for instance, CSRQ helps me that much. Say I want to search for big*er. I can use a CSRQ to get all the docs that include this term, just by using biga and bigz as my min/max terms. But then I'm stuck. I could iterate through all the docs returned, but that seems inefficient. Not to mention that the HitCollector (?) class warns against this due to "an
order of magnitude" decrease in response time.

What I *want* is a way to, for each doc in the CSRQ, get to answer whether it's a match. Really, on the order of a callback with the value that worked for the CSRQ and the ability to return a yes/no or a ranking. Again, I can
interate all the docs matched, but this seems expensive.

Using filters doesn't really seem to do the trick for me either. If I
understand them properly, they allow me to set up a bitset for all the
documents that should be searched. All 1,000,000 of them? Or am I thinking
about this completely backwards? I have LIA, but I'm also wondering if
there's something in 1.9 that I haven't found yet.

Now, given how easy the rest of Lucene is to use, I assume that I'm
approaching this poorly, but I sure am stumped.

All that said, I'm quite Java-naieve, so please bear with me if this
question demonstrates my ignorance painfully.....

Thanks
Erick


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to