Erick,
Wildcard queries are tricky business. A plain WildcardQuery, without
any analysis tricks, is what you've got now, but you may want to
consider injecting rotated tokens. For example, the word cat would be
indexed as "cat$", "at$c", "t$ca", and "$cat" (all in the same
position, position increment 0). That's half the equation. The other
half is to rewrite the queries so that a search for c*t becomes a
WildcardQuery (or a PrefixQuery in this case) for t$c*, which makes
the search space much smaller.
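Here is a minimal sketch of both halves in plain Java, assuming a "$"
end-of-word marker and a pattern with exactly one "*". RotatedTokens,
rotations and rotateQuery are hypothetical names, not Lucene classes;
in a real index the rotations would come out of a custom TokenFilter
and the rotated pattern would go into a PrefixQuery.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Lucene. */
final class RotatedTokens {
    // End-of-word marker; assumed never to occur inside a real term.
    static final char END = '$';

    /** All rotations of term + "$", e.g. cat -> cat$, at$c, t$ca, $cat. */
    static List<String> rotations(String term) {
        String marked = term + END;
        List<String> out = new ArrayList<>();
        for (int i = 0; i < marked.length(); i++) {
            out.add(marked.substring(i) + marked.substring(0, i));
        }
        return out;
    }

    /**
     * Rotates a pattern with exactly one '*' so the wildcard ends up
     * last, which turns it into a prefix pattern: c*t -> t$c*.
     */
    static String rotateQuery(String pattern) {
        int star = pattern.indexOf('*');
        if (star < 0 || star != pattern.lastIndexOf('*')) {
            throw new IllegalArgumentException(
                "expected exactly one '*': " + pattern);
        }
        String marked = pattern + END;
        // Text after the '*' (plus marker) moves to the front.
        return marked.substring(star + 1) + marked.substring(0, star + 1);
    }
}
```

With this, rotateQuery("big*er") gives "er$big*", so only terms whose
rotation starts with "er$big" need to be visited. The cost is index
size: every term of length n is stored n + 1 times.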
CSRQ definitely isn't what you want for wildcard queries. Another
alternative, if it's reasonable to extract the wildcarded clauses from
a query expression, is to create a custom Filter that enumerates terms
as efficiently as possible (like WildcardTermEnum does) and lights up
only the documents that contain matching terms. This would eliminate
the TooManyClauses headache.
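To make the Filter idea concrete, here is a rough sketch in plain Java
rather than against the Lucene API. It assumes a single "*" in the
pattern, and a sorted map from term to document numbers stands in for
the index's term dictionary and postings. WildcardBits and its bits
method are hypothetical names; a real Filter would walk the index's
term and document enumerations instead.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.SortedMap;

/** Hypothetical stand-in for a custom Filter; not Lucene code. */
final class WildcardBits {
    /** Sets one bit per document holding a term matching prefix*suffix. */
    static BitSet bits(SortedMap<String, int[]> postings,
                       String pattern, int maxDoc) {
        int star = pattern.indexOf('*');
        if (star < 0 || star != pattern.lastIndexOf('*')) {
            throw new IllegalArgumentException(
                "expected exactly one '*': " + pattern);
        }
        String prefix = pattern.substring(0, star);
        String suffix = pattern.substring(star + 1);
        BitSet result = new BitSet(maxDoc);
        // Seek to the first term >= prefix and stop as soon as terms
        // no longer share it, so only the "big..." range is scanned.
        for (Map.Entry<String, int[]> e
                : postings.tailMap(prefix).entrySet()) {
            String term = e.getKey();
            if (!term.startsWith(prefix)) {
                break;
            }
            // Length check keeps prefix and suffix from overlapping.
            if (term.length() >= prefix.length() + suffix.length()
                    && term.endsWith(suffix)) {
                for (int doc : e.getValue()) {
                    result.set(doc);
                }
            }
        }
        return result;
    }
}
```

The point of the design is that matching terms are never expanded into
query clauses: each one only flips bits, so the number of matching
terms no longer matters for TooManyClauses. The bitset has one bit per
document in the index, which is small even for 1,000,000 documents.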
There really isn't anything pre-built that does what you're after any
better than the suggestions above, I don't think.
Erik
On Apr 7, 2006, at 10:06 AM, Erick Erickson wrote:
OK, I know I'm asking you to write my code for me (or at least point
me to an example), but I'm at my wits' end, so please rescue me....
This is a reprise of TooManyClauses. We have a large amount of text,
and a requirement to do a wildcard query. Of course, it's waaaay too
big to use Wildcard or the other "expanding" queries. They frighten me
anyway.....
Y'all pointed me at the ConstantScoreRangeQuery (CSRQ), but actually
using it is not making sense to me.
I just don't get how, for instance, CSRQ helps me that much. Say I
want to search for big*er. I can use a CSRQ to get all the docs that
include this term, just by using biga and bigz as my min/max terms.
But then I'm stuck. I could iterate through all the docs returned, but
that seems inefficient. Not to mention that the HitCollector (?) class
warns against this due to "an order of magnitude" increase in response
time.
What I *want* is a way, for each doc in the CSRQ, to answer whether
it's a match. Really, something on the order of a callback with the
value that worked for the CSRQ and the ability to return a yes/no or a
ranking. Again, I can iterate over all the docs matched, but this
seems expensive.
Using filters doesn't really seem to do the trick for me either. If I
understand them properly, they allow me to set up a bitset for all the
documents that should be searched. All 1,000,000 of them? Or am I
thinking about this completely backwards? I have LIA, but I'm also
wondering if there's something in 1.9 that I haven't found yet.
Now, given how easy the rest of Lucene is to use, I assume that I'm
approaching this poorly, but I sure am stumped.
All that said, I'm quite Java-naive, so please bear with me if this
question demonstrates my ignorance painfully.....
Thanks
Erick