Re: I just don't get wildcards at all.

Erik Hatcher Sat, 08 Apr 2006 04:05:29 -0700

Eric,

Wildcard queries are tricky business. WildcardQuery by itself without leveraging any analysis tricks is what you've got, but you may want to consider injecting rotated tokens. For example, the word cat would be indexed as "cat$", "at$c", "t$ca", and "$cat" (all in the same position, increment 0). That's half the equation. The other half is to adjust the queries so that if someone searches for c*t, it becomes a WildcardQuery (or PrefixQuery in this case) for t$c*, making the search space much smaller.
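
A minimal sketch of the rotation idea in plain Java, with no Lucene classes involved. The class and method names (RotatedTokens, rotations, rotatedPrefix, matches) are made up for illustration, and it assumes that terms never contain the '$' marker and that the pattern has exactly one '*'. In a real index the rotations would be emitted by a custom token filter and the prefix handed to a PrefixQuery; here a plain startsWith check stands in for both.

```java
import java.util.ArrayList;
import java.util.List;

final class RotatedTokens {
    // End-of-term marker; assumed never to occur inside a real term.
    static final char END = '$';

    // All rotations of term + END, e.g. "cat" -> cat$, at$c, t$ca, $cat.
    static List<String> rotations(String term) {
        String marked = term + END;
        List<String> out = new ArrayList<>();
        for (int i = 0; i < marked.length(); i++) {
            out.add(marked.substring(i) + marked.substring(0, i));
        }
        return out;
    }

    // Rewrite a single-'*' pattern into a prefix over the rotated terms,
    // e.g. "c*t" -> "t$c", which is then searched as the prefix query t$c*.
    static String rotatedPrefix(String pattern) {
        int star = pattern.indexOf('*');
        if (star < 0 || star != pattern.lastIndexOf('*')) {
            throw new IllegalArgumentException("expected exactly one '*': " + pattern);
        }
        return pattern.substring(star + 1) + END + pattern.substring(0, star);
    }

    // Stand-in for the prefix query: does any rotation of the term start
    // with the rotated prefix?
    static boolean matches(String term, String pattern) {
        String prefix = rotatedPrefix(pattern);
        for (String rotation : rotations(term)) {
            if (rotation.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

With this, RotatedTokens.matches("cat", "c*t") is true and RotatedTokens.matches("cab", "c*t") is false. The cost is index size: each term of length n is stored n + 1 times.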

CSRQ definitely isn't what you want for wildcard queries. Another alternative is to create a custom Filter, if it's reasonable to extract wildcarded clauses from a query expression, that can enumerate terms as efficiently as possible (like WildcardTermEnum does) and light up only the documents that contain matching terms. This would eliminate the TooManyClauses headache.
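
The filter idea can be sketched in plain Java as well. This is not Lucene's Filter API: a sorted map from term to document ids stands in for the term dictionary and postings, and the names (WildcardBitsSketch, bits, toRegex) are hypothetical. It only handles '*', not '?'. The point it illustrates is that the terms are walked once, starting at the literal prefix before the first '*', and matching documents are marked in a BitSet, so no BooleanQuery clauses are ever built.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.SortedMap;
import java.util.regex.Pattern;

final class WildcardBitsSketch {
    // termDocs: sorted term -> ids of the documents containing that term
    // (a stand-in for the index's term dictionary and postings).
    static BitSet bits(SortedMap<String, int[]> termDocs, String wildcard, int maxDoc) {
        int star = wildcard.indexOf('*');
        String prefix = star < 0 ? wildcard : wildcard.substring(0, star);
        Pattern pattern = toRegex(wildcard);
        BitSet result = new BitSet(maxDoc);
        // Start at the first term >= prefix and stop as soon as the prefix
        // no longer matches, so only a slice of the dictionary is visited.
        for (Map.Entry<String, int[]> e : termDocs.tailMap(prefix).entrySet()) {
            String term = e.getKey();
            if (!term.startsWith(prefix)) {
                break;
            }
            if (pattern.matcher(term).matches()) {
                for (int doc : e.getValue()) {
                    result.set(doc);
                }
            }
        }
        return result;
    }

    // Translate a '*'-only wildcard into a regex, quoting the literal parts.
    static Pattern toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        String[] parts = wildcard.split("\\*", -1);
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                sb.append(".*");
            }
            sb.append(Pattern.quote(parts[i]));
        }
        return Pattern.compile(sb.toString());
    }
}
```

The resulting BitSet has one bit per document, so even for 1,000,000 documents it is roughly 125 KB, and its size does not depend on how many terms the wildcard expands to.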

There really isn't anything pre-built that does what you're after any better than the suggestions above, I don't think.


        Erik


On Apr 7, 2006, at 10:06 AM, Erick Erickson wrote:

OK, I know I'm asking you to write my code for me (or at least point me to
an example), but I'm at my wits end, so please rescue me....
This is a reprise of TooManyClauses. We have a large amount of text, and a requirement to do a wildcard query. Of course, it's waaaay too big to use Wildcard or the other "expanding" queries. They frighten me anyway.....
Y'all pointed me at the ConstantScoreRangeQuery (CSRQ), but actually using
it is not making sense to me.
I just don't get how, for instance, CSRQ helps me that much. Say I want to search for big*er. I can use a CSRQ to get all the docs that include this term, just by using biga and bigz as my min/max terms. But then I'm stuck. I could iterate through all the docs returned, but that seems inefficient. Not to mention that the HitCollector (?) class warns against this due to "an
order of magnitude" decrease in response time.
What I *want* is a way, for each doc in the CSRQ, to answer whether it's a match. Really, something on the order of a callback with the value that worked for the CSRQ and the ability to return a yes/no or a ranking. Again, I can
iterate all the docs matched, but this seems expensive.

Using filters doesn't really seem to do the trick for me either. If I
understand them properly, they allow me to set up a bitset for all the
documents that should be searched. All 1,000,000 of them? Or am I thinking
about this completely backwards? I have LIA, but I'm also wondering if
there's something in 1.9 that I haven't found yet.

Now, given how easy the rest of Lucene is to use, I assume that I'm
approaching this poorly, but I sure am stumped.

All that said, I'm quite Java-naive, so please bear with me if this
question demonstrates my ignorance painfully.....

Thanks
Erick



