Doron:

Thanks for the suggestion, I'll certainly put it on my list, depending upon
what the PM decides. This app is geneaology reasearch, and users *can* put
in their own wildcards...

This is why I love this list... lots of smart people giving me suggestions I
never would have thought of <G>...

Thanks
Erick

On 10/9/06, Doron Cohen <[EMAIL PROTECTED]> wrote:

"Erick Erickson" <[EMAIL PROTECTED]> wrote on 09/10/2006 13:09:21:
> ... The kicker is that what we are indexing is
> OCR data, some of which is pretty trashy. So you wind up with
"interesting"
> words in your index, things like rtyHrS. So the whole question of
allowing
> very specific queries on detailed wildcards (combined with spans) is
under
> discussion. It's not at all clear to me that there's any value to the
end
> users in the capability of, say, two character prefixes. And, it's an
easy
> rule that "prefix queries must specify at least 3 non-wildcard
> characters"....

Erick, I may be out of course here, but, fwiw, have you considered n-gram
indexing/search for a degree of fuzziness to compensate for OCR errors..?

For a four words query you would probably get ~20 tokens (bigrams?) - no
matter what the index size is. You would then probably want to score
higher
by LA (lexical affinity - query terms appear close to each other in the
document) - and I am not sure to what degree a span query (made of n-gram
terms) would serve that, because (1) all terms in the span need to be
there
(well, I think:-); and, (2) you would like to increase doc score for
close-by terms only for close-by query n-grams.

So there might not be a ready to use solution in Lucene for this, but
perhaps this is a more robust direction to try than the wild card approach

- I mean, if users want to type a wild card query, it is their right to do
so, but for an application logic this does not seem the best choice.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Reply via email to