Hi Erick,

Thanks for the suggestions. I've used indexed n-grams before to implement spell-checking; I think in this case I may take a look at WildcardTermEnum and RegexTermEnum. It seems like a good solution because I am doing my own results ordering so Lucene's scoring is irrelevant in this case. I wasn't aware of these classes so thanks for mentioning them!

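A rough sketch of the term-enumeration idea, in plain Java rather than Lucene code. The sorted set is only a stand-in for the index's term dictionary, and TermScanSketch and matchingTerms are hypothetical names; in Lucene, WildcardTermEnum or RegexTermEnum would do the enumeration against the real index, and the collected terms would then feed a filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Hypothetical illustration only: emulates scanning a term dictionary.
class TermScanSketch {

    // Walk every term and keep the ones that contain the query anywhere.
    // Cost is linear in the number of distinct terms, which should be
    // acceptable for a field that holds only usernames.
    static List<String> matchingTerms(SortedSet<String> terms, String query) {
        // Pattern.quote makes the user's input a literal, not a regex.
        Pattern pattern = Pattern.compile(".*" + Pattern.quote(query) + ".*");
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            if (pattern.matcher(term).matches()) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Stand-in for the indexed usernames (assumed sample data).
        SortedSet<String> usernames =
                new TreeSet<>(Arrays.asList("mark", "smith", "erick", "mitchell"));
        System.out.println(matchingTerms(usernames, "mit")); // prints [mitchell, smith]
    }
}
```

Because the matching terms are collected up front and no scoring is involved, this fits the case where result ordering is handled outside Lucene.
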
Best,
Mark

On Wed, Jun 25, 2008 at 12:25 PM, Erick Erickson <[EMAIL PROTECTED]> wrote:

> Warning: I don't understand n-grams at all, so you should
> read this as a plea for those who do to tell me I'm off base <G>.
>
> But I wonder if indexing as n-grams would be a way to
> cope with this issue that lots of people have. *Assuming*
> you are thinking about single terms, then it seems that
> "smith" would be tokenized as sm, mi, it, th. Then
> a wildcard search for "mi it" would hit (as a phrase
> query or a SpanQuery with slop of 0). It seems like there
> are several issues to work out here, especially including
> multiple terms, matching mixtures of wildcards and
> non-wildcards, etc.
>
> But it seems do-able....
>
> Another approach is to use WildcardTermEnum and/or
> RegexTermEnum to build up a filter and use the filter as
> part of the query. What you lose with this approach is
> that the filter (and wildcards) then don't contribute to
> scoring. But this isn't a huge price to pay...
>
> Best
> Erick
>
> On Wed, Jun 25, 2008 at 1:47 PM, Mark Ferguson <[EMAIL PROTECTED]>
> wrote:
>
> > Hello,
> >
> > I am currently keeping an index of all our client's usernames. The
> > search functionality is implemented using a PrefixFilter. However, we
> > would like to expand the functionality to be able to search any part
> > of a user's name, rather than requiring that it begin with the query
> > string. So for example, the search term 'mit' would return the
> > username 'smith'.
> >
> > I am hesitant to use a WildcardQuery starting with an asterisk because
> > I've read about why this is a bad idea. I am looking for suggestions
> > on the best way to implement this.
> >
> > The idea I've come up with is to index each part of the username; so
> > for example, if the username is 'mark', you would index mark, ark, rk,
> > and k. Then you could still use the PrefixFilter. I'm not overly
> > concerned about how this would enlarge the index because usernames
> > tend to be fairly short.
> >
> > I am very much open to other suggestions however. Does anyone have any
> > opinions or ideas that they can share?
> >
> > Thanks very much.
> >
> > Mark
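
The suffix-indexing idea from the original message can be sketched in plain Java. This is only an illustration under stated assumptions, not Lucene code: the sorted set stands in for the indexed terms that a PrefixFilter would scan, and SuffixIndexSketch, suffixes, and matches are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration only: substring search via indexed suffixes.
class SuffixIndexSketch {

    // All non-empty suffixes of a username: "mark" -> mark, ark, rk, k.
    static List<String> suffixes(String username) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < username.length(); i++) {
            result.add(username.substring(i));
        }
        return result;
    }

    // A query occurs somewhere in a username exactly when it is a prefix
    // of one of that username's suffixes, so a prefix lookup is enough.
    static boolean matches(SortedSet<String> indexedSuffixes, String query) {
        // tailSet starts at the first suffix >= query in sorted order; if
        // any suffix starts with the query, that first element does.
        SortedSet<String> tail = indexedSuffixes.tailSet(query);
        return !tail.isEmpty() && tail.first().startsWith(query);
    }

    public static void main(String[] args) {
        SortedSet<String> index = new TreeSet<>(suffixes("smith"));
        System.out.println(matches(index, "mit"));  // prints true
        System.out.println(matches(index, "mark")); // prints false
    }
}
```

A username of length n adds n terms, which is why short usernames keep the index growth modest. A real index would also need to map each suffix back to the full username it came from.
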