Hi!

Basically, what I want is something that removes punctuation. But I have now realized that things like email or number recognition are also very useful if I want to give suggestions. I want to be able to give 'nl-lt001' as a suggestion when the user enters 'nl'. This would of course not be possible if the tokenizer just blindly split at the '-'. So I'll stick with the tokenizer for now and fix the problems I had with the splitting of words by building the queries differently.

Thanks for your help!
- Anna

--- Simon Willnauer <simon.willna...@googlemail.com> wrote on Fri, 18 Jun 2010:

> From: Simon Willnauer <simon.willna...@googlemail.com>
> Subject: Re: Strange behaviour of StandardTokenizer
> To: java-user@lucene.apache.org
> Date: Friday, 18 June 2010, 09:52
>
> Hi Anna,
>
> what are you using your tokenizer for? There are a lot of different
> options in Lucene, and StandardTokenizer is not necessarily the best
> one. The behaviour you are seeing is that the tokenizer detects your
> token as a number. When you look at the grammar, that is kind of
> obvious.
>
> <snip>
> // floating point, serial, model numbers, ip addresses, etc.
> // every other segment must have at least one digit
> NUM = ({ALPHANUM} {P} {HAS_DIGIT}
>      | {HAS_DIGIT} {P} {ALPHANUM}
>      | {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+
>      | {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
>      | {ALPHANUM} {P} {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
>      | {HAS_DIGIT} {P} {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+)
>
> // punctuation
> P = ("_"|"-"|"/"|"."|",")
> </snip>
>
> You can either build your own custom filter which fixes only the
> problem with numbers containing a '-', use the MappingCharFilter, or
> switch to a different tokenizer.
> If you could talk more about your use case you might get better
> suggestions.
>
> Simon
>
> On Fri, Jun 18, 2010 at 9:03 AM, Anna Hunecke <annahune...@yahoo.de> wrote:
> > Hi Ahmet,
> > thanks for the explanation. :)
> > Okay, so it is recognized as a number? I didn't really expect that.
> > I expected that all words are either split at the minus or not.
> > Maybe I'll have to use another tokenizer.
> > Best,
> > Anna
> >
> > --- Ahmet Arslan <iori...@yahoo.com> wrote on Thu, 17 Jun 2010:
> >
> >> From: Ahmet Arslan <iori...@yahoo.com>
> >> Subject: Re: Strange behaviour of StandardTokenizer
> >> To: java-user@lucene.apache.org
> >> Date: Thursday, 17 June 2010, 15:50
> >>
> >> > I ran into a strange behaviour of the StandardTokenizer.
> >> > Terms containing a '-' are tokenized differently depending
> >> > on the context.
> >> > For example, the term 'nl-lt' is split into 'nl' and 'lt'.
> >> > The term 'nl-lt0' is tokenized into 'nl-lt0'.
> >> > Is this a bug or a feature?
> >>
> >> It is designed that way. The TypeAttribute of those tokens is
> >> different.
> >>
> >> > Can I avoid it somehow?
> >>
> >> Do you want to split at the '-' char no matter what? If yes,
> >> you can replace all '-' characters with whitespace using
> >> MappingCharFilter before StandardTokenizer.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
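
For anyone reading this thread later, the NUM rule quoted above can be approximated with plain java.util.regex to see why 'nl-lt' is split while 'nl-lt0' stays whole. This is only a rough sketch and not Lucene code: the class name NumRuleSketch and its methods are made up for illustration, and the definitions of ALPHANUM (one or more letters or digits) and HAS_DIGIT (letters or digits with at least one digit) are assumptions, since the excerpt does not show them. The real tokenizer also has rules for emails, hosts, acronyms and so on, which are ignored here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: mimics only the NUM rule from the thread.
class NumRuleSketch {
    // Assumed definitions (not shown in the quoted grammar excerpt).
    static final String ALPHANUM = "[\\p{L}\\p{Nd}]+";
    static final String HAS_DIGIT = "[\\p{L}\\p{Nd}]*\\p{Nd}[\\p{L}\\p{Nd}]*";
    // Punctuation class P as quoted: _ - / . ,
    static final String P = "[_\\-/.,]";

    // The six alternatives of NUM, in the order they appear in the grammar.
    static final Pattern NUM = Pattern.compile(
        ALPHANUM + P + HAS_DIGIT
        + "|" + HAS_DIGIT + P + ALPHANUM
        + "|" + ALPHANUM + "(?:" + P + HAS_DIGIT + P + ALPHANUM + ")+"
        + "|" + HAS_DIGIT + "(?:" + P + ALPHANUM + P + HAS_DIGIT + ")+"
        + "|" + ALPHANUM + P + HAS_DIGIT + "(?:" + P + ALPHANUM + P + HAS_DIGIT + ")+"
        + "|" + HAS_DIGIT + P + ALPHANUM + "(?:" + P + HAS_DIGIT + P + ALPHANUM + ")+");

    // True if the whole term matches the approximated NUM rule.
    static boolean isNum(String term) {
        return NUM.matcher(term).matches();
    }

    // Simplified tokenization: whitespace-separated chunks that match NUM are
    // kept whole, everything else is split at non-alphanumeric characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : text.trim().split("\\s+")) {
            if (chunk.isEmpty()) {
                continue;
            }
            if (isNum(chunk)) {
                tokens.add(chunk);
            } else {
                for (String part : chunk.split("[^\\p{L}\\p{Nd}]+")) {
                    if (!part.isEmpty()) {
                        tokens.add(part);
                    }
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String term : new String[] {"nl-lt", "nl-lt0", "nl-lt001"}) {
            System.out.println(term + " -> " + tokenize(term));
        }
    }
}
```

Under these assumptions, 'nl-lt' has no segment containing a digit, so no NUM alternative applies and it is split at the '-', while 'nl-lt0' and 'nl-lt001' match the first alternative ({ALPHANUM} {P} {HAS_DIGIT}) and are kept as single tokens.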