Hi Ahmet, thanks for the explanation. :) okay, so it is recognized as a number? I didn't expect that really. I expect that all words are either split at the minus or not. Maybe I'll have to use another tokenizer. Best, Anna
--- Ahmet Arslan <iori...@yahoo.com> schrieb am Do, 17.6.2010: > Von: Ahmet Arslan <iori...@yahoo.com> > Betreff: Re: Strange behaviour of StandardTokenizer > An: java-user@lucene.apache.org > Datum: Donnerstag, 17. Juni, 2010 15:50 Uhr > > > I ran into a strange behaviour of the > StandardTokenizer. > > Terms containing a '-' are tokenized differently > depending > > on the context. > > For example, the term 'nl-lt' is split into 'nl' and > 'lt'. > > The term 'nl-lt0' is tokenized into 'nl-lt0'. > > Is this a bug or a feature? > > It is designed that way. TypeAttribute of those tokens are > different. > > > Can I avoid it somehow? > > Do you want to split at '-' char no matter what? If yes, > you can replace all '-' characters with whitespace using > MappingCharFilter before StandardTokenizer. > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org