Re: Strange behaviour of StandardTokenizer

2010-06-21 Thread Anna Hunecke
18.6.2010: > From: Simon Willnauer > Subject: Re: Strange behaviour of StandardTokenizer > To: java-user@lucene.apache.org > Date: Friday, 18 June 2010, 09:52 > Hi Anna, > > what are you using your tokenizer for? There are a lot of different options in Lucene
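One such option, sketched under the assumption of Lucene 3.0-era APIs: a WhitespaceTokenizer (plus LowerCaseFilter) splits on whitespace only, so hyphenated terms like 'nl-lt' stay in one piece whether or not they contain digits.

    import java.io.StringReader;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    public class WhitespaceDemo {
      public static void main(String[] args) throws Exception {
        // WhitespaceTokenizer splits on whitespace only, so hyphens are kept
        TokenStream ts = new LowerCaseFilter(
            new WhitespaceTokenizer(new StringReader("nl-lt nl-lt0")));
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        while (ts.incrementToken()) {
          System.out.println(term.term()); // nl-lt, nl-lt0
        }
        ts.close();
      }
    }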

Re: Strange behaviour of StandardTokenizer

2010-06-18 Thread Simon Willnauer
for the explanation. :) > okay, so it is recognized as a number? I didn't expect that, really. I expect > that all words are either split at the minus or not. > Maybe I'll have to use another tokenizer. > Best, > Anna > > --- Ahmet Arslan wrote on Thu, 17 Jun 2010

Re: Strange behaviour of StandardTokenizer

2010-06-18 Thread Ahmet Arslan
> okay, so it is recognized as a number? Yes. You can see the token type definitions in the *.jflex file. > Maybe I'll have to use another tokenizer. There is also the option of using a MappingCharFilter in front of the StandardTokenizer: NormalizeCharMap map = new NormalizeCharMap(); map.add("-", " "); TokenStream stream = new StandardTokenizer(
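A complete version of that snippet, sketched under the assumption of Lucene 3.0-era APIs (NormalizeCharMap, MappingCharFilter, CharReader, and the Version-based StandardTokenizer constructor); the mapping turns every '-' into a space before the tokenizer sees the text, so 'nl-lt' and 'nl-lt0' are split the same way.

    import java.io.StringReader;
    import org.apache.lucene.analysis.CharReader;
    import org.apache.lucene.analysis.MappingCharFilter;
    import org.apache.lucene.analysis.NormalizeCharMap;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.util.Version;

    public class MappingCharFilterDemo {
      public static void main(String[] args) throws Exception {
        // map '-' to ' ' so the tokenizer never sees the hyphen
        NormalizeCharMap map = new NormalizeCharMap();
        map.add("-", " ");
        TokenStream stream = new StandardTokenizer(Version.LUCENE_30,
            new MappingCharFilter(map, CharReader.get(new StringReader("nl-lt nl-lt0"))));
        TermAttribute term = stream.addAttribute(TermAttribute.class);
        while (stream.incrementToken()) {
          System.out.println(term.term()); // nl, lt, nl, lt0
        }
        stream.close();
      }
    }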

Re: Strange behaviour of StandardTokenizer

2010-06-18 Thread Anna Hunecke
Ahmet Arslan > Subject: Re: Strange behaviour of StandardTokenizer > To: java-user@lucene.apache.org > Date: Thursday, 17 June 2010, 15:50 > > > I ran into a strange behaviour of the > StandardTokenizer. > > Terms containing a '-' are tokenized differently

Re: Strange behaviour of StandardTokenizer

2010-06-17 Thread Ahmet Arslan
> I ran into a strange behaviour of the StandardTokenizer. > Terms containing a '-' are tokenized differently depending > on the context. > For example, the term 'nl-lt' is split into 'nl' and 'lt'. > The term 'nl-lt0' is tokenized into 'nl-lt0'. > Is this a bug or a feature? It is designed that way: a term containing a digit is recognized as a number and kept in one piece.

Strange behaviour of StandardTokenizer

2010-06-17 Thread Anna Hunecke
Hi! I ran into a strange behaviour of the StandardTokenizer. Terms containing a '-' are tokenized differently depending on the context. For example, the term 'nl-lt' is split into 'nl' and 'lt'. The term 'nl-lt0' is tokenized into 'nl-lt0'. Is this a bug or a feature? Can I avoid it somehow? I'm
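A minimal sketch that reproduces the reported behaviour, assuming Lucene 3.0-era APIs (StandardTokenizer(Version, Reader), TermAttribute, TypeAttribute); the two input strings are the examples from the question.

    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
    import org.apache.lucene.util.Version;

    public class StandardTokenizerDemo {
      public static void main(String[] args) throws Exception {
        for (String text : new String[] { "nl-lt", "nl-lt0" }) {
          TokenStream ts = new StandardTokenizer(Version.LUCENE_30, new StringReader(text));
          TermAttribute term = ts.addAttribute(TermAttribute.class);
          TypeAttribute type = ts.addAttribute(TypeAttribute.class);
          System.out.println(text + ":");
          while (ts.incrementToken()) {
            // 'nl-lt' comes out as two ALPHANUM tokens, 'nl-lt0' as one NUM token
            System.out.println("  " + term.term() + "  (" + type.type() + ")");
          }
          ts.close();
        }
      }
    }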