18.6.2010:
> From: Simon Willnauer
> Subject: Re: Strange behaviour of StandardTokenizer
> To: java-user@lucene.apache.org
> Date: Friday, 18 June 2010, 09:52
> Hi Anna,
>
> what are you using your tokenizer for? There are a lot of
> different
> options in Lucene
for the explanation. :)
> okay, so it is recognized as a number? I really didn't expect that. I would
> expect all words to be either split at the hyphen or not.
> Maybe I'll have to use another tokenizer.
> Best,
> Anna
>
> --- Ahmet Arslan wrote on Thu, 17.6.2010
> okay, so it is recognized as a number?
Yes. You can see the token type definitions in the *.jflex file.
> Maybe I'll have to use another tokenizer.
Another option is to use MappingCharFilter together with StandardTokenizer.
NormalizeCharMap map = new NormalizeCharMap();
map.add("-", " ");
TokenStream stream = new Sta
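
The snippet above is cut off in the archive, so the actual Lucene wiring is not reconstructed here. As a rough illustration of the idea only (map '-' to a space before the tokenizer ever sees the text), here is a minimal plain-Java sketch. It does not use Lucene at all: `HyphenMappingSketch`, `applyCharMap`, `tokenize` and `analyze` are made-up names, and the whitespace split is just a stand-in for a real tokenizer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; it does not use or reproduce the Lucene API.
final class HyphenMappingSketch {

    // Hypothetical stand-in for a character map: every key is replaced by its value.
    static String applyCharMap(String input, Map<String, String> charMap) {
        String result = input;
        for (Map.Entry<String, String> entry : charMap.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    // Stand-in tokenizer: splits on runs of whitespace and drops empty pieces.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        for (String part : input.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Applies the "-" -> " " mapping first, then tokenizes the mapped text.
    static List<String> analyze(String input) {
        Map<String, String> charMap = new LinkedHashMap<>();
        charMap.put("-", " ");
        return tokenize(applyCharMap(input, charMap));
    }

    public static void main(String[] args) {
        System.out.println(analyze("nl-lt"));   // prints [nl, lt]
        System.out.println(analyze("nl-lt0"));  // prints [nl, lt0]
    }
}
```

Because the hyphen is gone before tokenization, both 'nl-lt' and 'nl-lt0' are split the same way, which is the consistent behaviour asked for in the question.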
Ahmet Arslan
> Subject: Re: Strange behaviour of StandardTokenizer
> To: java-user@lucene.apache.org
> Date: Thursday, 17 June 2010, 15:50
>
> I ran into a strange behaviour of the StandardTokenizer.
> Terms containing a '-' are tokenized differently depending
> on the context.
> For example, the term 'nl-lt' is split into 'nl' and 'lt'.
> The term 'nl-lt0' is tokenized into 'nl-lt0'.
> Is this a bug or a feature?
It is designed tha
Hi!
I ran into a strange behaviour of the StandardTokenizer. Terms containing a '-'
are tokenized differently depending on the context.
For example, the term 'nl-lt' is split into 'nl' and 'lt'.
The term 'nl-lt0' is tokenized into 'nl-lt0'.
Is this a bug or a feature? Can I avoid it somehow?
I'm