Re: Strange behaviour of StandardTokenizer

Anna Hunecke Fri, 18 Jun 2010 00:04:12 -0700

Hi Ahmet,
thanks for the explanation. :)
okay, so it is recognized as a number? I didn't expect that really. I expect 
that all words are either split at the minus or not.
Maybe I'll have to use another tokenizer.
Best,
Anna


--- Ahmet Arslan <iori...@yahoo.com> schrieb am Do, 17.6.2010:

> Von: Ahmet Arslan <iori...@yahoo.com>
> Betreff: Re: Strange behaviour of StandardTokenizer
> An: java-user@lucene.apache.org
> Datum: Donnerstag, 17. Juni, 2010 15:50 Uhr
> 
> > I ran into a strange behaviour of the
> StandardTokenizer.
> > Terms containing a '-' are tokenized differently
> depending
> > on the context. 
> > For example, the term 'nl-lt' is split into 'nl' and
> 'lt'.
> > The term 'nl-lt0' is tokenized into 'nl-lt0'.
> > Is this a bug or a feature? 
> 
> It is designed that way. TypeAttribute of those tokens are
> different.
> 
> > Can I avoid it somehow?
> 
> Do you want to split at '-' char no matter what? If yes,
> you can replace all '-' characters with whitespace using
> MappingCharFilter before StandardTokenizer. 
> 
> 
>       
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 
> 



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: Strange behaviour of StandardTokenizer

Reply via email to