Re: Strange behaviour of StandardTokenizer

Anna Hunecke Mon, 21 Jun 2010 02:03:52 -0700

Hi!

Basically, what I want is something that removes punctuation. 
But I realized now that things like email or number recognition are also very 
useful if I want to give suggestions. I want to be able to give 'nl-lt001' as a 
suggestion when the user enters 'nl'. This would of course not be possible if 
the tokenizer just blindly splits at the '-'. 
So, I'll stick with the tokenizer for now and fix the problems I had with the 
splitting of words by building the queries differently.
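
To make the suggestion use case concrete, here is a minimal sketch of prefix-based suggestions over whole, unsplit terms. It uses only the Java standard library, not Lucene; the class name PrefixSuggester and its methods are made up for illustration, and the terms are just sample data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper, not part of Lucene: suggests whole terms by prefix.
class PrefixSuggester {
    // Sorted set, so all terms sharing a prefix sit next to each other.
    private final TreeSet<String> terms = new TreeSet<>();

    void add(String term) {
        terms.add(term);
    }

    // Returns up to max terms that start with the given prefix.
    List<String> suggest(String prefix, int max) {
        List<String> result = new ArrayList<>();
        // tailSet(prefix, true) starts at the first term >= prefix.
        for (String term : terms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix) || result.size() >= max) {
                break;
            }
            result.add(term);
        }
        return result;
    }

    public static void main(String[] args) {
        PrefixSuggester suggester = new PrefixSuggester();
        suggester.add("nl-lt001");
        suggester.add("nl-lt002");
        suggester.add("de-at001");
        // Prints [nl-lt001, nl-lt002]
        System.out.println(suggester.suggest("nl", 10));
    }
}
```

This only works if 'nl-lt001' survives as a single term, which is why blindly splitting at '-' would not help here.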
Thanks for your help!


- Anna

--- Simon Willnauer <simon.willna...@googlemail.com> wrote on Fri, 18 Jun 2010:

> From: Simon Willnauer <simon.willna...@googlemail.com>
> Subject: Re: Strange behaviour of StandardTokenizer
> To: java-user@lucene.apache.org
> Date: Friday, 18 June 2010, 09:52
> Hi Anna,
> 
> What are you using your tokenizer for? There are a lot of different
> options in Lucene, and StandardTokenizer is not necessarily the best
> one. The behaviour you are seeing is that the tokenizer detects your
> token as a number. When you look at the grammar, that is kind of obvious.
> 
> <snip>
> // floating point, serial, model numbers, ip addresses, etc.
> // every other segment must have at least one digit
> NUM        = ({ALPHANUM} {P} {HAS_DIGIT}
>            | {HAS_DIGIT} {P} {ALPHANUM}
>            | {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+
>            | {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
>            | {ALPHANUM} {P} {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
>            | {HAS_DIGIT} {P} {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+)
> 
> // punctuation
> P          = ("_"|"-"|"/"|"."|",")
> 
> </snip>
> 
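
The effect of the NUM rule quoted above can be approximated with a plain regular expression. This is only a sketch under stated assumptions, not the tokenizer's actual code: ALPHANUM is assumed to be one or more ASCII letters or digits, HAS_DIGIT is assumed to be such a run containing at least one digit, and the class name NumRuleSketch is made up. The real grammar also covers non-ASCII letters.

```java
import java.util.regex.Pattern;

// Hypothetical ASCII-only approximation of the NUM rule, for illustration.
class NumRuleSketch {
    // Assumption: ALPHANUM is a run of ASCII letters/digits.
    static final String A = "[A-Za-z0-9]+";
    // Assumption: HAS_DIGIT is such a run with at least one digit in it.
    static final String H = "[A-Za-z0-9]*[0-9][A-Za-z0-9]*";
    // Punctuation, as in the quoted grammar: _ - / . ,
    static final String P = "[_\\-/.,]";

    // The six alternatives of NUM, in the same order as the grammar.
    static final Pattern NUM = Pattern.compile(
              A + P + H
            + "|" + H + P + A
            + "|" + A + "(?:" + P + H + P + A + ")+"
            + "|" + H + "(?:" + P + A + P + H + ")+"
            + "|" + A + P + H + "(?:" + P + A + P + H + ")+"
            + "|" + H + P + A + "(?:" + P + H + P + A + ")+");

    // True if the whole term would be kept together as one NUM token.
    static boolean isNum(String term) {
        return NUM.matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(isNum("nl-lt0")); // true: 'lt0' contains a digit
        System.out.println(isNum("nl-lt"));  // false: no segment has a digit
    }
}
```

Because 'nl-lt' has no digit in either segment, it does not match NUM, so the two parts come out as separate tokens and the '-' is dropped.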
> You can either build your own custom filter which fixes only the
> problem with numbers containing a '-', use the MappingCharFilter, or
> switch to a different tokenizer.
> If you could tell us more about your use case, you might get
> better suggestions.
> 
> Simon
> 
> On Fri, Jun 18, 2010 at 9:03 AM, Anna Hunecke <annahune...@yahoo.de> wrote:
> > Hi Ahmet,
> > thanks for the explanation. :)
> > Okay, so it is recognized as a number? I didn't really expect that.
> > I would expect all words to be either split at the minus or not.
> > Maybe I'll have to use another tokenizer.
> > Best,
> > Anna
> >
> > --- Ahmet Arslan <iori...@yahoo.com> wrote on Thu, 17 Jun 2010:
> >
> >> From: Ahmet Arslan <iori...@yahoo.com>
> >> Subject: Re: Strange behaviour of StandardTokenizer
> >> To: java-user@lucene.apache.org
> >> Date: Thursday, 17 June 2010, 15:50
> >>
> >> > I ran into a strange behaviour of the StandardTokenizer.
> >> > Terms containing a '-' are tokenized differently depending
> >> > on the context.
> >> > For example, the term 'nl-lt' is split into 'nl' and 'lt'.
> >> > The term 'nl-lt0' is tokenized into 'nl-lt0'.
> >> > Is this a bug or a feature?
> >>
> >> It is designed that way. The TypeAttribute values of those tokens
> >> are different.
> >>
> >> > Can I avoid it somehow?
> >>
> >> Do you want to split at the '-' char no matter what? If yes,
> >> you can replace all '-' characters with whitespace using
> >> MappingCharFilter before StandardTokenizer.
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-user-h...@lucene.apache.org


