RE: Inconsistent tokenizing of words containing underscores.

Aigner, Thomas Mon, 29 Aug 2005 15:12:32 -0700

What seems to be working for me is a punctuation filter that strips
characters such as / - _ and emits the token without them.  Then "most"
of the time the word XYZZZY_DE_SA0001 will be tokenized as
XYZZZYDESA0001.  For this to work, you have to run the same punctuation
filter over your query strings before you search for them.
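
A rough sketch of the idea in plain Java, in case it helps.  This is not
a Lucene TokenFilter and not my actual code; the class and method names
(PunctuationNormalizer, normalize) are made up for illustration, and it
assumes that dropping all ASCII punctuation is acceptable for your data.
The point is only that the same function is applied on the indexing side
and on the query side.

```java
public class PunctuationNormalizer {

    // Hypothetical helper: removes every ASCII punctuation character
    // (including '/', '-' and '_') from a term. \p{Punct} is the POSIX
    // punctuation class supported by java.util.regex.
    public static String normalize(String term) {
        return term.replaceAll("\\p{Punct}", "");
    }

    public static void main(String[] args) {
        // Index side: normalize the term before it is stored.
        String indexed = normalize("XYZZZY_DE_SA0001");

        // Query side: normalize the user's input the same way,
        // otherwise the two forms will not match.
        String query = normalize("XYZZZY_DE_SA0001");

        System.out.println(indexed);
        System.out.println(indexed.equals(query));
    }
}
```

In a real setup the same logic would sit inside whatever analyzer is
used for both indexing and query parsing, so the two sides cannot drift
apart.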


Tom

-----Original Message-----
From: Daniel Naber [mailto:[EMAIL PROTECTED] 
Sent: Monday, August 29, 2005 3:15 PM
To: java-user@lucene.apache.org
Subject: Re: Inconsistent tokenizing of words containing underscores.

On Monday 29 August 2005 19:21, Jeremy Meyer wrote:

> The expected behavior is to sometimes treat a character as indicating a
> new token and other times to ignore the same character?

It depends on whether there are digits in the token.  It's documented in
the javacc source for the tokenizer(?).

Regards
 Daniel

-- 
http://www.danielnaber.de

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


