Hi Michael,
Michael Böckling wrote:
> Hi folks!
>
> The topic says it all: I want to modify the StandardAnalyzer so that it also
> splits words after punctuation characters (.,: etc.) that are NOT followed
> by a whitespace character, in addition to punctuation characters that ARE
> followed by whitespace.
Well, one possibility is to do something simpler. Rather than
modifying StandardAnalyzer, modify the input stream. That is,
substitute spaces for punctuation NOT followed by whitespace
and then just let the analyzer handle the result.
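A minimal sketch of that idea, in case it helps. The class and method
names are made up for illustration and are not part of Lucene, and the
set of punctuation characters is only a guess at what you want to split
on. It replaces a punctuation character with a space only when the next
character is not whitespace, and leaves everything else alone:

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not a Lucene class): cleans text before it reaches the analyzer. */
final class PunctuationSpacer {
    // One of . , : ; ! ? that is immediately followed by a non-whitespace character.
    // The lookahead (?=\S) checks the next character without consuming it.
    private static final Pattern PUNCT_NO_SPACE = Pattern.compile("[.,:;!?](?=\\S)");

    private PunctuationSpacer() {}

    /** Replaces each matching punctuation character with a single space. */
    static String addSpaces(String text) {
        return PUNCT_NO_SPACE.matcher(text).replaceAll(" ");
    }
}
```

For example, PunctuationSpacer.addSpaces("foo.bar,baz: qux") gives
"foo bar baz: qux". Note that this blunt version would also break up
things like "3.14" or "www.example.com", so the character class would
probably need tuning for real data.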
For that matter, if you're going to alter the input stream
bef
It won't do what I need. I may have something like:
"All-In-One is located in 92226-4446 and has an E-A-R"
I want it to be tokenized as follows:
all
one
located
92226
4446
E-A-R
Right now... it is tokenizing it as this:
all
one
located
92226-4446
E-A-R
That's the type of information you
-Original Message-
From: Erick Erickson [mailto:[EMAIL PROTECTED]
Sent: Thursday, January 11, 2007 6:11 PM
To: java-user@lucene.apache.org
Subject: Re: Modifying StandardAnalyzer
Would it be simpler just to modify the input with a regex rather than risk
messing with StandardAnalyzer? Or wouldn't that do what you need?
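For reference, a rough sketch of what such a regex pre-processing step
might look like for the zip code case. The class and method names are
hypothetical, and it assumes the only goal is to break a hyphen that
sits directly between two digits, leaving hyphens between letters
untouched. It says nothing about how StandardAnalyzer will then treat
the rest of the text:

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not a Lucene class): splits digit-hyphen-digit runs before analysis. */
final class ZipHyphenSplitter {
    // A hyphen with a digit immediately before it and a digit immediately after it.
    // Lookbehind and lookahead leave the digits themselves in place.
    private static final Pattern HYPHEN_BETWEEN_DIGITS = Pattern.compile("(?<=\\d)-(?=\\d)");

    private ZipHyphenSplitter() {}

    /** Replaces each such hyphen with a single space. */
    static String splitDigitHyphens(String text) {
        return HYPHEN_BETWEEN_DIGITS.matcher(text).replaceAll(" ");
    }
}
```

On "All-In-One is located in 92226-4446 and has an E-A-R" this returns
"All-In-One is located in 92226 4446 and has an E-A-R". The same rule
would also split phone numbers and other digit-hyphen-digit strings,
which may or may not be wanted.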
On 1/11/07, Van Nguyen <[EMAIL PROTECTED]> wrote:
Hi,
I need to modify the StandardAnalyzer so that it will tokenize zip codes
that look like this:
I would try adding this (or your regex)
| (("-"
)|())
between the EMAIL and HOST line or something,
And change this:
org.apache.lucene.analysis.Token next() throws IOException :
{
  Token token = null;
}
{
  ( token = |
    token = |
    token = |
    token = |
    token = |
    token = |