Re: Modifying StandardAnalyzer so that it also splits words after punctuation characters that are not followed by whitespace

2007-05-29 Thread Steven Rowe
Hi Michael, Michael Böckling wrote: > Hi folks! > > The topic says it all: I want to modify the StandardAnalyzer so that it also > splits words after punctuation characters (.,: etc.) that are NOT followed > by a whitespace character, in addition to punctuation characters that ARE > followed by w

Re: Modifying StandardAnalyzer so that it also splits words after punctuation characters that are not followed by whitespace

2007-05-29 Thread Erick Erickson
Well, one possibility is to do something simpler. Rather than modifying StandardAnalyzer, modify the input stream. That is, substitute spaces for punctuation NOT followed by whitespace and then just let the analyzer handle the result. For that matter, if you're going to alter the input stream bef
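
A minimal sketch of the pre-processing idea described above, assuming the text is available as a String before it reaches the analyzer. The class name PunctuationSpacer, the method name spacePunctuation, and the exact punctuation set are illustrative assumptions, not part of Lucene or of the original message:

```java
import java.util.regex.Pattern;

// Hypothetical helper (not a Lucene class): rewrites the input text
// before it is handed to an analyzer.
final class PunctuationSpacer {

    // Matches one punctuation character that is immediately followed by a
    // non-whitespace character. The lookahead (?=\S) does not consume the
    // following character, and it fails at end of input, so trailing
    // punctuation is left untouched. The punctuation set is an assumption.
    private static final Pattern PUNCT_NOT_BEFORE_SPACE =
            Pattern.compile("[.,:;!?](?=\\S)");

    private PunctuationSpacer() {
    }

    // Replaces each matching punctuation character with a single space.
    static String spacePunctuation(String text) {
        return PUNCT_NOT_BEFORE_SPACE.matcher(text).replaceAll(" ");
    }
}
```

With this sketch, spacePunctuation("foo.bar, baz:qux") yields "foo bar, baz qux". Note that a pattern this broad would also break up decimal numbers and host names, so the punctuation set probably needs narrowing for real data.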

Re: Modifying StandardAnalyzer

2007-01-12 Thread Mark Miller
kenizing it as this: all one located 92226-4446 E-A-R -Original Message- From: Erick Erickson [mailto:[EMAIL PROTECTED] Sent: Thursday, January 11, 2007 6:11 PM To: java-user@lucene.apache.org Subject: Re: Modifying StandardAnalyzer Would it be simpler just to modify the input with

Re: Modifying StandardAnalyzer

2007-01-12 Thread Mark Miller
It won't do what I need. I may have something like: "All-In-One is located in 92226-4446 and has an E-A-R" I want it to be tokenized as follows: all one located 92226 4446 E-A-R Right now, it is tokenizing it as this: all one located 92226-4446 E-A-R That's the type of information you

RE: Modifying StandardAnalyzer

2007-01-12 Thread Van Nguyen
ssage- From: Erick Erickson [mailto:[EMAIL PROTECTED] Sent: Thursday, January 11, 2007 6:11 PM To: java-user@lucene.apache.org Subject: Re: Modifying StandardAnalyzer Would it be simpler just to modify the input with a regex rather than risk messing with StandardAnalyzer? Or wouldn't that do

Re: Modifying StandardAnalyzer

2007-01-11 Thread Erick Erickson
Would it be simpler just to modify the input with a regex rather than risk messing with StandardAnalyzer? Or wouldn't that do what you need? On 1/11/07, Van Nguyen <[EMAIL PROTECTED]> wrote: Hi, I need to modify the StandardAnalyzer so that it will tokenize zip codes that look like this:
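
A rough sketch of the regex pre-processing suggested above, assuming the zip codes in question are ZIP+4 values such as 92226-4446 (the example used elsewhere in this thread). The class name ZipCodeSplitter and the method name splitZipPlusFour are hypothetical; this only rewrites the input String and does not touch StandardAnalyzer:

```java
import java.util.regex.Pattern;

// Hypothetical helper (not a Lucene class): splits ZIP+4 codes in the
// raw text so that a downstream analyzer sees two separate numbers.
final class ZipCodeSplitter {

    // Five digits, a hyphen, four digits, bounded by word boundaries so
    // that longer digit runs (e.g. 123456-7890) are not matched.
    // The ZIP+4 shape is an assumption about the data.
    private static final Pattern ZIP_PLUS_FOUR =
            Pattern.compile("\\b(\\d{5})-(\\d{4})\\b");

    private ZipCodeSplitter() {
    }

    // Replaces the hyphen between the two digit groups with a space.
    static String splitZipPlusFour(String text) {
        return ZIP_PLUS_FOUR.matcher(text).replaceAll("$1 $2");
    }
}
```

Applied to "All-In-One is located in 92226-4446 and has an E-A-R", this turns 92226-4446 into 92226 4446 and leaves the other hyphenated terms for the analyzer to handle as before.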

Re: Modifying StandardAnalyzer

2007-01-11 Thread Mark Miller
I would try adding this (or your regex) | (("-" )|()) between the EMAIL and HOST line or something, And change this: org.apache.lucene.analysis.Token next() throws IOException : { Token token = null; } { ( token = | token = | token = | token = | token = | token = |