Yes, you could even use the WhitespaceTokenizer and then look for the symbols in a token filter. You would get [you?] as a single token; your job in the token filter is then to store the [?] and return the [you]. The next time the token filter is asked for a token, you return the [?] that you stored previously, before reading anything further from the tokenizer.
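
To make the store-and-return idea concrete, here is a rough sketch in plain Java. It deliberately does not use Lucene's TokenFilter API (whose exact shape depends on your Lucene version); PunctuationSplitter and its split() helper are hypothetical names made up for illustration, and the input is assumed to be tokens already split on whitespace. Only a single trailing ? or ! is peeled off.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical helper, NOT a Lucene class: wraps a stream of
// whitespace-split tokens and emits a trailing '?' or '!' as its own token.
class PunctuationSplitter implements Iterator<String> {
    private final Iterator<String> input;
    private String pending; // punctuation stored for the next call, or null

    PunctuationSplitter(Iterator<String> input) {
        this.input = input;
    }

    @Override
    public boolean hasNext() {
        return pending != null || input.hasNext();
    }

    @Override
    public String next() {
        // A symbol stored on the previous call is returned first.
        if (pending != null) {
            String symbol = pending;
            pending = null;
            return symbol;
        }
        if (!input.hasNext()) {
            throw new NoSuchElementException();
        }
        String token = input.next();
        int last = token.length() - 1;
        // Split only when there is text before the symbol, so a bare "?"
        // passes through unchanged.
        if (last > 0 && (token.charAt(last) == '?' || token.charAt(last) == '!')) {
            pending = token.substring(last);  // store the [?]
            return token.substring(0, last);  // return the [you]
        }
        return token;
    }

    // Convenience wrapper (also hypothetical): run the splitter over a list.
    static List<String> split(List<String> tokens) {
        List<String> out = new ArrayList<>();
        Iterator<String> it = new PunctuationSplitter(tokens.iterator());
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }
}
```

Calling PunctuationSplitter.split on the tokens [how], [are], [you?] should give back [how], [are], [you], [?]. In a real token filter you would also want to carry over offsets and position increments for the split-off symbol, which this sketch ignores.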

If you're already using something that's grammar-based (such as StandardTokenizer) then you could add the "?" to the grammar as a separate token. If you can figure out how to do this from looking at the grammar file, then it's probably the simplest way.

-John

Matthew Hall wrote:
I'd think extending WhitespaceTokenizer would be a good place to start.

Then create a new Analyzer that exactly mirrors your current Analyzer, except that it uses your new tokenizer instead of WhitespaceTokenizer. (I'm assuming, of course, that you are using an Analyzer that already uses WhitespaceTokenizer, but you likely are.)

OBender wrote:
Hi All,

I need to make the ? and ! characters separate tokens, e.g. to split [how are you?] into 4 tokens: [how], [are], [you] and [?]. What would be the best way to do this?

Thanks



