Hi Robert, so, would you expect a Tokenizer to consider '/' in both cases as a separate Token?
Personally, I see no problem if the Tokenizer did the following job: "C/C++" ==> TokenStream of { "C", "/", "C", "+", "+" }, with the downstream TokenFilters then coming up with the "C" and "C++" tokens.

Similarly: "SAP R/3" ==> TokenStream of { "SAP", "R", "/", "3" }, later yielding { "SAP", "R", "/", "3", "R/3", "SAP R/3" }.

I try to follow the principle that a token (or its lexeme) should normally never be parsed again. One can build more complex (compound) things from tokens, but one usually never chops a lexeme into smaller pieces.

What do you think, Robert?

Regards,
Valery
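
To make the idea concrete, here is a minimal sketch in plain Java. It deliberately does not use the Lucene API: the class `FineGrainedTokens`, the methods `tokenize` and `compounds`, the dictionary of known compounds, and the `maxSpan` limit are all hypothetical names chosen for illustration. The first stage emits the finest-grained tokens and never looks inside them again; the second stage only joins neighbouring tokens into larger terms, using the character offsets to recover the original text between them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class FineGrainedTokens {

    /** A token together with its character offsets in the original input. */
    public static final class Token {
        public final String text;
        public final int start; // inclusive
        public final int end;   // exclusive

        public Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Stage 1: letter/digit runs become one token, every other non-space char is its own token. */
    public static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // whitespace only separates tokens, it is never emitted
                continue;
            }
            int start = i;
            if (Character.isLetterOrDigit(c)) {
                while (i < input.length() && Character.isLetterOrDigit(input.charAt(i))) {
                    i++;
                }
            } else {
                i++; // a symbol such as '/' or '+' is a token of length one
            }
            tokens.add(new Token(input.substring(start, i), start, i));
        }
        return tokens;
    }

    /**
     * Stage 2: build compounds from runs of 2..maxSpan neighbouring tokens.
     * A run is kept only if its original text is in the (assumed) dictionary.
     */
    public static List<String> compounds(String input, List<Token> tokens,
                                         Set<String> known, int maxSpan) {
        List<String> found = new ArrayList<>();
        for (int from = 0; from < tokens.size(); from++) {
            int limit = Math.min(tokens.size(), from + maxSpan);
            for (int to = from + 1; to < limit; to++) {
                String candidate = input.substring(tokens.get(from).start, tokens.get(to).end);
                if (known.contains(candidate)) {
                    found.add(candidate);
                }
            }
        }
        return found;
    }

    /** Convenience helper: the plain text of each token. */
    public static List<String> texts(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        for (Token t : tokens) {
            out.add(t.text);
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> known = Set.of("C++", "R/3", "SAP R/3"); // illustrative dictionary
        for (String input : new String[] {"C/C++", "SAP R/3"}) {
            List<Token> tokens = tokenize(input);
            System.out.println(texts(tokens) + " + " + compounds(input, tokens, known, 4));
        }
    }
}
```

The dictionary lookup is only one possible way to decide which neighbours to join; a rule based on adjacency (no gap between offsets) would serve equally well for "C++". The point of the sketch is the direction of the pipeline: small tokens first, compounds afterwards, and no stage ever re-splits a token.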