Hi Robert, 

so, would you expect a Tokenizer to consider '/' in both cases as a separate
Token?

Personally, I see no problem if Tokenzer would do the following job:

"C/C++" ==> TokenStream of { "C", "/", "C", "+", "+"} 
and come up with "C" and "C++" tokens after processing through the
downstream tokenfilters.

Similarly:

"SAP R/3" ==> TokenStream of { "SAP", "R", "/", "3"} 
and getting { "SAP", "R", "/", "3", "R/3", "SAP R/3"} later.

I try to follow a spirit that a token (or its lexem) usually should never be
parsed again. One can build  more complex (compound) things from the tokens.
However, usually one never chops a lexem into smaller pieces.

What do you think, Robert?

regards,
Valery

-- 
View this message in context: 
http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25066762.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Reply via email to