Hi Simon,
Simon Willnauer wrote:
>
> Valery, have you tried to use WhitespaceTokenizer / CharTokenizer and
> [...]?!
>
> simon

Yes, I did; please find the details in my initial message. Here are the excerpts:

Valery wrote:
>
> 2) WhitespaceTokenizer gives me a lot of lexemes that should actually
> have been chopped into smaller pieces. Example: "C/C++" comes out as a
> single lexeme. If I follow this route, I end up with "tokenization of
> tokens" -- that sounds a bit odd, doesn't it?
>
> 3) CharTokenizer allows me to add '/' as a token-emitting char, but
> then '/' gets lost immediately, just like the whitespace chars. As a
> result, "SAP R/3" ends up as "SAP" "R" "3", and one would need to
> search the original char stream for the '/' to rebuild the "SAP R/3"
> term as a whole.

regards,
Valery

Simon Willnauer wrote:
>
> Valery, have you tried to [...] and do any further processing in a
> custom TokenFilter?!
>
> simon

Yes, and that is why I sent the initial post "Any Tokenizator friendly to C++, C#, .NET, etc ?" in the first place.

Simon, what do you expect from a Tokenizer? In other words, which job is exclusively the Tokenizer's job and should rather not be done in downstream filters?

regards,
Valery
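P.S. To make the question concrete: below is roughly what I understand your TokenFilter suggestion to look like. It is only an untested sketch against the attribute-based TokenStream API (CharTermAttribute etc. in newer Lucene versions), and SlashSplitFilter is just a name I made up. It passes each token through intact and additionally emits its '/'-separated parts at the same position, so both "R/3" and "R" / "3" stay searchable:

import java.io.IOException;
import java.util.LinkedList;
import java.util.Queue;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

// Hypothetical filter: keeps each incoming token intact and additionally
// emits its '/'-separated parts at the same position, so "R/3" yields
// "R/3", "R" and "3".
public final class SlashSplitFilter extends TokenFilter {

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt =
      addAttribute(PositionIncrementAttribute.class);

  // parts of the current token still waiting to be emitted
  private final Queue<String> pending = new LinkedList<String>();

  public SlashSplitFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!pending.isEmpty()) {
      // emit a queued part; position increment 0 stacks it on the
      // original token (offsets are left as the original's, for brevity)
      termAtt.setEmpty().append(pending.poll());
      posIncAtt.setPositionIncrement(0);
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    String term = termAtt.toString();
    if (term.indexOf('/') >= 0) {
      // queue the parts; they will be emitted on the next calls
      for (String part : term.split("/")) {
        if (part.length() > 0) {
          pending.add(part);
        }
      }
    }
    // pass the original token ("C/C++", "R/3", ...) through first
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pending.clear();
  }
}

Chained after WhitespaceTokenizer, "SAP R/3" would then be indexed as SAP, R/3, R, 3. Whether that amounts to the "tokenization of tokens" I complained about, or is just normal filter work, is exactly what I am asking.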