Hi Simon,

Simon Willnauer wrote:
> 
> Valery, have you tried to use whitespaceTokenizer / CharTokenizer and
> [...]?!
> 
> simon
> 

Yes, I did -- please see the details in my initial message. Here are
the excerpts:


Valery wrote:
> 
>   2) WhitespaceTokenizer gives me a lot of lexemes that should actually
> have been chopped into smaller pieces. Example: "C/C++" comes out as a
> single lexeme. If I follow this route, I end up with "tokenization of
> tokens" -- and that sounds a bit odd, doesn't it?
> 
>   3) CharTokenizer lets me add '/' as a token-emitting char, but then
> '/' gets lost immediately, just like the whitespace chars. As a result,
> "SAP R/3" ends up as "SAP" "R" "3", and one would have to search the
> original char stream for the '/' char to rebuild the term "SAP R/3" as
> a whole.
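
To make point 2 concrete, here is a minimal, self-contained demo
(written against the attribute-based TokenStream API of recent Lucene
versions -- class names and packages may differ in older releases; the
field name "f" and the sample text are arbitrary):

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class WhitespaceDemo {
  public static void main(String[] args) throws IOException {
    try (Analyzer analyzer = new WhitespaceAnalyzer();
         TokenStream ts =
             analyzer.tokenStream("f", "knows SAP R/3 and C/C++")) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        // prints: knows, SAP, R/3, and, C/C++ -- note that "R/3" and
        // "C/C++" come out as single tokens that still need splitting
        System.out.println(term.toString());
      }
      ts.end();
    }
  }
}

Making '/' a non-token char in a CharTokenizer subclass would split
these, but then the '/' itself is discarded -- which is exactly point 3.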

regards,
Valery



Simon Willnauer wrote:
> 
> Valery, have you tried to [...] and do any further processing in a  custom
> TokenFilter?!
> simon
> 

Yes, and that's why I sent the initial post, "Any Tokenizator friendly
to C++, C#, .NET, etc ?"

Simon, what do you expect from the Tokenizer?
(In other words, which job belongs exclusively to the Tokenizer and
should not be done in downstream filters?)
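
For instance, as I understand the suggestion, the filter route would
look roughly like the sketch below (SlashSplitFilter is a name I made
up: it re-splits tokens on '/', emits the unchanged original token
first, then the parts at the same position; offsets are left untouched
for brevity):

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public final class SlashSplitFilter extends TokenFilter {
  private final CharTermAttribute termAtt =
      addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt =
      addAttribute(PositionIncrementAttribute.class);
  private final Deque<String> pending = new ArrayDeque<>();

  public SlashSplitFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!pending.isEmpty()) {
      // Emit a buffered part at the same position as the original
      // token; offsets still point at the whole original token.
      termAtt.setEmpty().append(pending.poll());
      posIncAtt.setPositionIncrement(0);
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    String term = termAtt.toString();
    if (term.indexOf('/') > 0) {
      // Queue the parts; the unchanged original token ("C/C++")
      // is emitted first, the parts ("C", "C++") follow.
      for (String part : term.split("/")) {
        if (!part.isEmpty()) {
          pending.add(part);
        }
      }
    }
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pending.clear();
  }
}

This works, but it is literally the "tokenization of tokens" I
mentioned: the filter ends up re-doing what a tokenizer normally does,
hence my question about where the boundary is supposed to be.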

regards, 
Valery


