On Fri, Aug 21, 2009 at 10:26 AM, Valery <khame...@gmail.com> wrote:
>
> Hi Simon,
>
> Simon Willnauer wrote:
>>
>> Valery, have you tried to use WhitespaceTokenizer / CharTokenizer and
>> [...]?!
>>
>> simon
>>
>
> Yes, I did; please find the info in the initial message. Here are the
> excerpts:
>
> Valery wrote:
>>
>> 2) WhitespaceTokenizer gives me a lot of lexemes that actually should
>> have been chopped into smaller pieces. Example: "C/C++" comes out as a
>> single lexeme. If I follow this way, I end up with "tokenization of
>> tokens" -- that sounds a bit odd, doesn't it?
>>
>> 3) CharTokenizer allows me to make '/' a token-emitting char as well, but
>> then '/' gets immediately lost, just like the whitespace chars. As a
>> result, "SAP R/3" ends up as "SAP" "R" "3", and one would need to search
>> the original char stream for the '/' char to rebuild the "SAP R/3" term
>> as a whole.
>
> Regarding this part:
>
> Simon Willnauer wrote:
>>
>> Valery, have you tried to [...] and do any further processing in a
>> custom TokenFilter?!
>> simon
>>
>
> Yes, and that's why I sent the initial post "Any Tokenizator friendly
> to C++, C#, .NET, etc ?"
> Actually, I am a bit confused about doing a Tokenizer's job in filters and
> re-parsing the char stream.
>
> Simon, what do you expect from the Tokenizer?
> (In other words, what job is exclusively the Tokenizer's, and should
> rather not be done in downstream filters?)
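Point 3 above refers to CharTokenizer's isTokenChar() hook, which decides
whether a character belongs to a token or terminates it; terminating
characters are dropped. A minimal sketch of what Valery describes, assuming
the Lucene 2.9-era char-based API (later versions use int code points) and a
hypothetical class name:

import java.io.Reader;

import org.apache.lucene.analysis.CharTokenizer;

// Hypothetical subclass illustrating point 3: treating '/' as a
// token-terminating char splits "SAP R/3" into "SAP", "R", "3",
// and the '/' itself is never emitted as a token.
public class SlashSplittingTokenizer extends CharTokenizer {

  public SlashSplittingTokenizer(Reader in) {
    super(in);
  }

  @Override
  protected boolean isTokenChar(char c) {
    // A character is part of a token unless it is whitespace or '/'.
    return !Character.isWhitespace(c) && c != '/';
  }
}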
Well, Tokenizer and TokenFilter are both subclasses of TokenStream; they
differ in where their input comes from. A Tokenizer gets its input from a
Reader and creates tokens from it. A TokenFilter consumes the tokens created
by a Tokenizer (or by another filter) and does further processing.

For instance, an Analyzer that uses a WhitespaceTokenizer as the input of a
LowerCaseFilter would produce the following:

Input: C# or .NET

WhitespaceTokenizer:
[Tokenstring: "C#";   offset: 0->2;  pos: 1]
[Tokenstring: "or";   offset: 3->5;  pos: 2]
[Tokenstring: ".NET"; offset: 6->10; pos: 3]

LowerCaseFilter:
[Tokenstring: "c#";   offset: 0->2;  pos: 1]
[Tokenstring: "or";   offset: 3->5;  pos: 2]
[Tokenstring: ".net"; offset: 6->10; pos: 3]

If you want to do any further processing with those tokens, you can add your
own TokenFilter and modify the tokens as you need. You could do the whole job
in a Tokenizer, but that would not be a good separation of concerns, right?

simon

> regards,
> Valery
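To make that chain concrete, here is a minimal sketch of such an Analyzer,
assuming the Lucene 2.9-era analysis API (constructors and attribute classes
were renamed in later releases); MyAnalyzer and PassThroughFilter are
hypothetical names used only for illustration:

import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class MyAnalyzer extends Analyzer {

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // Tokenizer: reads the character stream and emits "C#", "or", ".NET"
    TokenStream stream = new WhitespaceTokenizer(reader);
    // TokenFilter: consumes those tokens and lowercases them
    stream = new LowerCaseFilter(stream);
    // Any further processing is layered on as another filter
    stream = new PassThroughFilter(stream);
    return stream;
  }

  /** Skeleton filter showing where custom token rewriting would go. */
  static final class PassThroughFilter extends TokenFilter {
    private final TermAttribute termAtt = addAttribute(TermAttribute.class);

    PassThroughFilter(TokenStream input) {
      super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
      if (!input.incrementToken()) {
        return false; // upstream tokenizer/filter is exhausted
      }
      // Inspect or rewrite the current token here, e.g.:
      // termAtt.setTermBuffer(rewrite(termAtt.term()));
      return true;
    }
  }
}

The skeleton filter passes tokens through unchanged; the point is only that
the tokenizer decides token boundaries, while case folding and any
domain-specific rewriting live in filters further down the chain.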