Hi Robert, thanks for the hint.
Indeed, a natural way to go. Especially if one builds a Tokenizer of the level of quality like StandardTokenizer's. OTOH, you mean that the out-of-the-box stuff is indeed not customizable for this task?.. regards Valery Robert Muir wrote: > > Valery, > > One thing you could try would be to create a JFlex-based tokenizer, > specifying a grammar with the rules you want. > You could use the source code & grammar of StandardTokenizer as a > starting point. > > > On Thu, Aug 20, 2009 at 10:28 AM, Valery<khame...@gmail.com> wrote: >> >> Hi all, >> >> I am trying to tune Lucene to respect such tokens like C++, C#, .NET >> >> The task is known for Lucene community, but surprisingly I can't google >> out >> somewhat good info on it. >> >> Of course, I tried to re-use Lucene's building blocks for Tokenizer. >> Here >> we go: >> >> 1) StandardTokenizer -- oh, this option would be just fantastic, but >> "C++, >> C#, .NET" ends up with "c c net". Too bad. >> >> 2) WhitespaceTokenizer gives me a lot of lexems that are actually should >> have been chopped into smaller pieces. Example: "C/C++" comes out like a >> single lexem. If I follow this way I end-up with "Tokenization of tokens" >> -- >> that sounds a bit odd, doesn't it? >> >> 3) CharTokenizer allows me to add the '/' to be also a token-emitting >> char, but then '/' gets immediately lost like those whitespace chars. In >> result "SAP R/3" ends up with "SAP" "R" "3" and one will need to search >> the >> original char stream for the "/" char to re-build "SAP R/3" term as a >> whole. >> >> Do you see any other relevant building blocks missed by me? >> >> Also, people around there have meant that such problem should be solved >> by a >> synonym dictionary. However this hint sheds no light on which >> tokenization >> strategy should be more appropriate *before* the synonym step. >> >> So, it looks like I have to take the class CharTokenizer as for the >> starting >> point and write anew my own Tokenizer. This Tokenizer should also react >> on >> delimiting characters and emit the token. However, it should distinguish >> between delimiters like whitespaces along with ";,?" and the delimiters >> like >> "./&". >> >> Indeed, the delimiters like whitespaces and ";,?" should be thrown away >> from >> Lexem level, >> whereas the token emitting characters like "./&" should be kept in Lexem >> level. >> >> Your comments, gurus? >> >> regards, >> Valery >> >> -- >> View this message in context: >> http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25063175.html >> Sent from the Lucene - Java Users mailing list archive at Nabble.com. >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> > > > > -- > Robert Muir > rcm...@gmail.com > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > -- View this message in context: http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25063964.html Sent from the Lucene - Java Users mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org