Hi Ken, thanks for the comments. Well, Terence's ANTLR was and is a good piece of work.
Do you mean that you used ANTLR to generate a Tokenizer (lexeme parser), or did you go further and use ANTLR to generate higher-level parsers that take over the work of Lucene's TokenFilters? Or maybe even both?

regards,
Valery

Ken Krugler wrote:
>
> Hi Valery,
>
> From our experience at Krugle, we wound up having to create our own
> tokenizers (actually a kind of specialized parser) for the different
> languages. It didn't seem like a good option to try to twist one of
> the existing tokenizers into something that would work well enough. We
> wound up using ANTLR for this.
>
> -- Ken
>
>
> On Aug 20, 2009, at 8:09am, Valery wrote:
>
>> Hi Robert,
>>
>> thanks for the hint.
>>
>> Indeed, a natural way to go. Especially if one builds a Tokenizer of
>> the same level of quality as StandardTokenizer's.
>>
>> OTOH, do you mean that the out-of-the-box stuff is indeed not
>> customizable for this task?
>>
>> regards
>> Valery
>>
>> Robert Muir wrote:
>>>
>>> Valery,
>>>
>>> One thing you could try would be to create a JFlex-based tokenizer,
>>> specifying a grammar with the rules you want.
>>> You could use the source code & grammar of StandardTokenizer as a
>>> starting point.
>>>
>>> On Thu, Aug 20, 2009 at 10:28 AM, Valery<khame...@gmail.com> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I am trying to tune Lucene to respect tokens like C++, C#, .NET
>>>>
>>>> The task is known to the Lucene community, but surprisingly I can't
>>>> google up any good info on it.
>>>>
>>>> Of course, I tried to re-use Lucene's building blocks for a Tokenizer.
>>>> Here we go:
>>>>
>>>> 1) StandardTokenizer -- oh, this option would be just fantastic, but
>>>> "C++, C#, .NET" ends up as "c c net". Too bad.
>>>>
>>>> 2) WhitespaceTokenizer gives me a lot of lexemes that actually should
>>>> have been chopped into smaller pieces. Example: "C/C++" comes out as a
>>>> single lexeme. If I follow this way I end up with "tokenization of
>>>> tokens" -- that sounds a bit odd, doesn't it?
>>>>
>>>> 3) CharTokenizer allows me to make '/' a token-emitting char as well,
>>>> but then '/' gets immediately lost, like the whitespace chars. As a
>>>> result "SAP R/3" ends up as "SAP" "R" "3", and one will need to search
>>>> the original char stream for the "/" char to rebuild the "SAP R/3"
>>>> term as a whole.
>>>>
>>>> Do you see any other relevant building blocks that I missed?
>>>>
>>>> Also, people out there have suggested that such a problem should be
>>>> solved by a synonym dictionary. However, this hint sheds no light on
>>>> which tokenization strategy is more appropriate *before* the synonym
>>>> step.
>>>>
>>>> So, it looks like I have to take the class CharTokenizer as the
>>>> starting point and write my own Tokenizer anew. This Tokenizer should
>>>> also react to delimiting characters and emit the token. However, it
>>>> should distinguish between delimiters like whitespace along with ";,?"
>>>> and delimiters like "./&".
>>>>
>>>> Indeed, delimiters like whitespace and ";,?" should be thrown away
>>>> from the lexeme level, whereas token-emitting characters like "./&"
>>>> should be kept at the lexeme level.
>>>>
>>>> Your comments, gurus?
>>>>
>>>> regards,
>>>> Valery
>>>>
>>>> --
>>>> View this message in context:
>>>> http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25063175.html
>>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>>
>>>
>>> --
>>> Robert Muir
>>> rcm...@gmail.com
>>>
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25063964.html
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>
> --------------------------
> Ken Krugler
> TransPac Software, Inc.
> <http://www.transpac.com>
> +1 530-210-6378
>

--
View this message in context: http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25066540.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
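
The tokenizer outlined in the original question can be illustrated with a minimal plain-Java sketch that does not touch the Lucene API at all. It assumes that whitespace and ";,?" are dropped, that "./&" are emitted as one-character tokens, and that every other character (including '+' and '#') belongs to a word. The class name DelimiterAwareTokenizer and the method tokenize are hypothetical, made up for this illustration; wiring the same logic into a Lucene Tokenizer subclass is not shown.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Lucene.
class DelimiterAwareTokenizer {
    // Assumption: these delimiters (plus whitespace) end a token and are dropped.
    private static final String DISCARDED = ";,?";
    // Assumption: these delimiters end a token and are emitted as tokens themselves.
    private static final String KEPT = "./&";

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean discarded = Character.isWhitespace(c) || DISCARDED.indexOf(c) >= 0;
            boolean kept = KEPT.indexOf(c) >= 0;
            if (discarded || kept) {
                // Any delimiter closes the word collected so far.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
                // Only "kept" delimiters survive as tokens of their own.
                if (kept) {
                    tokens.add(String.valueOf(c));
                }
            } else {
                current.append(c);
            }
        }
        // Flush the last word if the input did not end with a delimiter.
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

Under these assumptions, tokenize("SAP R/3") returns [SAP, R, /, 3] and tokenize("C/C++, C#, .NET") returns [C, /, C++, C#, ., NET]. The "/" stays in the token stream, so a later step (a filter or a synonym dictionary) could rebuild "SAP R/3" from adjacent tokens without going back to the original char stream.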