Hi folks: Further to the discussion on lexer rules that match a sequence which should stop before some multi-character pattern:
I read Kirby's post with interest, including the list discussions pointed to. I'm not sure what to make of it. The oddity to me is that ANTLR *almost* generates the right things:

1. mTokens does the right thing.
2. The lexer rule code that matches/consumes the string in question does look ahead and see the error it would make if it consumed the end-before-this pattern.
3. ANTLR just doesn't generate the code to look ahead and *predict* that it should *stop*, it only looks ahead enough to predict which alternative *might* succeed based on the first character.

Making matters quite odd is that you can fake ANTLR into generating the correct look-ahead, though not completely desirable code, as shown below:

In his last post Gavin recommended how to fix Martin Potier's PURETEXT token rule:

-------------------------
PURETEXT: ( '[' ~'['
          | ']' ~']'
          | ~('\\' | '[' | ']' | '|' | '\n' )
          )+ ;
-------------------------

(I removed mentions of tokens that Martin didn't give a definition for). However, this fails in the manner described above. Instead, the grammar below contains a solution of sorts.

-------------------------
grammar Potier;

links: link+ ;

link: LO PURETEXT ('|' PURETEXT)? LE {System.out.println("Link: " + $link.text); } ;

LO : '[['; // Link opening
LE : ']]'; // Link ending

PURETEXT: ( '[' ~'['
          | ']' ~']'
          | ~('\\' | '[' | ']' | '|' | '\n' )
          )+ ']]' // And delete the match("]]") from gen code
        ;
-------------------------

The only thing I've added is the additional requirement that PURETEXT end with ']]'. This prompts ANTLR to generate LA(2) lookahead prediction code for the ()+ block, and break out on seeing ']]' coming up. Now of course we don't want to include ']]' in PURETEXT, and this can be fixed by editing the match("]]"); out of the generated rule. That results in the desired behavior, but obviously is a silly thing to have to do.
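To make the intended stopping behaviour concrete, here is a rough hand-written sketch of the ()+ loop with two characters of lookahead. It is not the code ANTLR generates; the class name PureTextScanner and the method matchPureText are made up for illustration, and it scans a plain String rather than an ANTLR CharStream. It only shows the prediction I would expect: look at LA(1) and LA(2) together and break out of the loop when ']]' (or '[[') is coming up, without consuming it.

```java
public final class PureTextScanner {

    /**
     * Hypothetical helper: returns the index just past the longest PURETEXT
     * match starting at 'start'. A return value equal to 'start' means the
     * ()+ block matched zero times, i.e. the rule would fail.
     */
    public static int matchPureText(String in, int start) {
        int i = start;
        while (i < in.length()) {
            char la1 = in.charAt(i);
            // LA(2): the character after la1, or -1 at end of input
            int la2 = (i + 1 < in.length()) ? in.charAt(i + 1) : -1;

            if (la1 == '[') {
                // alternative 1: '[' ~'['   (stop before "[[" or a dangling '[')
                if (la2 == '[' || la2 == -1) break;
                i += 2;
            } else if (la1 == ']') {
                // alternative 2: ']' ~']'   (stop before "]]" or a dangling ']')
                if (la2 == ']' || la2 == -1) break;
                i += 2;
            } else if (la1 == '\\' || la1 == '|' || la1 == '\n') {
                // excluded by alternative 3, so the loop ends here
                break;
            } else {
                // alternative 3: any other single character
                i += 1;
            }
        }
        return i;
    }

    public static void main(String[] args) {
        String in = "[[Main Page|home]]";
        int end1 = matchPureText(in, 2);          // scan after the opening "[["
        System.out.println(in.substring(2, end1));
        int end2 = matchPureText(in, end1 + 1);   // scan after the '|'
        System.out.println(in.substring(end1 + 1, end2));
    }
}
```

Note that, as in the grammar, ~'[' and ~']' accept any other character, so an input like "]|" is consumed whole by the second alternative.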
I have to admit that this seems like a problem with the algorithm that ANTLR uses to determine LA(>1) lookahead when generating the lexer code.

-- Graham