Oops! I meant to reply to the list, thanks for pointing out I failed to do that.
On Mon, Oct 12, 2009 at 9:45 PM, Graham Wideman <gwl...@grahamwideman.com> wrote: > Hi folks: > > I've been scouring the list archives trying to scrape together a better > understanding of what variants of lexer code ANTLR might generate. > > I keep seeing assertions that ANTLR lexers use longest match to select > between otherwise ambiguous alternatives, and other assertions that it > doesn't. > > In the code I see generated, I don't see any longest match logic, but maybe > it only appears under particluar grammar arrangements or options? > > Or are people using "longest match" when they really mean "once mToken has > chosen a rule, then that rule will act greedy" (unless greedy=false)? > > I'd appreciate if someone could clarify with an example that shows where > longest match comes into play. Or confirm that ANTLR actually doesn't use > that strategy? > I believe it does use longest match, it can easily handle: OPEN_TABLE: '{|'; OPEN_TEMPLATE: '{{'; OPEN_FUNC: '{{{'; OPEN_CURLY: '{'; It'll handle all of those without issue and '{{{' will generate OPEN_FUNC not OPEN_TEMPLATE followed by OPEN_CURLY, or 3 OPEN_CURLY's. In fact, I don't think you even have to put them in the "proper" order. The big difference here is that there is no series of characters where in the middle, it is invalid. That is the problem with most of these. Folks expect that the lexer will backout to the last valid point if there is a point where it could have previously exited cleanly. That just isn't the case. The lexer finishes what it starts, or errors out. So don't let it start something unless you know it can successfully finish it. As to another threads assertion that the lexer is LL(1), that is not the case. If two lexer rules match for 20 characters, the lexer will look ahead 21 character to identify which rule it is. As long as it is not LL ambiguous, I believe it's no big deal. To prove that try doing: FA: .. 'a'; FB: .. 'b'; F_OTHER: .. ~('a'|'b'); start: (FA|FB|F_OTHER)+ EOF; Put in 'ffa', 'ffb' 'ffc' and that should return FA, FB, F_OTHER respectively, and in this case, it had to look ahead at least 3 characters to determine which rule it is. It'll blow up if you don't have a multiple of 3 characters, but that is a different problem (starting something it can't finish). So if you have rules like: TILDA: '~' T3: '~~~'; T4: '~~~~'; T5: '~~~~~'; If you give it input of '~', '~~~', '~~~~', '~~~~~', and '~~~~~~', it will do what you expect (TILDA, T3, T4, T5, and T5 followed by TILDA respectively). However, the input of '~~' will screw up the lexer and cause an exception. I believe this is because each rule is thrown out once it no longer matches. So on the first '~', the DFA knows that all 4 rules are potential outcomes. Once it sees the second '~', it knows that only T3, T4, or T5 are valid inputs. The next input is EOF, which isn't a valid option for any of the existing live rules, so the lexer punts an error/exception. You can easily use lookahead to make T3, T4, T5 invalid rules if there aren't enough TILDA's prior to matching the first character, options T3, T4, T5 are eliminated by the lookahead, TILDA is the only valid option. Upon the second '~', it doesn't match TILDA any longer and TILDA was the last valid rule, so emit TILDA (had there been multiple valid rules, it would have emitted the first one listed). So emit one TILDA and start lexing with the second '~' character. The problem is when the lexer is "mid-rule", and then has a mismatch. It does not go back to the last valid match at a boundary (in this case above with no look ahead predicates, it would not fall back to a single '~' and emit TILDA). It is not at the end of *ANY* rule when the EOF is the next character from the Lexer. So you aren't at a valid boundary case of any lexer rule, and the next character makes all of the currently valid alternatives fail the lexer errors out. This is also why Joan Pujol's lexer is having issues, in the other thread. The '$' character matches but the following '{' can't be matched, and is not at a valid boundary condition. Thus an exception (the OTHER rule can only match one character, so by the second character it is no longer a valid option, so TEXT was the only option, it started the second alternative, and couldn't finish it). I think if you generate the NFA/DFA's for the lexer, a lot of this becomes more apparent, but I've always had problems. Until sitting down and thinking clearly about this and the problems I've had, the lexer was mostly black magic. Now if you enable backtracking on the lexer, it'll help with some of these problems. I can't say for sure what backtracking will do in the lexer, but I know it does change behavior. Hopefully somebody will correct me if I'm wrong, but with the problems I've seen, and the solutions I've cooked up to work around them, I believe this to be a mostly accurate description. None of this is because I understand LL(*) parsing, it is mostly empirically gained knowledge by screwing with a lexer that has lots of these types of problems. In the end, it's almost always easier to lex single characters if it is ambiguous and you need context to figure it out. Deal with it in the parser, where backtracking works just like you think it does. I ended up throwing away all of the "{{{", "{{", "{", and "{|" in the grammar I was working on and just lexed each "{" and "|" separately, and then parsed them and created imaginary tokens. Cheers, Kirby > Thanks, > > Graham > > List: http://www.antlr.org/mailman/listinfo/antlr-interest > Unsubscribe: > http://www.antlr.org/mailman/options/antlr-interest/your-email-address > List: http://www.antlr.org/mailman/listinfo/antlr-interest Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "il-antlr-interest" group. To post to this group, send email to il-antlr-interest@googlegroups.com To unsubscribe from this group, send email to il-antlr-interest+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/il-antlr-interest?hl=en -~----------~----~----~----~------~----~------~--~---