[il-antlr-interest: 26215] [antlr-interest] More on Lexer 2-char seq handling

Graham Wideman Mon, 12 Oct 2009 17:33:43 -0700

Hi folks:

Further to the discussion on getting a lexer rule to match a sequence that 
should stop before some multi-character pattern:


I read Kirby's post with interest, including the list discussions it pointed 
to.  I'm not sure what to make of it.  The oddity to me is that ANTLR *almost* 
generates the right things:

1. mTokens does the right thing.

2. The lexer rule code that matches/consumes the string in question does look 
ahead and see the error it would make if it consumed the end-before-this 
pattern.

3. ANTLR just doesn't generate the code to look ahead and *predict* that it 
should *stop*; it only looks ahead far enough to predict which alternative 
*might* succeed based on the first character.
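
To make point 3 concrete, here is a hand-written Java sketch of the two 
prediction strategies for the PURETEXT loop. It is not ANTLR's generated code: 
the class LookaheadSketch, its method names, and the use of -1 to stand in for 
a reported mismatch are all assumptions made for illustration.

```java
public class LookaheadSketch {
    static final int EOF = -1;

    // k-th character of lookahead from pos (k = 1 is the current character).
    static int la(String s, int pos, int k) {
        int i = pos + k - 1;
        return i < s.length() ? s.charAt(i) : EOF;
    }

    // Third alternative: ~('\\' | '[' | ']' | '|' | '\n')
    static boolean isPlain(int c) {
        return c != EOF && c != '\\' && c != '[' && c != ']' && c != '|' && c != '\n';
    }

    // Picks an alternative from the first character only. A bracket commits
    // the loop to '[' ~'[' or ']' ~']', so a doubled bracket is discovered
    // too late. Returns the end offset, or -1 to stand in for the mismatch.
    static int scanOneChar(String s, int pos) {
        int start = pos;
        while (true) {
            int c = la(s, pos, 1);
            if (c == '[' || c == ']') {
                int next = la(s, pos, 2);
                if (next == EOF || next == c) return -1; // already committed
                pos += 2;
            } else if (isPlain(c)) {
                pos += 1;
            } else {
                break;
            }
        }
        return pos > start ? pos : -1;
    }

    // Same loop, but the second character takes part in the prediction, so
    // the loop exits cleanly before a doubled bracket.
    static int scanTwoChar(String s, int pos) {
        int start = pos;
        while (true) {
            int c = la(s, pos, 1);
            if (c == '[' || c == ']') {
                int next = la(s, pos, 2);
                if (next == EOF || next == c) break; // predict "stop here"
                pos += 2;
            } else if (isPlain(c)) {
                pos += 1;
            } else {
                break;
            }
        }
        return pos > start ? pos : -1;
    }
}
```

On the input abc]] the first method commits to the ']' ~']' alternative and 
fails, while the second stops after abc and leaves ]] for the LE rule.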

What makes matters odder still is that you can trick ANTLR into generating the 
correct look-ahead, though the resulting code is not entirely desirable, as 
shown below:

In his last post Gavin recommended the following fix for Martin Potier's 
PURETEXT token rule:
-------------------------
PURETEXT: 
  ( 
      '[' ~'['
    | ']' ~']'
    | ~('\\' | '[' | ']' | '|' | '\n' )
  )+ 
 ;
-------------------------
(I removed mentions of tokens that Martin didn't give a definition for).

However, this fails in the manner described above.  Instead, the grammar below 
contains a solution of sorts.

-------------------------
grammar Potier;

links: link+
 ;

link:
  LO PURETEXT ('|' PURETEXT)? LE 
    {System.out.println("Link: " + $link.text); }
  ;

LO      : '[[';         // Link opening
LE      : ']]';         // Link ending

PURETEXT: 
  ( 
      '[' ~'['
    | ']' ~']'
    | ~('\\' | '[' | ']' | '|' | '\n' )
  )+ 
  ']]'   // And delete the match("]]") from gen code
 ;
-------------------------

The only thing I've added is the additional requirement that PURETEXT end with 
']]'. This prompts ANTLR to generate LA(2) lookahead prediction code for the 
()+ block, and break out on seeing ']]' coming up.  

Now of course we don't want to include ']]' in PURETEXT, and this can be fixed 
by editing the match("]]"); out of the generated rule.

That results in the desired behavior, but is obviously a silly thing to have 
to do.
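
For reference, the desired behavior can be sketched as a small hand-written 
tokenizer in Java. This is only an illustration of the token stream the 
grammar is meant to produce, not the lexer ANTLR generates; the class 
LinkTokenSketch, its helpers, and the string form of the tokens are all 
hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class LinkTokenSketch {
    // Third PURETEXT alternative: ~('\\' | '[' | ']' | '|' | '\n')
    static boolean isPlain(char c) {
        return c != '\\' && c != '[' && c != ']' && c != '|' && c != '\n';
    }

    // End offset of a PURETEXT run starting at pos; equals pos if nothing
    // matches. A bracket is consumed only when the next character exists and
    // differs from it, so the run stops before "[[" or "]]".
    static int pureTextEnd(String s, int pos) {
        while (pos < s.length()) {
            char c = s.charAt(pos);
            if (c == '[' || c == ']') {
                if (pos + 1 >= s.length() || s.charAt(pos + 1) == c) break;
                pos += 2;
            } else if (isPlain(c)) {
                pos++;
            } else {
                break;
            }
        }
        return pos;
    }

    // Splits the input into LO, LE, '|' and PURETEXT(...) tokens.
    static List<String> tokenize(String s) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < s.length()) {
            if (s.startsWith("[[", pos)) {
                out.add("LO");
                pos += 2;
            } else if (s.startsWith("]]", pos)) {
                out.add("LE");
                pos += 2;
            } else if (s.charAt(pos) == '|') {
                out.add("'|'");
                pos++;
            } else {
                int end = pureTextEnd(s, pos);
                if (end == pos) {
                    throw new IllegalArgumentException("no token at offset " + pos);
                }
                out.add("PURETEXT(" + s.substring(pos, end) + ")");
                pos = end;
            }
        }
        return out;
    }
}
```

With this sketch, tokenize("[[Page|label]]") yields LO, PURETEXT(Page), '|', 
PURETEXT(label), LE, which is the sequence the link rule expects.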

I have to admit that this seems like a problem with the algorithm that ANTLR 
uses to determine LA(>1) lookahead when generating the lexer code.

-- Graham









