Dave Whipp wrote:
If I want to parse a language that is sensitive to whitespace
indentation (e.g. Python, Haskell), how do I do it using P6 rules/grammars?
The way I'd usually handle it is to have a lexer that examines leading
whitespace and converts it into "indent" and "unindent" tokens. The
grammar can then use these tokens in the same way that it would any
other block-delimiter.
This requires a stateful lexer, because to work out the number of
"unindent" tokens on a line, it needs to know what the indentation
positions are. How would I write a P6 rule that defines <indent> and
<unindent> tokens? Alternatively (if a different approach is needed) how
would I use P6 to parse such a language?
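
Before getting to my extract, here is a minimal sketch in Python of the kind of stateful lexer Dave describes, just to make the indent/unindent idea concrete (the token names and details are illustrative only, not taken from any particular implementation; it assumes spaces-only indentation):

# Keep a stack of open indentation columns and emit INDENT / UNINDENT
# tokens whenever the leading-whitespace column changes.
def tokenize(lines):
    stack = [0]                          # indentation columns currently open
    for line in lines:
        stripped = line.lstrip(' ')
        if not stripped.strip():         # ignore blank lines
            continue
        col = len(line) - len(stripped)
        if col > stack[-1]:              # deeper than before: open one level
            stack.append(col)
            yield ('INDENT', col)
        while col < stack[-1]:           # shallower: close each open level
            stack.pop()
            yield ('UNINDENT', col)
        yield ('LINE', stripped.rstrip('\n'))
    while len(stack) > 1:                # close anything still open at EOF
        stack.pop()
        yield ('UNINDENT', 0)

# Example: for tok in tokenize(open('example.py')): print(tok)

The grammar proper can then treat INDENT and UNINDENT exactly like opening and closing braces.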
In this context, I thought readers of this list might be interested in
the following short extract from mediawiki.lmn (rules in the
metalanguage of my language machine) which translate a subset of the
mediawiki markup notation to HTML. The extract deals with bulleted and
numbered lists, where consecutive prefix characters '*' and '#' are used
to indicate the level of nesting of each entry:
------------- start of extract from mediawiki.lmn --------------------
== bulleted and numbered lists ==
Unordered and ordered lists are a bit tricky - essentially they are like
indented blocks in Python, but a little more complex because of the way
ordered and unordered lists can be combined with each other. The
solution is that at each level, the prefix pattern of '#' and '*'
characters is known, and the level continues while that pattern is
recognised. This can be done by matching the value of a variable which
holds the pattern for the current level.
'*' <- unit - ulist :'*';
'#' <- unit - olist :'#';
ulist :A item :X repeat more item :Y <- unit ul :{X each Y} eom;
olist :A item :X repeat more item :Y <- unit ol :{X each Y} eom;
'*' <- item - ulist :{A'*'};
'#' <- item - olist :{A'#'};
ulist :A item :X repeat more item :Y <- item :{ ul :{X each Y}};
olist :A item :X repeat more item :Y <- item :{ ol :{X each Y}};
- wikitext :X <- item :{ li :X };
The following rule permits a level to continue as long as the input
matches the current prefix. We recurse for each level before getting
here, so we will always try to match the innermost levels first - they
have the longest prefix strings, and so there is no danger of a
premature match.
- A <- more ;
------------- end of extract from mediawiki.lmn ----------------------
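
For readers who prefer a conventional recursive formulation, here is a rough Python sketch of the same prefix-matching idea (the names parse_list and wiki_lists_to_html are mine, purely for illustration; the lmn rules above are the real thing):

# Each level remembers its prefix string of '*' and '#' characters and keeps
# consuming items while the next line still starts with that prefix; a longer
# prefix means a deeper, nested list, so we recurse before falling back to
# the shorter prefix of the enclosing level.
def parse_list(lines, i, prefix):
    tag = 'ul' if prefix[-1] == '*' else 'ol'
    items = []
    while i < len(lines) and lines[i].startswith(prefix):
        rest = lines[i][len(prefix):]
        if rest[:1] in ('*', '#'):        # deeper nesting: recurse, longer prefix
            inner, i = parse_list(lines, i, prefix + rest[0])
            items.append(inner)           # nested list emitted as an item, as above
        else:
            items.append('<li>%s</li>' % rest.strip())
            i += 1
    return '<%s>%s</%s>' % (tag, ''.join(items), tag), i

def wiki_lists_to_html(text):
    lines, i, out = text.splitlines(), 0, []
    while i < len(lines):
        if lines[i][:1] in ('*', '#'):
            html, i = parse_list(lines, i, lines[i][0])
            out.append(html)
        else:
            out.append(lines[i])
            i += 1
    return '\n'.join(out)

# Example: wiki_lists_to_html("* one\n*# one.one\n* two")
#   -> '<ul><li>one</li><ol><li>one.one</li></ol><li>two</li></ul>'

The difference is that the lmn rules express the same thing directly as grammatical substitutions, with the current prefix held in a variable rather than passed down an explicit recursion.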
The complete ruleset is visible at:
http://languagemachine.sourceforge.net/website.html - summary
http://languagemachine.sourceforge.net/mediawiki.html - markup
http://languagemachine.sourceforge.net/sitehtml.html - wrappings
I have fairly recently published the language machine under the GNU GPL
at SourceForge. It is implemented as a shared library written in the D
language using the GDC frontend to GNU GCC. There are several flavours
of the lmn metalanguage compiler: these are all written in lmn and share
a common frontend. These and a number of examples are on the website as
pages that have been generated directly from the source text.
My intention in creating the language machine has been to create
something that can be combined with other free software languages and
toolchains. I have recently asked the grants-secretary of the Perl
Foundation for feedback on a proposal for implementing a language
machine extension module for Perl.
The language machine is not much like any other language toolkit that I
know of. There is a page which tries to explain how it relates to the
received wisdom about language and language implementations at:
http://languagemachine.sourceforge.net/grammar.html
The language machine can produce a good deal of diagnostic information,
including a very useful diagram which shows exactly what happens when
unrestricted grammatical substitution rules are applied to an input stream:
http://languagemachine.sourceforge.net/lm-diagram.html
I would be interested to hear what you think.
Regards
Peri Hankey
--
http://languagemachine.sourceforge.net - The language machine