Re: [PUGS] patch - few hyperops

Larry Wall Sat, 12 Mar 2005 00:06:32 -0800

On Sat, Mar 12, 2005 at 03:14:33PM +0800, Autrijus Tang wrote:
: Oh, btw, are there some more documents for the &statement:<> level
: parsing and handling somewhere, or at least a general overview of
: how those things are defined? :)


Below is an excerpt of something I sent Patrick last month that might
provide a bit of help.  A bit of background--for some time we've been
proposing to use a hybrid parser with three layers: there's a top-down
parser that gets down to the expression level, a bottom-up operator
precedence parser that does expressions, and finally the "lexer"
for each term or operator is again a top-down parser called by the
operator precedence grammar whenever it needs the next lookahead.
This keeps most of the benefits of top-down parsing while letting us
avoid 24 or so levels of recursion on every term.  It also lets us
add new operators and precedence levels without having to recalculate
the entire grammar after every definition.  Anyway, the discussion
below assumes that architecture.
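To make the lower two layers concrete, here is a minimal sketch in Python, not the Pugs or Perl 6 implementation. It shows a bottom-up precedence parser (here done by precedence climbing) that asks a small top-down "lexer" for the next token and tells it whether a term or an operator is expected. The operator tables, their precedence numbers, and the names `next_token` and `op_parse` (borrowed from the rule near the end of this message) are illustrative assumptions; the statement-level top-down layer is left out.

```python
import re

# Hypothetical operator tables; the numbers are made up for illustration.
INFIX = {"+": (10, "left"), "-": (10, "left"),
         "*": (20, "left"), "**": (30, "right")}
PREFIX = {"-": 25}
NUMBER = re.compile(r"\d+")

def next_token(src, pos, expect):
    """Top-down 'lexer': the caller says whether it expects a term or an operator."""
    while pos < len(src) and src[pos].isspace():
        pos += 1
    if pos >= len(src):
        return ("end", None, pos)
    if expect == "term":
        m = NUMBER.match(src, pos)
        if m:
            return ("term", int(m.group()), m.end())
        table, kind = PREFIX, "prefix"
    else:
        table, kind = INFIX, "infix"
    # Try longer operators first so "**" wins over "*".
    for op in sorted(table, key=len, reverse=True):
        if src.startswith(op, pos):
            return (kind, op, pos + len(op))
    return ("end", None, pos)

def op_parse(src, pos=0, min_prec=0):
    """Operator precedence layer: returns (tree, position after the expression)."""
    kind, val, pos = next_token(src, pos, "term")
    if kind == "term":
        left = val
    elif kind == "prefix":
        operand, pos = op_parse(src, pos, PREFIX[val])
        left = (val, operand)
    else:
        raise SyntaxError(f"expected a term at offset {pos}")
    while True:
        kind, op, after = next_token(src, pos, "operator")
        if kind != "infix":
            return left, pos
        prec, assoc = INFIX[op]
        if prec < min_prec:
            return left, pos
        # Left-associative operators require a strictly tighter right operand.
        right, pos = op_parse(src, after, prec + 1 if assoc == "left" else prec)
        left = (op, left, right)

print(op_parse("1 + 2 * 3")[0])   # ('+', 1, ('*', 2, 3))
```

Adding an operator in this sketch is just a new table entry, which is the property the paragraph above is after: nothing has to be recomputed for the rest of the grammar.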

Larry
-------------------------------------
[snip]
But in the long run I think anything that can show up right after
a term has to be recognized by the lexer in parallel, including
infix ops.  I think we have several "spots" that are combinations of
various syntactic categories.  To oversimplify, at the start of a
statement the lexer can recognize in parallel any of:

    statement_control|term|prefix|label

Otherwise if we're expecting a term, it's:

    term|prefix

and if we're expecting an operator, it's:

    postfix|infix

The intent of the S5 redefinition of how %foo is matched is to allow
those three main categories to each be represented by a single hash
that is really a data structure functioning as a switch.  But each of
those hashes might switch out to any of several of the real syntactic
categories, as shown by the | above.
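One way to picture "a single hash functioning as a switch" is the Python sketch below. It merges several per-category token tables into one dictionary per spot and picks the longest key that matches at the current position. The token sets, the names `build_switch` and `match_longest`, and the longest-key rule are assumptions made for illustration, not something this message specifies; `label` is omitted because it is a pattern rather than a fixed token.

```python
# Hypothetical per-category token sets, for illustration only.
CATEGORIES = {
    "statement_control": {"if", "while", "for"},
    "term": {"time", "rand"},
    "prefix": {"-", "!", "++"},
    "postfix": {"++", "--"},
    "infix": {"+", "-", "*", "**", "=="},
}

def build_switch(*names):
    """Merge several syntactic categories into one token -> [categories] table."""
    merged = {}
    for name in names:
        for token in CATEGORIES[name]:
            merged.setdefault(token, []).append(name)
    return merged

def match_longest(switch, src, pos=0):
    """Return (token, categories) for the longest key matching at pos, or None."""
    best = max((t for t in switch if src.startswith(t, pos)), key=len, default=None)
    if best is None:
        return None
    return best, switch[best]

# One switch per "spot", mirroring the three alternations above.
STATEMENT = build_switch("statement_control", "term", "prefix")
TERM = build_switch("term", "prefix")
OPERATOR = build_switch("postfix", "infix")

print(match_longest(STATEMENT, "while $x"))  # ('while', ['statement_control'])
print(match_longest(OPERATOR, "**2"))        # ('**', ['infix'])
```

The value stored under each key records which real category the token came from, so one lookup both recognizes the token and says where to switch out to.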

As I say, that's oversimplified.  The real 3 states are closer to this:

Statement:
    statement_control
    term
    prefix
    label
    scope_declarator

Term:
    term
    prefix
    circumfix
    statement_modifier
    scope_declarator
    infix_postfix_meta_operator
    prefix_postfix_meta_operator

Operator:
    postfix
    postfix_prefix_meta_operator
    postcircumfix
    infix
    infix_circumfix_meta_operator
    coerce
    statement_modifier
    statement_block

Though I'm neglecting the fact that to handle our whitespace
dependencies, some of these categories are split into two substates
depending on whether we just traversed any whitespace.  So there
are really five main states (statements don't care about leading
whitespace):

Expect statement:
    statement_control
    label
    term
    prefix
    circumfix
    scope_declarator

Expect term without <ws>:
    term
    prefix
    circumfix
    statement_modifier
    scope_declarator
    infix_postfix_meta_operator
    prefix_postfix_meta_operator

Expect term after <ws>:
    term
    prefix
    circumfix
    statement_modifier
    scope_declarator

Expect operator without <ws>:
    postfix (either dotted or undotted form)
    postfix_prefix_meta_operator (either dotted or undotted form)
    postcircumfix (either dotted or undotted form)
    infix (except those hidden by undotted postfix)
    infix_circumfix_meta_operator (except those hidden by undotted postfix)
    coerce
    statement_modifier

Expect operator after <ws>:
    postfix (undotted, only if not hidden by infix)
    postfix (dotted)
    postfix_prefix_meta_operator (only if next postfix not hidden by infix)
    postcircumfix (dotted form only)
    infix
    infix_circumfix_meta_operator
    coerce
    statement_modifier
    statement_block

Or something like that.  There are other minor states, such as within
declarations where we're looking for categories like trait_verb and
trait_auxiliary, or within rules where we might pick up various rule
modifiers and such.  Or maybe those aren't really lexer states, if
they're just used by token parser rules directly and aren't visible
to the operator precedence grammar.

But those five states above are the big lexer states.  I say "lexer
states", but these states are probably kept track of by the operator
precedence parser, and it just calls into one of five rules that each
start with one of our magical hashes that parallelize these various
multiple user-visible syntactic categories.
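As a rough illustration of the five states, the Python sketch below stores the category lists from above (minus the parenthetical qualifiers) and picks a state from what the precedence parser expects and whether whitespace was just traversed. The numbering 1 to 5 follows the order of the lists; the names `STATES` and `pick_state` are hypothetical.

```python
# Category lists copied from the five states above; qualifiers omitted.
STATES = {
    1: ("expect statement",
        ["statement_control", "label", "term", "prefix", "circumfix",
         "scope_declarator"]),
    2: ("expect term without <ws>",
        ["term", "prefix", "circumfix", "statement_modifier",
         "scope_declarator", "infix_postfix_meta_operator",
         "prefix_postfix_meta_operator"]),
    3: ("expect term after <ws>",
        ["term", "prefix", "circumfix", "statement_modifier",
         "scope_declarator"]),
    4: ("expect operator without <ws>",
        ["postfix", "postfix_prefix_meta_operator", "postcircumfix", "infix",
         "infix_circumfix_meta_operator", "coerce", "statement_modifier"]),
    5: ("expect operator after <ws>",
        ["postfix", "postfix_prefix_meta_operator", "postcircumfix", "infix",
         "infix_circumfix_meta_operator", "coerce", "statement_modifier",
         "statement_block"]),
}

def pick_state(expecting, after_ws):
    """Map (what is expected, whether whitespace was just seen) to a state number."""
    if expecting == "statement":
        return 1  # statements don't care about leading whitespace
    if expecting == "term":
        return 3 if after_ws else 2
    if expecting == "operator":
        return 5 if after_ws else 4
    raise ValueError(f"unknown expectation: {expecting!r}")

name, categories = STATES[pick_state("operator", after_ws=True)]
print(name)  # expect operator after <ws>
```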

Does this give you a little better idea of where I'm pushing this?

Actually, now that I think a little more, the bottom-up engine maybe
doesn't have to know about <ws> if the 2nd and 4th states' hashes
include significant whitespace entries that fall into the 3rd and
5th states automatically.  Similarly, the statement-level hash could
just defer to the 3rd hash if it doesn't recognize anything statement-like.
Which means the operator precedence parser is back to knowing only
two states, which is proper.  Actually, the statement level parser
just calls into the bottom-up parser, which in turn will start at
the 3rd hash, assuming it starts up in expect-term-after-whitespace
state.

So the statement rule is basically:

    rule statement { %statementthing | { $\ := op_parse(3) } }

or some such, where the op_parse function is what takes the place of
your <expression> rule above.
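The fall-through idea can be sketched in Python as follows. This is a toy under stated assumptions, not the actual design: the per-state tables are tiny and hypothetical, whitespace is modeled as a special entry in states 2 and 4 that re-dispatches to states 3 and 5, and the statement table (state 1) defers to state 3 when it recognizes nothing, much as the rule above falls back to op_parse(3).

```python
import re

WS = re.compile(r"\s+")

# Hypothetical toy tables, one per numbered state; token -> category.
TABLES = {
    1: {"if": "statement_control", "my": "scope_declarator"},
    2: {"-": "prefix", "my": "scope_declarator"},
    3: {"-": "prefix", "my": "scope_declarator"},
    4: {"++": "postfix", "+": "infix"},
    5: {"+": "infix", "{": "statement_block"},
}
# A whitespace "entry" in states 2 and 4 falls into the after-<ws> state.
FALLTHROUGH = {2: 3, 4: 5}

def lookup(state, src, pos=0):
    """Return (state used, category, new position), or None if nothing matches."""
    if state in FALLTHROUGH:
        m = WS.match(src, pos)
        if m:
            return lookup(FALLTHROUGH[state], src, m.end())
    table = TABLES[state]
    best = max((t for t in table if src.startswith(t, pos)), key=len, default=None)
    if best is None:
        # The statement-level table defers to expect-term-after-<ws>.
        return lookup(3, src, pos) if state == 1 else None
    return state, table[best], pos + len(best)

print(lookup(4, "++"))    # (4, 'postfix', 2)
print(lookup(4, " + 1"))  # (5, 'infix', 2)
print(lookup(1, "-x"))    # (3, 'prefix', 1)
```

With this arrangement the caller only ever asks for state 2 or state 4 (expect term, expect operator), which matches the point above that the operator precedence parser needs to know only two states.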

Larry
