On Sat, Mar 12, 2005 at 03:14:33PM +0800, Autrijus Tang wrote:
: Oh, btw, is there some more documents for the &statement:<> level
: parsing and handling somewhere, or at least a general overview of
: how those things are defined? :)

Below is an excerpt of something I sent Patrick last month that might provide a bit of help. A bit of background--for some time we've been proposing to use a hybrid parser with three layers: there's a top-down parser that gets down to the expression level, a bottom-up operator precedence parser that does expressions, and finally the "lexer" for each term or operator is again a top-down parser called by the operator precedence grammar whenever it needs the next lookahead. This keeps most of the benefits of top-down parsing while letting us avoid 24 or so levels of recursion on every term. It also lets us add new operators and precedence levels without having to recalculate the entire grammar after every definition.

Anyway, the discussion below assumes that architecture.

Larry

-------------------------------------

[snip]

But in the long run I think anything that can show up right after a term has to be recognized by the lexer in parallel, including infix ops. I think we have several "spots" that are combinations of various syntactic categories. To oversimplify, at the start of a statement the lexer can recognize in parallel any of:

    statement_control|term|prefix|label

Otherwise if we're expecting a term, it's:

    term|prefix

and if we're expecting an operator, it's:

    postfix|infix

The intent of S5 redefinition of how %foo is matched is to allow those three main categories to each be represented by a single hash that is really a data structure functioning as a switch. But each of those hashes might switch out to any of several of the real syntactic categories, as shown by the | above. As I say, that's oversimplified.

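To make the oversimplified three-spot version concrete, here is a minimal Python sketch of "one hash per spot, functioning as a switch". It is an illustration under stated assumptions, not the S5 mechanism itself: the token spellings, the names `CATEGORY_TOKENS`, `SPOTS`, `build_switch` and `recognize`, and the longest-token-wins rule are all hypothetical.

```python
# Hypothetical sketch: each lexer "spot" is one switch table that merges
# several syntactic categories. Token spellings are illustrative only.
CATEGORY_TOKENS = {
    "statement_control": ["if", "while", "for"],
    "label": ["LINE:"],
    "term": ["42", "$x"],
    "prefix": ["-", "!", "++"],
    "postfix": ["++", "--"],
    "infix": ["+", "-", "*", "=="],
}

# The three oversimplified spots, with categories in the order given above.
SPOTS = {
    "statement": ["statement_control", "term", "prefix", "label"],
    "term": ["term", "prefix"],
    "operator": ["postfix", "infix"],
}


def build_switch(spot):
    """Merge the categories of one spot into a single token -> category table."""
    table = {}
    for category in SPOTS[spot]:
        for token in CATEGORY_TOKENS[category]:
            # Assumption: the first category listed wins on a clash. A real
            # lexer would need an explicit disambiguation rule here.
            table.setdefault(token, category)
    return table


def recognize(spot, text):
    """Return (token, category) for the longest token of this spot that
    starts `text`, or None if nothing in the spot's table matches."""
    table = build_switch(spot)
    for token in sorted(table, key=len, reverse=True):
        if text.startswith(token):
            return token, table[token]
    return None
```

The same spelling can land in different categories depending on the spot: in this sketch `recognize("term", "-$x")` reports a prefix, while `recognize("operator", "-$x")` reports an infix.
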
The real 3 states are closer to this:

    Statement:
        statement_control
        term
        prefix
        label
        scope_declarator
    Term:
        term
        prefix
        circumfix
        statement_modifier
        scope_declarator
        infix_postfix_meta_operator
        prefix_postfix_meta_operator
    Operator:
        postfix
        postfix_prefix_meta_operator
        postcircumfix
        infix
        infix_circumfix_meta_operator
        coerce
        statement_modifier
        statement_block

Though I'm neglecting the fact that to handle our whitespace dependencies, some of these categories are split into two substates depending on whether we just traversed any whitespace. So there are really five main states (statements don't care about leading whitespace):

    Expect statement:
        statement_control
        label
        term
        prefix
        circumfix
        scope_declarator
    Expect term without <ws>:
        term
        prefix
        circumfix
        statement_modifier
        scope_declarator
        infix_postfix_meta_operator
        prefix_postfix_meta_operator
    Expect term after <ws>:
        term
        prefix
        circumfix
        statement_modifier
        scope_declarator
    Expect operator without <ws>:
        postfix (either dotted or undotted form)
        postfix_prefix_meta_operator (either dotted or undotted form)
        postcircumfix (either dotted or undotted form)
        infix (except those hidden by undotted postfix)
        infix_circumfix_meta_operator (except those hidden by undotted postfix)
        coerce
        statement_modifier
    Expect operator after <ws>:
        postfix (undotted, only if not hidden by infix)
        postfix (dotted)
        postfix_prefix_meta_operator (only if next postfix not hidden by infix)
        postcircumfix (dotted form only)
        infix
        infix_circumfix_meta_operator
        coerce
        statement_modifier
        statement_block

Or something like that. There are other minor states, such as within declarations where we're looking for categories like trait_verb and trait_auxiliary, or within rules where we might pick up various rule modifiers and such. Or maybe those aren't really lexer states, if they're just used by token parser rules directly and aren't visible to the operator precedence grammar. But those five states above are the big lexer states.

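As a rough illustration, the five big states can be written down as plain data and selected from two facts: what the parser expects next, and whether whitespace was just traversed. This Python sketch rests on assumptions: the state names, `pick_state` and `whitespace_sensitive` are hypothetical, and the parenthesized conditions from the list (dotted versus undotted forms, the "hidden by" rules) are deliberately left out.

```python
# Hypothetical encoding of the five big lexer states. Only the category
# names come from the lists above; the per-entry conditions are omitted.
STATES = {
    "statement": [
        "statement_control", "label", "term", "prefix", "circumfix",
        "scope_declarator",
    ],
    "term_no_ws": [
        "term", "prefix", "circumfix", "statement_modifier",
        "scope_declarator", "infix_postfix_meta_operator",
        "prefix_postfix_meta_operator",
    ],
    "term_after_ws": [
        "term", "prefix", "circumfix", "statement_modifier",
        "scope_declarator",
    ],
    "operator_no_ws": [
        "postfix", "postfix_prefix_meta_operator", "postcircumfix", "infix",
        "infix_circumfix_meta_operator", "coerce", "statement_modifier",
    ],
    "operator_after_ws": [
        "postfix", "postfix_prefix_meta_operator", "postcircumfix", "infix",
        "infix_circumfix_meta_operator", "coerce", "statement_modifier",
        "statement_block",
    ],
}


def pick_state(expecting, saw_whitespace):
    """Choose one of the five states; statements ignore leading whitespace."""
    if expecting == "statement":
        return "statement"
    suffix = "after_ws" if saw_whitespace else "no_ws"
    return f"{expecting}_{suffix}"


def whitespace_sensitive(expecting):
    """List the categories present in only one of the two whitespace
    substates for `expecting` ("term" or "operator")."""
    no_ws = set(STATES[f"{expecting}_no_ws"])
    after_ws = set(STATES[f"{expecting}_after_ws"])
    return sorted(no_ws ^ after_ws)
```

At the category level, the only differences between substates in this encoding are the two meta-operator categories (term side) and statement_block (operator side). The remaining whitespace dependencies live in the per-entry conditions that the sketch omits.
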
I say "lexer states", but these states are probably kept track of by the operator precedence parser, and it just calls into one of five rules that each start with one of our magical hashes that parallelize these various multiple user-visible syntactic categories.

Does this give you a little better idea of where I'm pushing this?

Actually, now that I think a little more, the bottom-up engine maybe doesn't have to know about <ws> if the 2nd and 4th states' hashes include significant whitespace entries that fall into the 3rd and 5th states automatically. Similarly, the statement-level hash could just defer to the 3rd hash if it doesn't recognize anything statement-like. Which means the operator precedence parser is back to knowing only two states, which is proper.

Actually, the statement level parser just calls into the bottom-up parser, which in turn will start at the 3rd hash, assuming it starts up in expect-term-after-whitespace state. So the statement rule is basically:

    rule statement {
        %statementthing | { $\ := op_parse(3) }
    }

or some such, where the op_parse function is what takes the place of your <expression> rule above.

Larry
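
The fall-through idea in the message above (whitespace entries in the 2nd and 4th hashes leading into the 3rd and 5th, and the statement level deferring to the 3rd) can be sketched in Python as follows. Everything here is a hypothetical illustration: the hash numbering simply follows the order of the five states, the token sets are made up, and `lookup` and `statement` stand in for whatever the real rules and `op_parse(3)` would do.

```python
# Hypothetical sketch: hashes 2 and 4 carry a whitespace entry that falls
# through to hashes 3 and 5, so the caller only tracks "expect term" versus
# "expect operator". Token sets are invented for illustration.
HASHES = {
    1: {"if": "statement_control", "my": "scope_declarator"},
    2: {"-": "prefix", "42": "term", " ": 3},    # whitespace entry -> hash 3
    3: {"-": "prefix", "42": "term"},
    4: {"++": "postfix", "+": "infix", " ": 5},  # whitespace entry -> hash 5
    5: {"+": "infix", "{": "statement_block"},
}


def lookup(hash_no, text):
    """Return (category, token, hash actually used), or None on no match."""
    table = HASHES[hash_no]
    for token in sorted(table, key=len, reverse=True):
        if text.startswith(token):
            target = table[token]
            if isinstance(target, int):
                # Significant whitespace: skip it and retry in the
                # corresponding after-<ws> hash.
                return lookup(target, text.lstrip(" "))
            return target, token, hash_no
    return None


def statement(text):
    """Try the statement-level hash first; otherwise defer to hash 3,
    standing in for the op_parse(3) call in the rule above."""
    return lookup(1, text) or lookup(3, text)
```

In this sketch a caller that asks for hash 4 never needs to know whether whitespace was present: `lookup(4, " {")` ends up in hash 5 and finds a statement_block, while `lookup(4, "{")` finds nothing.
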