Hi Akim, hi Christian,

Really interesting thread and discussion! I almost feel guilty for only picking up one particular point ;)
Akim's comment

> No, it's also about the targeted model. Most other GLR parser generators
> address a somewhat simpler issue: their parsers generate the final AST.
> They are DOM-parsers. Bison is more like SAX-parsers: we run user actions.
> Memory management is really a problem here.

touches an interesting point to me. It made me realize: we in Hyper are actually using Bison only to generate a DOM tree. We built our own abstraction on top of Bison, a "bison preprocessor" which translates a kind of ad-hoc "DOM-description language" used in the parser actions into actual C++ code for our actions, which can then be fed into Bison.

Would it be valuable to have something similar in Bison? I.e., teach Bison some kind of shorthand syntax which can be used in actions to create a DOM tree? Would others also be interested in such a DOM-building capability?
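To make that more concrete, here is a minimal sketch of what such a shorthand and its expansion could look like. The {{ ... }} syntax, the Node type, and make_node are invented purely for illustration; this is neither Hyper's actual preprocessor output nor an existing Bison feature:

    /* Hypothetical shorthand in a grammar rule:

         expr: expr '+' expr   {{ Node("add", $1, $3) }}

       which a preprocessor could expand into an ordinary Bison/C++ action:

         expr: expr '+' expr   { $$ = make_node("add", {$1, $3}); }

       A minimal DOM node type such an expansion could target: */
    #include <memory>
    #include <string>
    #include <vector>

    struct Node {
      std::string kind;                             // node label, e.g. "add"
      std::vector<std::shared_ptr<Node>> children;  // sub-trees, in rule order
    };

    std::shared_ptr<Node> make_node(std::string kind,
                                    std::vector<std::shared_ptr<Node>> children) {
      auto n = std::make_shared<Node>();
      n->kind = std::move(kind);
      n->children = std::move(children);
      return n;
    }

    int main() {
      // What the expanded action would effectively do for "1 + 2".
      auto e = make_node("add", {make_node("num", {}), make_node("num", {})});
      return e->children.size() == 2 ? 0 : 1;
    }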
Cheers,
Adrian

On 18/07/2020, 08:47, "help-bison on behalf of Akim Demaille" <help-bison-bounces+avogelsgesang=tableau....@gnu.org on behalf of a...@lrde.epita.fr> wrote:

> Le 14 juil. 2020 à 13:31, Christian Schoenebeck <schoeneb...@crudebyte.com> a écrit :
>
> On Montag, 13. Juli 2020 07:56:57 CEST Akim Demaille wrote:
>
> For the 'married' parser, in the form imagined by me, there would be no
> lookahead token, as a lookahead token implies a context-unaware scanner.
>
> Instead, when the parser would get to a state where it had to decide between
> reducing or shifting (depending on a next token), it would not decide, and
> instead fork the parser state, in order to allow context-specific tokenization.

That's indeed a possible choice in the implementation of GLR parser generators: to sit GLR on top of LR(0). I've never seen performance reports that compare both approaches, though.

However, most implementations use SLR(1), because it is super simple to implement (especially compared to LALR(1)), yet it makes a significant difference.

>> Yes, indeed. That's something which Joel detailed in his PhD. Note
>> though that PSLR targets deterministic parsing. Bison would (first)
>> target that, not GLR. But contributions are always welcome :)
>
> The longer I think about it, the more I wonder whether it wouldn't make sense
> to make a clear and radical cut with the past when developing a 'married'
> parser to allow unfolding its full potential.
>
> I mean I look at what you are working on, Akim, and AFAICS the majority of time
> you have to spend on maintaining (i.e. not breaking) backward compatibility
> for ancient use cases. Most of Bison's conceptual design and internals are
> still based on requirements and opinions from around 1980 which simply no
> longer exist or proved wrong.

I don't subscribe to this opinion :) The core of Bison is well established, and I have very little maintenance to do on it. As a matter of fact, if you look at the recent history, there's hardly any work in the automaton construction. Most of my recent efforts are about diagnostics: diagnostics about the grammar (which is also what the current work on counterexamples is about), and in the generated parsers (to provide better error messages).

Many, many people *want* a *deterministic* and *fast* parser. They want determinism because with GLR (with conflicts) you just don't know if some day your parser will fail because of an unforeseen ambiguity. I know people who build critical software with Bison, and "possible failure" on valid input is not an option. Bison is strong in this regard.

IELR is arguably the most powerful form of LR(1) parsing you can find on the market, with the size of LR(0). And you can have canonical LR(1) if you want.

GLR is slow compared to LR. People rely on our generated LR parsers being fast. That's a true feature of Bison's. Not only does determinism provide people with guarantees ("your grammar is unambiguous"), it also provides them with efficiency. Although I don't have figures about "blind GLR" (i.e., seated on LR(0)), I'm pretty sure it would be disastrous for anything serious. So you'd have to go at least to SLR(1). But then, what would be the point of throwing away all the good technology we have for LALR(1), IELR(1) and canonical LR(1)?

If I were to build a GLR parser from scratch today, I would certainly aim at SLR(1). You are right, LALR(1) is a nightmare, but Robert Corbett fought that fight some 40 years ago, and IELR(1) was certainly super tricky, but Joel E. Denny won that battle more than 10 years ago. Why would I discard such great achievements?

> Besides what we already discussed, I think it would make sense to get rid
> of a bunch of other things that Bison still retains:
>
> 1. No longer multiple parser types, only one: a somewhat GLR-alike parser like
>    discussed as a real superset of LALR, but without lookahead token (so not
>    really LALR based under the hood). Theoretically there is no disadvantage
>    in doing so, because if a developer feeds it with a LALR(1)-capable
>    grammar, it would still end up having the same parsing
>    complexity=efficiency as an old-school LALR would do, probably just with
>    slightly higher memory footprint.
>
>    Advantages though: much more user friendly and powerful, and a lot less
>    code to maintain.

I bet we are talking about making the parsers at least one order of magnitude slower here (blind GLR). Of course this is not an option.

Besides, if people want to look for brand new trendy techniques, why would they come to Bison? There are plenty of alternatives out there, and if Bison still exists today, it's because there's already tons of software that relies on its current behavior. I will *not* break that. *Never*. That's our "raison d'être" (some English native should check my use of French expressions, I'm not sure I'm using it properly :-). I can add *more* options, provide better alternatives, but Bison cannot afford to drop features just like that.

> 2. No longer raw C tables as pregenerated parser and scanner tables; instead
>    of such C tables: an intermediate, dynamic in-memory model with a high-level
>    API allowing to access and modify grammar and patterns on the fly.
>
>    Advantage: it would allow self-extensible languages (e.g. an input that
>    adds keywords, patterns, drops or adds grammar rules while being parsed).
>    And implementation-specific things would finally clearly be isolated by
>    introducing such an API -> easier to maintain.
>
>    Disadvantages: probably none, except consuming slightly more memory for
>    meta information.

This is interesting, but that's not Bison you are talking about.

> 3. No longer aggressive compression of parser states (which is somewhat
>    already implied by [2.]), because such compressions lead to important
>    information loss when it comes to grammar debugging or auto-completion
>    features and other user-relevant purposes. The motivation that led to such
>    compressions (very low RAM space) no longer exists in like >99% of use cases
>    today.

Come on! We all know that today the very same constraints ("be small") still apply, but indeed for completely different reasons. Compressing used to be about fitting in the RAM; today it's about fitting in the cache. Not to mention lower-end devices.
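For concreteness, here is a self-contained sketch of one classic compression scheme for such tables, row-displacement ("comb") compression: sparse per-state rows are packed into a single array, with a check array to detect misses and a per-state default action. The names (PackedTable, pack, action) and the layout are simplified and invented for illustration; this is not Bison's actual generated-table format.

    #include <cstdio>
    #include <vector>

    struct PackedTable {
      std::vector<int> base;    // per state: offset into table/check
      std::vector<int> table;   // packed non-default actions
      std::vector<int> check;   // owner state of each packed slot (-1 = free)
      std::vector<int> deflt;   // per state: default action
    };

    // Pack a sparse matrix rows[state][symbol], where 0 means "use the default".
    PackedTable pack(const std::vector<std::vector<int>>& rows,
                     const std::vector<int>& defaults) {
      int nsyms = rows.empty() ? 0 : (int)rows[0].size();
      PackedTable p;
      p.deflt = defaults;
      p.base.resize(rows.size());
      for (size_t s = 0; s < rows.size(); ++s) {
        // Find the smallest offset where this row's non-default entries
        // land in free slots of the packed array.
        int off = 0;
        for (;; ++off) {
          bool fits = true;
          for (int sym = 0; sym < nsyms; ++sym) {
            if (rows[s][sym] == 0) continue;
            size_t slot = off + sym;
            if (slot < p.check.size() && p.check[slot] != -1) { fits = false; break; }
          }
          if (fits) break;
        }
        p.base[s] = off;
        for (int sym = 0; sym < nsyms; ++sym) {
          if (rows[s][sym] == 0) continue;
          size_t slot = off + sym;
          if (slot >= p.table.size()) {
            p.table.resize(slot + 1, 0);
            p.check.resize(slot + 1, -1);
          }
          p.table[slot] = rows[s][sym];
          p.check[slot] = (int)s;
        }
      }
      return p;
    }

    // Lookup costs one addition and one comparison, and the packed arrays are
    // small enough to stay cache-resident, which is the point made above.
    int action(const PackedTable& p, int state, int symbol) {
      size_t slot = p.base[state] + symbol;
      if (slot < p.check.size() && p.check[slot] == state) return p.table[slot];
      return p.deflt[state];
    }

    int main() {
      // Two tiny "states" over 4 symbols; 0 = default action.
      std::vector<std::vector<int>> rows = {{5, 0, 0, 7}, {0, 3, 0, 0}};
      std::vector<int> defaults = {-1, -2};
      PackedTable p = pack(rows, defaults);
      std::printf("%d %d %d\n", action(p, 0, 0), action(p, 0, 1), action(p, 1, 1));
      // prints: 5 -1 3
    }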
I don't believe in the Unique True Omnipotent Parser Yielder (let's call it UTOPY for short). There are far too many different use cases. Some want to change the language dynamically, some want determinism, some want something DOM-oriented, not SAX-like, some have strong CPU/time constraints, some don't care, some experiment, others need to be production ready, etc., etc., etc. Bison fits into a niche, and it would be an error to depart from this.

> So these concepts are probably too radical for trying to still fit them into
> Bison's current (3.x) design.

Don't expect that in Bison 4 either.

>> Parsers such as SGLR have stacks which are graphs, while Bison's is
>> always a tree. All GLR parsers manage to share the past of concurrent
>> stacks, they don't have a set of separate stacks. Bison's GLR stacks,
>> of course, share their common past, it's stored only once. I believe
>> SGLR is also able to share the common "future", so their stacks are
>> no longer trees. This allows SGLR to keep decent performance when parsing
>> ambiguous grammars and you want to fetch the resulting parse forest.
>
> Ok, I read it as Bison's current GLR parser probably not having seen much
> attention in the past.

No, it's also about the targeted model. Most other GLR parser generators address a somewhat simpler issue: their parsers generate the final AST. They are DOM-parsers. Bison is more like SAX-parsers: we run user actions. Memory management is really a problem here.

If you wish, Bison's GLR is tailored for "mostly deterministic grammars". Its point is more to grant us infinite lookahead than to be a means to cover ambiguous grammars. At least that's the mental picture I have of Paul Hilfinger's work on GLR. And AFAICT, there were never any complaints about that.

> BTW, from my perspective a set of parser stacks (with shared past) is IMO
> still a tree, however a tree with 3 dimensions instead of 2. A 'graph' for me
> is something much more generic, as a graph allows much more complex structures
> (with fewer restrictions) like traversing to any other state, which is not
> possible in a tree.

We agree here. You seem to be answering something I did not mean, so I venture I was unclear. Bison's GLR stack is a tree, SGLR's is a DAG. Bison's GLR stacks only have forks, SGLR's also have joins (a small sketch of this difference follows below).

Cheers!
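A rough sketch of the fork/join difference mentioned above. The type and field names are invented for illustration; this is neither Bison's glr.c nor SGLR source code.

    #include <memory>
    #include <vector>

    // Fork-only stacks (a tree): every stack node has exactly one predecessor,
    // so concurrent stacks share their common past but never re-merge.
    struct TreeStackNode {
      int state;
      std::shared_ptr<TreeStackNode> pred;  // single parent: forks only
    };

    // Graph-structured stack (a DAG): a node may have several predecessors,
    // so stacks reaching the same state can join again and share their
    // "future" as well as their past.
    struct GssNode {
      int state;
      std::vector<std::shared_ptr<GssNode>> preds;  // multiple parents: joins too
    };

    int main() {
      // A fork: two stack tops sharing a common past (possible in both models).
      auto bottom = std::make_shared<GssNode>(GssNode{0, {}});
      auto a = std::make_shared<GssNode>(GssNode{1, {bottom}});
      auto b = std::make_shared<GssNode>(GssNode{2, {bottom}});
      // A join: one node with two predecessors, which only the DAG model allows.
      auto top = std::make_shared<GssNode>(GssNode{3, {a, b}});
      (void)top;
    }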