Hi Akim, hi Christian,

Really interesting thread and discussion! I almost feel guilty for only picking up one particular point ;)
Akim's comment

> No, it's also about the targeted model. Most other GLR parser generators
> address a somewhat simpler issue: their parsers generate the final AST.
> They are DOM-parsers. Bison is more like SAX-parsers: we run user actions.
> Memory management is really a problem here.

touches an interesting point to me. It made me realize: we in Hyper are actually using Bison only to generate a DOM tree. We built our own abstraction on top of Bison, a "bison preprocessor" which translates a kind of ad-hoc "DOM-description language" used in the parser actions into actual C++ code for our actions, which can then be fed into Bison.

Would it be valuable to have something similar in Bison? I.e., teach Bison some kind of shorthand syntax which can be used in actions to create a DOM tree? Would others also be interested in such a DOM-building capability?
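To make that more concrete, here is a minimal sketch of what such a shorthand and its expansion could look like. The {{ ... }} syntax, the Node type, and make_node are invented purely for illustration; this is neither Hyper's actual preprocessor output nor an existing Bison feature:

    /* Hypothetical shorthand in a grammar rule:

         expr: expr '+' expr   {{ Node("add", $1, $3) }}

       which a preprocessor could expand into an ordinary Bison/C++ action:

         expr: expr '+' expr   { $$ = make_node("add", {$1, $3}); }

       A minimal DOM node type such an expansion could target: */
    #include <memory>
    #include <string>
    #include <vector>

    struct Node {
      std::string kind;                             // node label, e.g. "add"
      std::vector<std::shared_ptr<Node>> children;  // sub-trees, in rule order
    };

    std::shared_ptr<Node> make_node(std::string kind,
                                    std::vector<std::shared_ptr<Node>> children) {
      auto n = std::make_shared<Node>();
      n->kind = std::move(kind);
      n->children = std::move(children);
      return n;
    }

    int main() {
      // What the expanded action would effectively do for "1 + 2".
      auto e = make_node("add", {make_node("num", {}), make_node("num", {})});
      return e->children.size() == 2 ? 0 : 1;
    }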
Cheers,
Adrian

On 18/07/2020, 08:47, "help-bison on behalf of Akim Demaille" <help-bison-bounces+avogelsgesang=tableau....@gnu.org on behalf of a...@lrde.epita.fr> wrote:

> Le 14 juil. 2020 à 13:31, Christian Schoenebeck <schoeneb...@crudebyte.com> a écrit :
>
> On Montag, 13. Juli 2020 07:56:57 CEST Akim Demaille wrote:
>
> For the 'married' parser, in the form imagined by me, there would be no
> lookahead token, as a lookahead token implies a context-unaware scanner.
>
> Instead, when the parser would get to a state where it had to decide between
> reducing or shifting (depending on a next token), it would not decide, and
> instead fork the parser state, in order to allow context-specific tokenization.

That's indeed a possible choice in the implementation of GLR parser generators: to sit GLR on top of LR(0). I've never seen performance reports that compare both approaches, though.

However, most implementations use SLR(1), because it is super simple to implement (especially compared to LALR(1)), yet it makes a significant difference.

>> Yes, indeed. That's something which Joel detailed in his PhD. Note
>> though that PSLR targets deterministic parsing. Bison would (first)
>> target that, not GLR. But contributions are always welcome :)
>
> The longer I think about it, the more I wonder whether it wouldn't make sense
> to make a clear and radical cut with the past when developing a 'married'
> parser to allow unfolding its full potential.
>
> I mean I look at what you are working on, Akim, and AFAICS the majority of time
> you have to spend on maintaining (i.e. not breaking) backward compatibility
> for ancient use cases. Most of Bison's conceptual design and internals are
> still based on requirements and opinions from around 1980 which simply no
> longer exist or proved wrong.

I don't subscribe to this opinion :) The core of Bison is well established, and I have very little maintenance to do on it. As a matter of fact, if you look at the recent history, there's hardly any work in the automaton construction. Most of my recent efforts are about diagnostics: diagnostics about the grammar (which is also what the current work on counterexamples is about), and in the generated parsers (to provide better error messages).

Many, many people *want* a *deterministic* and *fast* parser. They want determinism because with GLR (with conflicts) you just don't know if some day your parser will fail because of an unforeseen ambiguity. I know people who build critical software with Bison, and "possible failure" on valid input is not an option. Bison is strong in this regard.

IELR is arguably the most powerful form of LR(1) parsing you can find on the market, with the size of LR(0). And you can have canonical LR(1) if you want.

GLR is slow compared to LR. People rely on our generated LR parsers being fast. That's a true feature of Bison's. Not only does determinism provide people with guarantees ("your grammar is unambiguous"), it also provides them with efficiency. Although I don't have figures about "blind GLR" (i.e., seated on LR(0)), I'm pretty sure it would be disastrous for anything serious. So you'd have to go at least to SLR(1). But then, what would be the point of throwing away all the good technology we have for LALR(1), IELR(1) and canonical LR(1)?

If I were to build a GLR parser from scratch today, I would certainly aim at SLR(1). You are right, LALR(1) is a nightmare, but Robert Corbett fought that fight some 40 years ago, and IELR(1) was certainly super tricky, but Joel E. Denny won that battle more than 10 years ago. Why would I discard such great achievements?

> Besides what we already discussed, I think it would make sense to get rid
> of a bunch of other things that Bison still retains:
>
> 1. No longer multiple parser types, only one: a somewhat GLR-alike parser like
>    discussed as a real superset of LALR, but without lookahead token (so not
>    really LALR based under the hood). Theoretically there is no disadvantage
>    in doing so, because if a developer feeds it with a LALR(1)-capable
>    grammar, it would still end up having the same parsing
>    complexity=efficiency as an old-school LALR would do, probably just with
>    slightly higher memory footprint.
>
>    Advantages though: much more user friendly and powerful, and a lot less
>    code to maintain.

I bet we are talking about making the parsers at least one order of magnitude slower here (blind GLR). Of course this is not an option.

Besides, if people want to look for brand new trendy techniques, why would they come to Bison? There are plenty of alternatives out there, and if Bison still exists today, it's because there's already tons of software that relies on its current behavior. I will *not* break that. *Never*. That's our "raison d'être" (some English native should check my use of French expressions, I'm not sure I'm using it properly :-). I can add *more* options, provide better alternatives, but Bison cannot afford to drop features just like that.

> 2. No longer raw C tables as pregenerated parser and scanner tables; instead
>    of such C tables: an intermediate, dynamic in-memory model with a high-level
>    API allowing to access and modify grammar and patterns on the fly.
>
>    Advantage: it would allow self-extensible languages (e.g. an input that
>    adds keywords, patterns, drops or adds grammar rules while being parsed).
>    And implementation-specific things would finally clearly be isolated by
>    introducing such an API -> easier to maintain.
>
>    Disadvantages: probably none, except consuming slightly more memory for
>    meta information.

This is interesting, but that's not Bison you are talking about.

> 3. No longer aggressive compression of parser states (which is somewhat
>    already implied by [2.]), because such compressions lead to important
>    information loss when it comes to grammar debugging or auto-completion
>    features and other user-relevant purposes. The motivation that led to such
>    compressions (very low RAM space) no longer exists in like >99% of use cases
>    today.

Come on! We all know that today the very same constraints ("be small") still apply, but indeed for completely different reasons. Compressing used to be about fitting in the RAM; today it's about fitting in the cache. Not to mention lower-end devices.
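For concreteness, here is a self-contained sketch of one classic compression scheme for such tables, row-displacement ("comb") compression: sparse per-state rows are packed into a single array, with a check array to detect misses and a per-state default action. The names (PackedTable, pack, action) and the layout are simplified and invented for illustration; this is not Bison's actual generated-table format.

    #include <cstdio>
    #include <vector>

    struct PackedTable {
      std::vector<int> base;    // per state: offset into table/check
      std::vector<int> table;   // packed non-default actions
      std::vector<int> check;   // owner state of each packed slot (-1 = free)
      std::vector<int> deflt;   // per state: default action
    };

    // Pack a sparse matrix rows[state][symbol], where 0 means "use the default".
    PackedTable pack(const std::vector<std::vector<int>>& rows,
                     const std::vector<int>& defaults) {
      int nsyms = rows.empty() ? 0 : (int)rows[0].size();
      PackedTable p;
      p.deflt = defaults;
      p.base.resize(rows.size());
      for (size_t s = 0; s < rows.size(); ++s) {
        // Find the smallest offset where this row's non-default entries
        // land in free slots of the packed array.
        int off = 0;
        for (;; ++off) {
          bool fits = true;
          for (int sym = 0; sym < nsyms; ++sym) {
            if (rows[s][sym] == 0) continue;
            size_t slot = off + sym;
            if (slot < p.check.size() && p.check[slot] != -1) { fits = false; break; }
          }
          if (fits) break;
        }
        p.base[s] = off;
        for (int sym = 0; sym < nsyms; ++sym) {
          if (rows[s][sym] == 0) continue;
          size_t slot = off + sym;
          if (slot >= p.table.size()) {
            p.table.resize(slot + 1, 0);
            p.check.resize(slot + 1, -1);
          }
          p.table[slot] = rows[s][sym];
          p.check[slot] = (int)s;
        }
      }
      return p;
    }

    // Lookup costs one addition and one comparison, and the packed arrays are
    // small enough to stay cache-resident, which is the point made above.
    int action(const PackedTable& p, int state, int symbol) {
      size_t slot = p.base[state] + symbol;
      if (slot < p.check.size() && p.check[slot] == state) return p.table[slot];
      return p.deflt[state];
    }

    int main() {
      // Two tiny "states" over 4 symbols; 0 = default action.
      std::vector<std::vector<int>> rows = {{5, 0, 0, 7}, {0, 3, 0, 0}};
      std::vector<int> defaults = {-1, -2};
      PackedTable p = pack(rows, defaults);
      std::printf("%d %d %d\n", action(p, 0, 0), action(p, 0, 1), action(p, 1, 1));
      // prints: 5 -1 3
    }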
I don't believe in the Unique True Omnipotent Parser Yielder (let's call it UTOPY for short). There are far too many different use cases. Some want to change the language dynamically, some want determinism, some want something DOM-oriented, not SAX-like, some have strong CPU/time constraints, some don't care, some experiment, others need to be production ready, etc., etc., etc. Bison fits into a niche, and it would be an error to depart from this.

> So these concepts are probably too radical for trying to still fit them into
> Bison's current (3.x) design.

Don't expect that in Bison 4 either.

>> Parsers such as SGLR have stacks which are graphs, while Bison's is
>> always a tree. All GLR parsers manage to share the past of concurrent
>> stacks, they don't have a set of separate stacks. Bison's GLR stacks,
>> of course, share their common past, it's stored only once. I believe
>> SGLR is also able to share the common "future", so their stacks are
>> no longer trees. This allows SGLR to keep decent performance when parsing
>> ambiguous grammars and you want to fetch the resulting parse forest.
>
> Ok, I read it as Bison's current GLR parser probably not having seen much
> attention in the past.

No, it's also about the targeted model. Most other GLR parser generators address a somewhat simpler issue: their parsers generate the final AST. They are DOM-parsers. Bison is more like SAX-parsers: we run user actions. Memory management is really a problem here.

If you wish, Bison's GLR is tailored for "mostly deterministic grammars". Its point is more to grant us infinite lookahead than to be a means to cover ambiguous grammars. At least that's the mental picture I have of Paul Hilfinger's work on GLR. And AFAICT, there were never any complaints about that.

> BTW, from my perspective a set of parser stacks (with shared past) is IMO
> still a tree, however a tree with 3 dimensions instead of 2. A 'graph' for me
> is something much more generic, as a graph allows much more complex structures
> (with fewer restrictions) like traversing to any other state, which is not
> possible in a tree.

We agree here. You seem to be answering something I did not mean, so I venture I was unclear. Bison's GLR stack is a tree, SGLR's is a DAG. Bison's GLR stacks only have forks, SGLR's also have joins (a small sketch of this difference follows below).

Cheers!
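A rough sketch of the fork/join difference mentioned above. The type and field names are invented for illustration; this is neither Bison's glr.c nor SGLR source code.

    #include <memory>
    #include <vector>

    // Fork-only stacks (a tree): every stack node has exactly one predecessor,
    // so concurrent stacks share their common past but never re-merge.
    struct TreeStackNode {
      int state;
      std::shared_ptr<TreeStackNode> pred;  // single parent: forks only
    };

    // Graph-structured stack (a DAG): a node may have several predecessors,
    // so stacks reaching the same state can join again and share their
    // "future" as well as their past.
    struct GssNode {
      int state;
      std::vector<std::shared_ptr<GssNode>> preds;  // multiple parents: joins too
    };

    int main() {
      // A fork: two stack tops sharing a common past (possible in both models).
      auto bottom = std::make_shared<GssNode>(GssNode{0, {}});
      auto a = std::make_shared<GssNode>(GssNode{1, {bottom}});
      auto b = std::make_shared<GssNode>(GssNode{2, {bottom}});
      // A join: one node with two predecessors, which only the DAG model allows.
      auto top = std::make_shared<GssNode>(GssNode{3, {a, b}});
      (void)top;
    }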