[PATCH] D125231: [pseudo] Compile cxx grammar.

Sam McCall via Phabricator via cfe-commits Mon, 09 May 2022 13:09:02 -0700

sammccall added a comment.

Cool! High level thoughts:


**Motivation**
What's the win here? performance of loading/compiling the grammar? 
self-contained-ness? how much are we saving?
Doing the compilation at build time seems likely to be good for performance, 
but we should have numbers. And there are costs to doing it now vs in a few 
months, so there should be benefits too.

In terms of development, having everything grammar-related be compiled makes it 
harder to make changes to the grammar (e.g. constraints on rules, error hints, 
and whatever else we might want).
One way around this: start with a version where the public interface looks like 
this, but the impl is just compiling the strings in the background. (Probably a 
part of the generator that just embeds the lines of the grammar in a string 
array).Then opt-in the critical and stable bits at compile time, like the LR 
table build. Maybe eventually we'll have everything fully compiled, but it 
doesn't seem like the right balance yet.

**Build performance**
Compile time/ram is my biggest technical concern. We know that there's no good 
portable way to get blobs of data into the build system, and the parser is the 
bottleneck. (See the sad std::embed situation). 300KB should be manageable, 
with some work.
There are things we can do to mitigate:

- for the big lists, use simple top-level declarations of arrays with types, 
not nested expressions ending up as std::initializer_lists. Just parsing the 
elements is going to be slow, but there's no reason to involve overload 
resolution, force them to all be in memory at once, etc.
- split each big list into its own file for build-system level parallelism. In 
particular actions[] is ~half the data. This is easy once it's a top-level 
array.
- not parse-time, but related: switch the classes to holding ArrayRefs rather 
than Vectors so we don't have to copy everything. (`vector{1,2,3}` is always a 
copy, it can't share the underlying static because it's mutable).
- make the element types simpler to parse. Currently the Actions buffer is a 
`vector<class Action>` which at the bottom is two numbers packed into 16 bits. 
This could instead be a `uint16_t Actions[]={...};`, with LRTable wrapping the 
uint16_ts into Action objects on demand (at zero runtime cost).
- if we just have an array of 16-bit integers, it's possible making that source 
file C would be faster to parse than C++ (fewer language features). I doubt it 
but...

**Layering**

- the generator shouldn't depend on much, to avoid creating long paths in the 
build graph. It's pretty good now. I may be missing something, but does it need 
to link against clangBasic? If this is for functions like getTokenName() we 
could consider reimplementing them. (I think clangPseudoGrammar may be a better 
name than clangPseudoBasic).
- similarly, the generated code should avoid dependencies particularly if it's 
slow to compile - maybe it doesn't need clangBasic either? (For our purposes 
it'd also be nice to avoid depending on any tablegenned headers).

**Scope/generality**
I think we're should plan to run this generator in exactly one configuration 
(i.e. for our C++ grammar), we shouldn't pay the costs of making it efficient 
to reuse.
This means we don't need to generate the whole library with complete API 
surface, just the bits that really need to be generated. Generated code is 
harder to maintain and browse.

- in the header, this is just the symbol list (and later the rule list). the 
rest of the header can be hand-written.
- we'd want a cpp file for each array (maybe group some together), but none of 
the initialization logic that uses them: we can just forward declare and use 
those arrays from hand-written C++

Maybe later we want to be able to compile a different grammar for C and we 
revisit this (but I think equally likely we work out how to express language 
options in terms of dynamically-disabling grammar rules instead).


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D125231/new/

https://reviews.llvm.org/D125231

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D125231: [pseudo] Compile cxx grammar.

Reply via email to