On Tue, Sep 03, 2024 at 03:30:05PM +0100, Iain Sandoe wrote:
> 
> 
> > On 3 Sep 2024, at 13:59, Nathaniel Shead <nathanielosh...@gmail.com> wrote:
> > 
> > On Tue, Sep 03, 2024 at 10:14:29AM +0100, Iain Sandoe wrote:
> >> Hi Folks,
> >> 
> >> When we build a C++ binary module (CMI/BMI), we obviously have access to 
> >> its source to produce diagnostics, all fine.
> >> 
> >> However, when we consume that module we might also need access to the 
> >> sources used to build it - since diagnostics triggered in the consumer can 
> >> refer back to the sources used.
> >> 
> >> Currently clang has been experimenting with embedding the sources into the 
> >> BMI - this can make things seem more efficient when, for example, 
> >> distributing BMIs to remote nodes in a large-scale distributed build.
> >> 
> >> There was a patch proposed to make this the default for clang, which has 
> >> resulted in the discussion here:
> >> 
> >> https://discourse.llvm.org/t/rfc-modules-should-we-embed-sources-to-the-bmi/81029
> >> 
> >> Does GCC have a plan to deal with this?
> >> .. if so can we communicate it..
> >> .. if not, then what do we think about this strategy?
> >> .. at the very least trying to avoid tooling divergence seems a worthwhile 
> >> objective.
> >> 
> >> thanks
> >> Iain
> >> 
> > 
> > I've thought a little bit about this in the past but I don't currently
> > think embedding source files by default would be a great idea.
> > 
> > For GCC, the main benefit of embedding the original source would be that
> > currently, if you compile a module interface, then edit the original
> > source file before compiling a TU that consumes that source file,
> > diagnostics referring to the original module interface may appear
> > "incorrect" and quote unrelated code or point to incorrect lines, which
> > can potentially be confusing:
> > 
> > x.cpp: In function ‘int main()’:
> > x.cpp:4:7: error: invalid conversion from ‘const char*’ to ‘int’ 
> > [-fpermissive]
> >    4 |   foo("hello");
> >      |       ^~~~~~~
> >      |       |
> >      |       const char*
> > In module M, imported at x.cpp:1:
> > m.cpp:2:17: note:   initialising argument 1 of ‘void foo@M(int)’
> >    2 | hello this file is edited
> >      |                 ^~~
> 
> I would expect, in most build systems, that any change in the BMI sources
> would cause a rebuild of the BMI itself - anything else is not actually safe?
> 

Right; this mostly only comes up if you're compiling 'by hand' and
forget to recompile a dependent module, which as you say is not safe to
start with, hence why I don't think it's much of an issue.  I mostly
bring this up because it can be quite confusing if it does happen.

> > This is the same issue you might get with listing code using GDB on a
> > binary that is older than the source files it was built with; GDB
> > currently provides a nice warning when this happens that the source file
> > may be out of date, that I imagine could be helpful in this circumstance
> > as well.  But this would not require embedding any source files, just
> > comparing the mtime of the CMI vs. the source and/or header file in
> > question.
> > 
> > Either way I don't think this is much of an issue in practice: I would
> > expect users to typically be using build systems that will ensure this
> > above situation never occurs to begin with, and even in cases where the
> > source files are completely unavailable diagnostics still work fine as
> > it is, just without quoting of the source code (though line:column
> > information is still available).
> > 
> > And on the flip side, embedding source files could grow the CMIs quite a
> > lot; naively just embedding all source files for something like
> > 
> >  module;
> >  #include <string>
> >  export module M;
> >  export std::string foo();
> > 
> > would presumably also require embedding the entire contents of <string>
> > and all headers that it depends on, recursively.  And this is quite
> > apart from the potential questions around IP mentioned in the thread you
> > linked.
> 
> There is some (anecdotal) evidence that for current AST representation on
> clang, at least, that the BMI size does not grow much.  See discourse 
> discussion
> for cross references.
> 

With the above (admittedly somewhat contrived) example, the CMI that GCC
emits is currently 2381 KB, and storing the (uncompressed) contents of
<string> with '-E -fdirectives-only' as a guesstimate for embedding
would add 1736 KB, for an ~70% increase in size.  Obviously as the
proportion of "useful work" in a module interface grows in comparison to
the amount of code included via headers this ratio would improve.

That said, this would not be an issue in this particular case once
'import std' becomes available, and using a header unit would not have
this issue either (only the header unit CMI would need to embed these
source files rather than every CMI including the header), so perhaps
this is not a large problem?

Yours,
Nathaniel

> cheers
> Iain
> 
> > 
> > Yours,
> > Nathaniel
> 

Reply via email to