Thanks a lot for your detailed feedback!
I will rework my patch, especially to make the get_source_line_preprocessed 
function
more readable and more efficient.

Some comments in the mean time:

> This may sound silly, but please use "num" rather than "no" as an
> abbreviation for "number": to my under-caffeinated brain "no" reads
> as
> the opposite of "yes".

I fully agree, seems like 5 years ago my naming taste was a bit off.

> What about buffer[1]?  Does this have to be a ' '?
...
> Don't we also need a trailing '"' and a newline?

Ack, doesn't hurt to check buffer[1]. The newline might not come directly after 
the trailing '"' however, as
line markers can contain additional flags after the source file name. If we 
have:

# (\d+) "(.+)"

it should be safe to assume it's a line marker.

> Has this code been tested on large examples?  It looks to me like this
function does a linear scan starting at line 1 every time in the .i/.ii
file upwards looking for markers of interest;

Good points, I am looking into tracking line markers in the file_cache_slot.
I am currently testing a version with a simple vector for memoization.

FWIW we are using this patch in our internal gcc fork for building a pretty 
substantial C++ codebase and
have not seen build time regressions directly linked to it.
However, testing it against a fairly large .ii file that produces many warnings 
from that codebase (~400k LoC) shows 
significant speedup with the tracking version:

~/gcc-dev/bin/gcc TEST.cpp.ii  62.58s user 1.01s system 99% cpu 1:03.66 total # 
path as is
~/gcc-dev/bin/gcc TEST.cpp.ii  32.80s user 0.95s system 99% cpu 33.749 total # 
with line marker memoization

> This function is rather complicated

I agree. To be honest, looking at it after a couple of years was also not very 
pleasant.

> Something else that occurred to me: assuming the file in question has
gone through libcpp, we've already parsed the line markers, and
line_table will have a series of line map instances representing the
ranges in question.

According to some comments I found, it seems that for e.g the C front-end, it 
can happen that we start emitting diagnostics
before the line map has seen the end of the file, so it could probably only act 
as a hint?

Best
Lucas

-----Original Message-----
From: David Malcolm <dmalc...@redhat.com> 
Sent: Wednesday, 29 January 2025 16:17
To: Bader, Lucas <lucas.ba...@sap.com>; gcc-patches@gcc.gnu.org
Subject: Re: [PATCH v2] get source line for diagnostic from preprocessed file 
[PR preprocessor/79106]

On Wed, 2025-01-29 at 09:16 -0500, David Malcolm wrote:
> On Wed, 2025-01-29 at 13:05 +0000, Bader, Lucas wrote:
> > Hi,
> > 
> > as discussed, I rebased and tested my patch against current master.
> > Additionally, it now has the Signed-off-by tag.
> > Looking forward to your comments.
> > 
> > Best
> > Lucas
> 
> Thanks for the updated patch.
> 
> Various notes inline throughout below...
> 
> > 
> > ----
> > 
> > Within a compile cluster, only the preprocessed output of GCC is
> > transferred
> > to remote nodes for compilation. When GCC produces advanced
> > diagnostics
> > (with -fdiagnostics-show-caret), e.g. prints out the affected
> > source
> > line and fixit hints, it attempts to read the source file again,
> > even
> > when
> > compiling a preprocessed file (-fpreprocessed). This leads to wrong
> > diagnostics when building with a compile cluster, or, more
> > generally,
> > when changing or deleting the original source file.
> > 
> > This patch alters GCC to read from the preprocessed file instead by
> > calculating the corresponding source line. This behavior is
> > consistent
> > with clang.
> > 
> > gcc/c-family/ChangeLog:
> > 
> >     * c-opts.cc (c_common_handle_option): pass -fpreprocessed
> >     option value to global diagnostic configuration
> > 
> > gcc/ChangeLog:
> > 
> >     * diagnostic-show-locus.cc
> > (layout::calculate_x_offset_display): read line from source or
> > preprocessed
> >     file based on -fpreprocessed value
> >     (source_line::source_line): read line from source or
> > preprocessed
> >     file based on -fpreprocessed value
> >     (layout_printer::print_line): read line from source or
> > preprocessed
> >     file based on -fpreprocessed value
> > 
> >     * diagnostic.cc (diagnostic_context::initialize):
> > initialize
> > new members
> >     * diagnostic.h: new members for reading
> >     source lines from preprocessed files
> > 
> >     * input.cc (file_cache::get_source_line_preprocessed): new
> > function
> >     to read source lines from preprocessed files
> >     (test_reading_source_line_preprocessed): new test case
> >     (input_cc_tests): execute new test case
> >     * input.h (class file_cache): add new member function
> > 
> >     * opts-global.cc (read_cmdline_options): pass input
> > filename
> > to global
> >     diagnostic context
> 
> 
> Some minor ChangeLog nits:
> * you should add "PR preprocessor/79106" (without quotes) to the
> start
> of each ChangeLog entry; have a look at the existing ChangeLog
> entries
> to see how we do it
> * the entries should read as sentences, so please capitalize the
> first
> letter, and add periods
> * please wrap at 74 columns.
> 
> You can check the patch using
>   ./contrib/gcc-changelog/git_check_commit.py
> (which is run as a server-side pre-commit hook, I believe)
> 
> 
> > 
> > Signed-off-by: Lucas Bader <lucas.ba...@sap.com>
> > ---
> >  gcc/c-family/c-opts.cc       |   1 +
> >  gcc/diagnostic-show-locus.cc |  25 ++++--
> >  gcc/diagnostic.cc            |   2 +
> >  gcc/diagnostic.h             |   6 ++
> >  gcc/input.cc                 | 169
> > +++++++++++++++++++++++++++++++++++
> >  gcc/input.h                  |   2 +
> >  gcc/opts-global.cc           |   4 +
> >  7 files changed, 204 insertions(+), 5 deletions(-)
> > 
> > diff --git a/gcc/c-family/c-opts.cc b/gcc/c-family/c-opts.cc
> > index 87b231861a6..105d8fa6eff 100644
> > --- a/gcc/c-family/c-opts.cc
> > +++ b/gcc/c-family/c-opts.cc
> > @@ -534,6 +534,7 @@ c_common_handle_option (size_t scode, const
> > char
> > *arg, HOST_WIDE_INT value,
> >  
> >      case OPT_fpreprocessed:
> >        cpp_opts->preprocessed = value;
> > +      global_dc->m_is_preprocessed = value;
> >        break;
> >  
> >      case OPT_fdebug_cpp:
> > diff --git a/gcc/diagnostic-show-locus.cc b/gcc/diagnostic-show-
> > locus.cc
> > index 898efe74acf..7ac183a126f 100644
> > --- a/gcc/diagnostic-show-locus.cc
> > +++ b/gcc/diagnostic-show-locus.cc
> > @@ -1786,8 +1786,13 @@ layout::calculate_x_offset_display ()
> >        return;
> >      }
> >  
> > -  const char_span line = m_file_cache.get_source_line
> > (m_exploc.file,
> > -                                                 
> > m_exploc.line);
> > +  char_span line (NULL, 0);
> > +  if (global_dc->m_main_input_file_path != NULL && global_dc-
> > > m_is_preprocessed)
> > +    line = m_file_cache.get_source_line_preprocessed (
> > +     m_exploc.file, global_dc->m_main_input_file_path,
> > m_exploc.line);
> > +  else
> > +    line = m_file_cache.get_source_line (m_exploc.file,
> > m_exploc.line);
> > +
> 
> Here and in two other places below in this file the patch adds logic
> of the form:
> 
>   if (global_dc->main_input_file_path != NULL && global_dc-
> >is_preprocessed)
>     line = location_get_source_line_preprocessed (some file,
>                                                   global_dc-
> >main_input_file_path,
>                                                 some line);
>   else
>     line = location_get_source_line (some file, some line);
> 
> Please introduce a subroutine for this to avoid the repetition,
> something like:
> 
> static char_span
> get_line_maybe_preprocessed (const char *filename, int line_num);
> 
> I wonder if this logic should be moved into the file_cache itself,
> but
> let's leave that for now.
> 
> 
> >    if (!line)
> >      {
> >        /* Nothing to do, we couldn't find the source line.  */
> > @@ -2780,7 +2785,12 @@ public:
> >  
> >  source_line::source_line (file_cache &fc, const char *filename,
> > int
> > line)
> >  {
> > -  char_span span = fc.get_source_line (filename, line);
> > +  char_span span (NULL, 0);
> > +  if (global_dc->m_main_input_file_path != NULL && global_dc-
> > > m_is_preprocessed)
> > +    span = fc.get_source_line_preprocessed (
> > +   filename, global_dc->m_main_input_file_path, line);
> > +  else
> > +    span = fc.get_source_line (filename, line);
> 
> Please use the subroutine here.
> 
> 
> >    chars = span.get_buffer ();
> >    width = span.length ();
> >  }
> > @@ -3132,8 +3142,13 @@ layout_printer::show_ruler (int max_column)
> >  void
> >  layout_printer::print_line (linenum_type row)
> >  {
> > -  char_span line
> > -    = m_layout.m_file_cache.get_source_line
> > (m_layout.m_exploc.file,
> > row);
> > +  char_span line (NULL, 0);
> > +  if (global_dc->m_main_input_file_path != NULL && global_dc-
> > > m_is_preprocessed)
> > +    line = m_layout.m_file_cache.get_source_line_preprocessed (
> > +   m_layout.m_exploc.file, global_dc->m_main_input_file_path,
> > row);
> > +  else
> > +    line = m_layout.m_file_cache.get_source_line
> > (m_layout.m_exploc.file, row);
> > +
> 
> Again, please use the subroutine here.
> 
> 
> >    if (!line)
> >      return;
> >  
> > diff --git a/gcc/diagnostic.cc b/gcc/diagnostic.cc
> > index c2f6714c24a..c8a93d407ad 100644
> > --- a/gcc/diagnostic.cc
> > +++ b/gcc/diagnostic.cc
> > @@ -290,6 +290,8 @@ diagnostic_context::initialize (int n_opts)
> >    m_diagrams.m_theme = nullptr;
> >    m_original_argv = nullptr;
> >    m_diagnostic_buffer = nullptr;
> > +  m_main_input_file_path = nullptr;
> > +  m_is_preprocessed = false;
> >  
> >    enum diagnostic_text_art_charset text_art_charset
> >      = DIAGNOSTICS_TEXT_ART_CHARSET_EMOJI;
> > diff --git a/gcc/diagnostic.h b/gcc/diagnostic.h
> > index 202760b2f85..60a7900eb23 100644
> > --- a/gcc/diagnostic.h
> > +++ b/gcc/diagnostic.h
> > @@ -850,6 +850,12 @@ public:
> >    /* True if warnings should be given in system headers.  */
> >    bool m_warn_system_headers;
> >  
> > +  /* Original input file.  */
> > +  const char* m_main_input_file_path;
> > +
> > +  /* True if the input file is treated as preprocessed.  */
> > +  bool m_is_preprocessed;
> > +
> >  private:
> >    /* Maximum number of errors to report.  */
> >    int m_max_errors;
> > diff --git a/gcc/input.cc b/gcc/input.cc
> > index 9f3cc6651e8..c261f2c4750 100644
> > --- a/gcc/input.cc
> > +++ b/gcc/input.cc
> > @@ -1056,6 +1056,109 @@ file_cache::get_source_line (const char
> > *file_path, int line)
> >    return char_span (buffer, len);
> >  }
> >  
> > +/* Return the physical source line that corresponds to
> > SOURCE_NAME/LINE.
> > +   Read the line from the preprocessed file at FILE_PATH.
> > +   The line is not nul-terminated.  The returned pointer is only
> > +   valid until the next call of get_source_line.
> > +   Note that the line can contain several null characters,
> > +   so the returned value's length has the actual length of the
> > line.
> > +   If the function fails, a NULL char_span is returned.  */
> > +
> > +char_span
> > +file_cache::get_source_line_preprocessed (const char *source_name,
> > +                                  const char *file_path, int
> > line)
> > +{
> > +  char *buffer = NULL;
> > +  ssize_t len;
> > +
> > +  if (line == 0)
> > +    return char_span (NULL, 0);
> > +
> > +  file_cache_slot *c = lookup_or_add_file (file_path);
> > +  if (c == NULL)
> > +    return char_span (NULL, 0);
> > +
> > +  // find linemarker closest to actual line number
> > +  int linemarker_line = 0; // line no of linemarker in pp file
> > +  int linemarker_loc = 0;  // source line no indicated in
> > linemarker
> 
> This may sound silly, but please use "num" rather than "no" as an
> abbreviation for "number": to my under-caffeinated brain "no" reads
> as
> the opposite of "yes".
> 
> 
> > +  int current_line = 1;
> > +  ssize_t source_name_len = (ssize_t) strlen (source_name);
> > +  while (c->read_line_num (current_line, &buffer, &len))
> > +    {
> > +      if (len > 2 && buffer[0] == '#' && ISDIGIT (buffer[2]))
> 
> What about buffer[1]?  Does this have to be a ' '?
> 
> > +   {
> > +     // is a linemarker
> > +     bool is_match = false;
> > +     for (ssize_t i = 3; i < len && source_name_len < len - i
> > -
> > 1; ++i)
> > +       {
> > +         if (buffer[i] == '"')
> > +           {
> > +             is_match
> > +                 = memcmp (buffer + i + 1, source_name,
> > source_name_len)
> > +                   == 0;
> 
> Don't we also need a trailing '"' and a newline?
> 
> > +             break;
> > +           }
> > +       }
> > +     bool match_past_line = false;
> > +     if (is_match)
> > +       {
> > +         // we found a linemarker that matches the source
> > file
> > +         // e.g.
> > +         // # 14 "../../file.h"
> > +         // we use the last suitable marker we find
> > +         int loc = atoi (buffer + 2);
> 
> This function is rather complicated.  I think a useful simplification
> would be to split out the above logic into a subroutine, something
> like
> this:
> 
> /* Return true iff the text LINE is a line marker of the form
>    "# ([0-9]+) "(.+)"\n"
>    where the quoted string (group 2) matches SOURCE_NAME.
>    If returning true, write the line number (group 1) to *OUT_LINE. 
> */
> 
> static bool
> is_line_marker_for_file_p (char_span line,
>                            char_span source_name,
>                          int *out_line_num)
> {
>   // etc, the loop to determine is_match and the atoi
> }
> 
> and then this subroutine could be unit-tested as well.
> 
> (though despite the leading comment I wrote, I'm not sure if "line"
> should include the trailing newline)
> 
> 
> > +         if (loc <= line && line - loc <= line -
> > linemarker_loc)
> > +           {
> 
> Sorry, I confess my eyes started glazing over at this point and I
> feel
> I'm probably missing things.  I think splitting out the subroutine to
> find the match will help readability.
> 
> Has this code been tested on large examples?  It looks to me like
> this
> function does a linear scan starting at line 1 every time in the
> .i/.ii
> file upwards looking for markers of interest; with a call to
> read_line_num (current_line_num in a loop.  These calls could be non-
> trivial (see Andi's recent patches).  I haven't looked in detail at
> the
> time-complexity, but this *might* introduce O(n^2) behavior or worse.
> So we might want to have some extra data structure to track line
> markers to amortize the cost - though I'm not sure at this point.

Something else that occurred to me: assuming the file in question has
gone through libcpp, we've already parsed the line markers, and
line_table will have a series of line map instances representing the
ranges in question.  So maybe there's a way to do things using that.

Dave

Reply via email to